[SOLVED] Scraping links off websites – R

Issue

This content is from Stack Overflow. Question asked by bandcar.

I’m using RSelenium to get the page source from archive.org so I can scrape the links with rvest.

library(RSelenium)
library(rvest)
library(netstat)  # assumed source of free_port()

remote_driver = rsDriver(browser = 'firefox',
                         verbose = FALSE,
                         port = free_port())
rd = remote_driver$client
rd$open()
rd$navigate('https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories')
rd$maxWindowSize()

html = read_html(rd$getPageSource()[[1]])

get_links <- html %>%
  html_nodes('.categories-grid__category a') %>%
  html_attr('href') %>%
  paste0('https://web.archive.org', .)

It successfully scrapes the link of the original website, but misses the portion belonging to the archive.org website.

This is what the first example returns:

https://web.archive.orghttp://www.bjjcompsystem.com/tournaments/1869/categories/2053146

But it’s missing the unique identifier:

/web/20220913024354/

This is what the full link should look like:
https://web.archive.org/web/20220913024354/https://www.bjjcompsystem.com/tournaments/1869/categories/2053146

How do I get the full links?

How the scraped links should look:

https://web.archive.org/web/20220913024354/https://www.bjjcompsystem.com/tournaments/1869/categories/2053146

https://web.archive.org/web/20220913024425/https://www.bjjcompsystem.com/tournaments/1869/categories/2053150

https://web.archive.org/web/20220913024456/https://www.bjjcompsystem.com/tournaments/1869/categories/2053154

etc.
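
For reference, the expected links above all share one shape: the archive host, then /web/, then a timestamp, then the original URL. Below is a minimal sketch of pulling those two parts apart with base R. `parse_wayback()` is a hypothetical helper written for this illustration, and it assumes the timestamp is always 14 digits.

```r
# Hypothetical helper (not part of any package): split a Wayback Machine URL
# into its snapshot timestamp and the original URL it wraps.
# Assumption: the timestamp is exactly 14 digits (YYYYMMDDhhmmss).
parse_wayback <- function(url) {
  pattern <- "^https?://web\\.archive\\.org/web/([0-9]{14})/(.*)$"
  data.frame(
    timestamp = sub(pattern, "\\1", url),  # first capture group
    original  = sub(pattern, "\\2", url)   # second capture group
  )
}

parts <- parse_wayback(
  "https://web.archive.org/web/20220913024354/https://www.bjjcompsystem.com/tournaments/1869/categories/2053146"
)

parts$timestamp
parts$original
```

URLs that do not match the pattern are returned unchanged in both columns, so check the input first if that matters.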



Solution

I am not sure what you mean. Like this?

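library(tidyverse)
library(rvest)

# Read the archived page directly; no browser session is needed here.
"https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories" %>%
  read_html() %>%
  # Select the category links and pull out their href attributes.
  html_elements(".categories-grid__category a") %>%
  html_attr("href") %>%
  # The hrefs are root-relative (/web/<timestamp>/...), so prepend the host.
  paste0("https://web.archive.org", .)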

[1] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053146"
[2] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053150"
[3] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053154"
[4] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053158"
[5] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053162"
[6] "https://web.archive.org/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053166"
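
The two snippets in this thread see two different href shapes. Reading the page directly gives root-relative hrefs that start with /web/<timestamp>/, while the RSelenium page source in the question gave bare original URLs. Why they differ is not established here. The sketch below handles both shapes so the same pipeline works either way. `as_wayback_url()` is a hypothetical helper, the inline HTML is a stand-in for the real page, and the snapshot timestamp is taken from the listing page's URL.

```r
library(rvest)

# Hypothetical helper: turn an href scraped from an archived page into a full
# Wayback Machine URL. `snapshot` is the 14-digit timestamp of the page the
# href was scraped from.
as_wayback_url <- function(href, snapshot) {
  archive <- "https://web.archive.org"

  # Default case: a bare original URL that needs host and snapshot prefix.
  out <- paste0(archive, "/web/", snapshot, "/", href)

  # Root-relative hrefs already carry /web/<timestamp>/; only add the host.
  rewritten <- which(startsWith(href, "/web/"))
  out[rewritten] <- paste0(archive, href[rewritten])

  # Hrefs that are already full archive URLs are left untouched.
  complete <- which(startsWith(href, archive))
  out[complete] <- href[complete]

  out
}

# Stand-in HTML with one href of each shape (not the real page).
page <- minimal_html('
  <div class="categories-grid__category">
    <a href="/web/20220913022021/http://www.bjjcompsystem.com/tournaments/1869/categories/2053146">A</a>
  </div>
  <div class="categories-grid__category">
    <a href="http://www.bjjcompsystem.com/tournaments/1869/categories/2053150">B</a>
  </div>
')

links <- page %>%
  html_elements(".categories-grid__category a") %>%
  html_attr("href") %>%
  as_wayback_url(snapshot = "20220913022021")

links
```

The per-link timestamps in the question (such as 20220913024354) are presumably the capture times of each category page and cannot be derived from the listing page alone. As far as I know, requesting a link with the listing page's timestamp makes the Wayback Machine redirect to the nearest capture, but that is worth verifying on a few links.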


This question was asked on Stack Overflow by bandcar and answered by Tom Hoel. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.
