Scraping a large number of similar webpages – can I wrap read_html() in an apply() function in R?

By Canovice

I am scraping a website that features a large number of dropdown items on the page that change the values in the table I am scraping. The URLs are dictated by the dropdown values, and I was able to create a vector of all of the URLs I want to scrape. This vector is in the following form:

vectorsURL = c("http://www.mywebsite.com/page1/stats.html",
               "http://www.mywebsite.com/page2/stats.html",
               "http://www.mywebsite.com/page3/stats.html",
               "http://www.mywebsite.com/page4/stats.html",
               "http://www.mywebsite.com/page5/stats.html",
               "http://www.mywebsite.com/page6/stats.html")

The full vector is rather long, 25,000 URLs. My current approach for scraping all of these pages is the following, featuring a bit of piping which I’m trying to get in the habit of doing:

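library(rvest)     # read_html(), html_nodes(), html_table(), and the %>% pipe
library(magrittr)  # extract()

all_data = c()
for(i in seq_along(vectorsURL)) {
    my_URL = vectorsURL[i]
    # Read the page, keep its third <table>, and convert it to a data frame
    scraped_page = my_URL %>%
                     read_html() %>%
                     html_nodes('table') %>%
                     extract(3) %>%
                     html_table() %>%
                     as.data.frame()
    # Append this page's rows to the running result
    all_data = rbind(all_data, scraped_page)
}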
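
A side note on the loop above: calling rbind() inside the loop copies the accumulated data frame on every iteration. A minimal sketch of an alternative stores each page in a pre-sized list and binds once at the end. It assumes a hypothetical helper, here called scrape_one, that returns one data frame per URL; scrape_all and fake_scrape are likewise made-up names for illustration. This only trims the bookkeeping cost and does not reduce the time spent downloading each page.

```r
# Hypothetical wrapper (not from the original post): apply `scrape_one` to each
# URL, store the results in a pre-sized list, and bind them in a single call.
# Assumes `scrape_one` always returns a data frame with the same columns.
scrape_all <- function(urls, scrape_one) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] <- scrape_one(urls[i])
  }
  do.call(rbind, pages)
}

# Stand-in scraper for illustration only; a real one would call read_html().
fake_scrape <- function(url) {
  data.frame(url = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

demo <- scrape_all(c("page1", "page2", "page3"), fake_scrape)
```

Passing the scraper in as an argument keeps the collection logic separate from the network call, so the former can be tried out without touching the site.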

At roughly 2 seconds per page, 25,000 pages will take about 14 hours. I was wondering if this could be done faster using apply functions. I tried lapply on a subset of only 50 URLs, with the following code:

temp = vectorsURL[1:50]  # the 50-URL test subset (first 50 taken here for illustration)
b = lapply(temp, FUN = function(x) x %>% read_html() %>% html_nodes('table') %>% extract(3) %>% html_table() %>% as.data.frame())

However, this approach took ~100 seconds, which is as long as the for loop would take. Any thoughts on speeding this up would be greatly appreciated, even if it involves further parallelization in R in order to read more pages in a shorter amount of time. Thanks!
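
Since the 50-URL lapply() test took about as long as the loop would, the per-page time is most likely dominated by waiting on the network rather than by loop overhead, so one option is to fetch several pages at once. A rough sketch using the base parallel package follows. The names scrape_third_table and scrape_parallel, and the worker count of 4, are assumptions for illustration and not part of the original code; the target site may also throttle or block concurrent requests.

```r
library(parallel)

# Hypothetical helper: return the third <table> of a page as a data frame,
# or NULL if the page cannot be read or has fewer than three tables.
# Functions are namespaced so the helper also runs on fresh worker processes.
scrape_third_table <- function(url) {
  tryCatch({
    tables <- rvest::html_nodes(xml2::read_html(url), "table")
    as.data.frame(rvest::html_table(tables[[3]]))
  }, error = function(e) NULL)
}

# Hypothetical wrapper: spread the URLs over `n_workers` R processes.
scrape_parallel <- function(urls, n_workers = 4) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))          # always shut the workers down
  results <- parLapply(cl, urls, scrape_third_table)
  do.call(rbind, Filter(Negate(is.null), results))  # drop failed pages, bind once
}

# Usage (not run here, since it needs the real site):
# all_data <- scrape_parallel(vectorsURL, n_workers = 4)
```

Returning NULL for a failed page means one bad URL out of 25,000 does not abort the whole run, though the failed pages are silently dropped and would need to be retried separately.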

Source: Stack Overflow

    
