My ultimate goal is to web scrape the Standings page of The Puzzled Pint for Montreal.
I think I need to scrape dynamically (e.g. with RSelenium), since the table I'm interested in sits inside an iframe: an HTML element that embeds another page and displays its content independently of its container.
Some have suggested that scraping directly from the source of these iframes is the way to go. I used the Inspector in Firefox's web developer tools to find the iframe's src=, which turns out to be a Google Sheets document.
First, use the robotstxt package to check that we're allowed to scrape it from Google Sheets:
library(robotstxt)
paths_allowed("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")
Having confirmed that scraping is allowed, I tried the RCurl package. Getting the first page is simple:
library(RCurl)
sheet <- getForm(
  "https://docs.google.com/spreadsheet/pub",
  hl = "en_US",
  key = "1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U",
  output = "csv",
  .opts = list(followlocation = TRUE, verbose = TRUE, ssl.verifypeer = FALSE)
)
df <- read.csv(textConnection(sheet))
head(df)
However, when you click any of the other Month/Year links on this Google Sheet, the gid= in the URL changes. For example, for October 2018 it becomes:
https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=1367583807
I'm not sure whether it's possible to scrape widgets with RCurl. If it is, I'd love to hear how.
So it looks like I will most likely need to use RSelenium to do this:
library(RSelenium)
# connect to a running server
remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()
# navigate to the site of interest
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")
My problem is getting the HTML for the table on this page. The following was suggested on SO, but it doesn't work for me: it doesn't return the expected output, only the Month/Year metadata from the links/elements.
library(XML)
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
I believe I need to navigate to the inner frame, but I'm not sure how to do this.
For example, when I look for the CSS selector for this table with SelectorGadget in Chrome, it warns me that the table is in an iframe and that I need to click a link to be able to select within it.
When I use this link with readHTMLTable(), I get exactly the information I want:
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=203220308")
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
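If the gid values can be collected by hand, the per-tab URL above could be built directly for each one. The following is only a sketch under that assumption: sheet_url() and read_tab() are hypothetical helper names, the two gids are the ones quoted in this post, and remDr is assumed to be an already opened RSelenium session.

```r
# Key and gids copied from the URLs in this post
sheet_key <- "1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U"
gids <- c("203220308", "1367583807")

# Hypothetical helper: build the per-tab "pubhtml/sheet" URL for a given gid
sheet_url <- function(gid, key = sheet_key) {
  sprintf(
    "https://docs.google.com/spreadsheets/d/%s/pubhtml/sheet?headers=false&gid=%s",
    key, gid
  )
}

# Hypothetical helper: navigate to one tab and parse its tables
# (assumes `remDr` is an open remoteDriver and the XML package is installed)
read_tab <- function(remDr, gid) {
  remDr$navigate(sheet_url(gid))
  doc <- XML::htmlParse(remDr$getPageSource()[[1]])
  XML::readHTMLTable(doc)
}

# Usage (needs a running Selenium server):
# tabs <- lapply(gids, function(g) read_tab(remDr, g))
```

This avoids the iframe entirely, but it only helps if every gid is known up front, which the widget does not make obvious.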
This presents a problem, because I still need RSelenium to step through the different pages/tables of the previous link (the iframe widget):
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")
To navigate through the different pages/tables, I used SelectorGadget to find the CSS selector for the links:
# find all elements/links
webElems <- remDr$findElements(using = "css", ".switcherItem")
# Select the first link (October 2018)
webElem_01 <- webElems[[1]]
Then, using TightVNC Viewer, I verified that I was highlighting the correct element and "clicked" it (in this case the October 2018 link):
webElem_01$highlightElement()
webElem_01$clickElement()
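As a cross-check that doesn't rely on the VNC viewer, the visible text of each link can be read back. This is a minimal sketch: switcher_labels() is a hypothetical helper, and it assumes each element's getElementText() returns a list whose first entry is the label.

```r
# Hypothetical helper: collect the visible label of every switcher link
# (assumes each element exposes getElementText(), as RSelenium webElements do)
switcher_labels <- function(elems) {
  vapply(elems, function(e) e$getElementText()[[1]], character(1))
}

# Usage (needs a running Selenium session):
# webElems <- remDr$findElements(using = "css", ".switcherItem")
# switcher_labels(webElems)
```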
Since I can see in TightVNC that the page changed, I assume no further steps are required before scraping. But, as mentioned, I need a way to navigate programmatically to the inner iframe of each of these pages.
UPDATE
Okay, I figured out how to navigate to the inner frame using remDr$switchToFrame(), but I can't figure out how to get back to the outer frame in order to "click" the next link and repeat the process. My current hacky attempt involves navigating back to the main page and repeating this process many times:
# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# look for table
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch to table
remDr$switchToFrame(tableElem)
# parse html
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
# how do I switch back to the outer frame?
# the remDr$goBack() command doesn't seem to do this
# workaround is to navigate back to the main page then navigate back to the second page and repeat process
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
webElems <- remDr$findElements(using = "css", ".switcherItem")
webElem_01 <- webElems[[1]]
webElem_01$clickElement()
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch to table
remDr$switchToFrame(tableElem)
# parse html
doc2 <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc2)
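One thing that may be worth trying: the RSelenium documentation for switchToFrame() says that a NULL Id switches back to the page's default content. Assuming that works for this widget, the whole process could be wrapped in a loop. This is only a sketch: scrape_all_tabs() is a hypothetical name, the fixed pause is a guess, and remDr is assumed to be an open session already on the widget page.

```r
# Hypothetical helper: click each Month/Year link, scrape the inner frame,
# then return focus to the outer page before moving on to the next link
scrape_all_tabs <- function(remDr, wait = 2) {
  n_tabs <- length(remDr$findElements(using = "css", ".switcherItem"))
  tables <- vector("list", n_tabs)
  for (i in seq_len(n_tabs)) {
    # re-locate the links on every pass in case the page was redrawn
    links <- remDr$findElements(using = "css", ".switcherItem")
    links[[i]]$clickElement()
    Sys.sleep(wait)  # crude pause so the iframe can load (assumed to be enough)
    frame <- remDr$findElement(using = "id", "pageswitcher-content")
    remDr$switchToFrame(frame)
    doc <- XML::htmlParse(remDr$getPageSource()[[1]])
    tables[[i]] <- XML::readHTMLTable(doc)
    # assumption: NULL switches back to the top-level (outer) document
    remDr$switchToFrame(NULL)
  }
  tables
}

# Usage (needs a running Selenium session on the widget page):
# all_tables <- scrape_all_tabs(remDr)
```

Re-finding the links inside the loop is deliberate, since element references obtained before a click may no longer be valid afterwards.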



