My objective is to use the tm package (library(tm)) on a fairly large Word document. The document has sensible typography: h1 for the main sections, plus some h2 and h3 subheadings. I want to compare and text mine each section, meaning the text below each h1. The subheadings are of little importance, so they can be included or excluded.
My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.
library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file (read_html() replaces the deprecated html())
file <- rvest::read_html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')
# attempt: select <p> elements that are direct children of an <h1>
nodes <- file %>%
  rvest::html_nodes("h1>p") %>%
  rvest::html_text()
I can extract all the <p> elements with html_nodes("p"), but that's just one big soup. I need to analyze each h1 section separately.
The best result would probably be a list with one vector of <p> texts for each h1 heading. Maybe a loop with something like for (i in 1:length(html_nodes(file, "h1"))) (html_children(html_nodes(file, "h1")[i])), but that does not work.
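To make the target structure concrete, here is a rough sketch of the kind of result I am after. It assumes the <p> elements are siblings of the <h1> elements rather than children, and it uses a tiny inline HTML string (made up for illustration) instead of the real export. The idea is to select headings and paragraphs together in document order and use a running count of h1 elements as the section index; this is one possible approach, not necessarily the best one.

```r
library(rvest)

# Hypothetical stand-in for the exported Word document
doc <- read_html("<html><body>
  <h1>First</h1><p>a</p><h2>Sub</h2><p>b</p>
  <h1>Second</h1><p>c</p>
</body></html>")

# Headings and paragraphs together, in document order
nodes <- html_nodes(doc, "h1, p")
is_h1 <- html_name(nodes) == "h1"
texts <- html_text(nodes)

# Running count of h1 elements seen so far = section index of each node
section <- cumsum(is_h1)

# Keep only paragraphs that follow at least one h1
keep <- !is_h1 & section > 0

# One character vector of paragraph texts per h1;
# the factor levels keep h1 sections that have no paragraphs
paragraphs <- split(
  texts[keep],
  factor(section[keep], levels = seq_len(sum(is_h1)))
)
names(paragraphs) <- texts[is_h1]
```

Here paragraphs is a named list with one element per h1, and paragraphs below h2 or h3 subheadings end up in the enclosing h1 section. Each element could then be collapsed with paste() and passed to tm via VectorSource().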
Bonus if there is a way to tidy Word's HTML from within rvest.