Manipulate strings from web-scraped data

Question

I am trying to scrape data from a webpage and am having trouble manipulating the resulting strings. The page is written in French, and I want the data from the table at the bottom of the page. In French, the thousands separator is either a period or a space; this page uses spaces.

Here is my code to scrape the values in the second column:

library(rvest)

link <- read_html("http://perspective.usherbrooke.ca/bilan/servlet/BMTendanceStatPays?langue=fr&codePays=NOR&codeTheme=1&codeStat=SP.POP.TOTL")

link %>%
   html_nodes(".tableauBarreDroite") %>%
   html_text() -> pop

head(pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

The values in the pop vector contain the expected spaces along with an unexpected Â. I tried the following to remove the spaces:

new.pop <- gsub(pattern = " ", replacement = "", x = pop)

head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

The spaces are still present in the new.pop variable. I also tried to remove newlines instead:

new.pop <- gsub(pattern = "\n", replacement = "", x = pop)

head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

As you can see, the spaces are not going away. Do you have any idea how I can remove the unwanted characters and convert the pop vector into a numeric vector?

Looks like a language/locale issue: Take a look at http://stackoverflow.com/questions/13575180/how-to-change-language-settings-in-r to get started. — Norbert, Nov 13 '15 at 05:06

score 1 · Accepted Answer · answered Nov 13 '15 at 10:51


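Just a tip: match whitespace with \\s rather than a literal space:

gsub(pattern="\\s",replacement="",x=pop)

or, to replace each whitespace character together with the character before it:

gsub(pattern=".\\s",replacement="@",x=pop)

A literal " " only matches an ordinary space, whereas \\s matches any whitespace character.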
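
One plausible reason the literal-space pattern finds nothing is that the separator is a no-break space (U+00A0) rather than an ordinary space (U+0020); a stray Â in front of it is a typical symptom of UTF-8 text decoded as Latin-1. The sketch below assumes that is what the page returns, and the sample string is made up to mimic one scraped value. Whether `\\s` matches U+00A0 can depend on the platform and regex engine, so the character is matched explicitly here.

```r
# Hypothetical sample mimicking one scraped value: the separators are
# no-break spaces (U+00A0), which print like ordinary spaces.
x <- "3\u00a0581\u00a0239"

# A literal ordinary space (U+0020) finds nothing to replace.
unchanged <- gsub(" ", "", x, fixed = TRUE)

# Matching the no-break space explicitly removes the separators.
cleaned <- gsub("\u00a0", "", x, fixed = TRUE)

nchar(unchanged)  # still 9 characters
nchar(cleaned)    # 7 characters, digits only
```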

Best,

Robert


Róbert Herczeg


It works! Thank you very much. `\s` is the regular expression for whitespace, right? What about `@`? – SavedByJESUS Nov 13 '15 at 13:09
Also, how could I remove both the `whitespace` and the `Â` character in the same `gsub` call? I tried `gsub("[\\sÂ]", "", pop)`, but it did not work. It only removed the `Â` character. – SavedByJESUS Nov 13 '15 at 14:32

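This should handle both the Â and the whitespace: gsub(pattern=".\\s",replacement="@",x=pop) replaces each whitespace character and the character before it with @; use replacement="" to remove them instead. – Róbert Herczeg Nov 13 '15 at 20:08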
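
For the follow-up about removing the Â and the whitespace in a single call and ending up with numbers, one option is to drop every character that is not a digit and then convert. This is a minimal sketch, not the answerer's method: the sample vector is hypothetical, built to mimic the scraped output under the assumption that each separator is Â (U+00C2) followed by a no-break space (U+00A0). It also assumes the values are whole numbers, since any decimal mark would be stripped too.

```r
# Hypothetical sample mimicking the scraped values: each thousands
# separator is "Â" (U+00C2) followed by a no-break space (U+00A0).
pop <- c("3\u00c2\u00a0581\u00c2\u00a0239",
         "3\u00c2\u00a0609\u00c2\u00a0800")

# Remove everything that is not an ASCII digit, whatever the separator
# characters turn out to be, then convert to numeric.
pop_num <- as.numeric(gsub("[^0-9]", "", pop))

print(pop_num)  # expected: 3581239 3609800
```

Because the pattern keeps only digits, it does not depend on how a particular locale or regex engine classifies the separator.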
