Consider the following XML example
library(xml2)
myxml <- read_xml('
<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>
')
Here I would like to get an (R or Pandas) dataframe from this XML that contains the columns name and hobby.
However, as you see, there is an alignment problem because hobby is missing in the second node and John has two hobbies.
in R, I know how to extract specific values one at a time, for instance using xml2 as follows:
myxml%>%
xml_find_all("//name") %>%
xml_text()
myxml%>%
xml_find_all("//hobby") %>%
xml_text()
but how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe as follows (note how I join with a | the two hobbies of John):
# A tibble: 2 × 3
name hobby skill
<chr> <chr> <chr>
1 John tennis|golf python
2 Robert <NA> R
In R, I would prefer a solution using xml2 and dplyr. In Python, I want to end-up with a Pandas dataframe. Also, in my xml there are many more variables I want to parse. I would like a solution that has allows the user to parse additional variables without messing too much with the code.
Thanks!
EDIT: thanks to everyone for these great solutions. All of them were really nice, with plenty of details and it was hard to pick up the best one. Thanks again!