I have been fighting with this for quite a long time and cannot get it to work, so I am posting here. I'm not an advanced R user, but I'm learning and slowly making progress. I have not found an example on Stack Overflow that I could adapt to this; the examples I found seem to have a different structure, with no need to look up a higher-level attribute for each node. At least that is how I understand the difference now. The question is similar to this, but the file structure is different. For now I basically used this example.
Let's say I have a large number of small XML files with the structure shown below. They have names like file1.xml, file2.xml and so on. So file1.xml would be:
<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1">
<WORD>word1</WORD>
<WORD>word2</WORD>
<WORD>word3</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2">
<WORD>word4</WORD>
<WORD>word5</WORD>
<WORD>word6</WORD>
</SUBNODE>
</NODE>
And then file2.xml would be:
<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person3">
<WORD>word7</WORD>
<WORD>word8</WORD>
<WORD>word9</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person4">
<WORD>word10</WORD>
<WORD>word11</WORD>
<WORD>word12</WORD>
</SUBNODE>
</NODE>
And I would like to turn these into a data frame like this:
Filename   Speaker   Word
file1      person1   word1
file1      person1   word2
file1      person1   word3
file1      person2   word4
file1      person2   word5
file1      person2   word6
file2      person3   word7
file2      person3   word8
file2      person3   word9
file2      person4   word10
file2      person4   word11
file2      person4   word12
I can get the listing of all the words into one data frame with this:
library(XML)
library(plyr)
# pattern is a regular expression, not a glob
xmlfiles <- list.files(pattern = "\\.xml$")
dat <- ldply(seq_along(xmlfiles), function(i){
    doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
    Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
    return(data.frame(Word))
})
The content of "dat" is now a word list, as it should be. But no matter what I try, I cannot get any other data added to it. I have tried adding things like:
xmlfiles <- list.files(pattern = "\\.xml$")
dat <- ldply(seq_along(xmlfiles), function(i){
    doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
    Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
    Speaker <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']", xmlGetAttr, "SPEAKER")        
    return(data.frame(Word, Speaker))
})
But then the data frame is not correct, as it doesn't associate the right speaker with the right word:
Word    Speaker
word1   person1
word2   person2
word3   person1
word4   person2
word5   person1
word6   person2
word7   person3
word8   person4
word9   person3
word10  person4
word11  person3
word12  person4
Then I also frequently get errors like:
Error in UseMethod("xmlValue") : 
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"
Or I get an error saying that the arguments have different lengths, which they of course do, as there are fewer speakers than words. I have tried many things, but I have posted only my "most successful" approaches here. I understand that I need a function that matches each word with the SPEAKER attribute of its parent node; just extracting words and speakers into their own vectors doesn't help. I guess it's just luck that the second attempt runs at all in this example: the number of words happens to be a multiple of the number of speakers, so the shorter vector is recycled to produce the data frame above.
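To make the matching idea concrete, here is a minimal sketch, assuming the XML package and an inline copy of file1.xml so that it runs without any files on disk. It selects the WORD nodes first and then walks up from each one with xmlParent() to read SPEAKER, rather than extracting two independent vectors. The names xml_text, word_nodes and words_df are made up for illustration, and this is only one possible way to do it.

```r
library(XML)

# Inline copy of file1.xml (assumption: same structure as the real files)
xml_text <- '<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1">
<WORD>word1</WORD>
<WORD>word2</WORD>
<WORD>word3</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2">
<WORD>word4</WORD>
<WORD>word5</WORD>
<WORD>word6</WORD>
</SUBNODE>
</NODE>'

doc <- xmlParse(xml_text, asText = TRUE)

# One list entry per WORD node, in document order
word_nodes <- getNodeSet(doc, "//SUBNODE[@TYPE='WORDS']/WORD")

# For each WORD, take its own text and the SPEAKER attribute of its parent,
# so both columns always have one value per word
words_df <- data.frame(
  Speaker = vapply(word_nodes,
                   function(n) xmlGetAttr(xmlParent(n), "SPEAKER",
                                          default = NA_character_),
                   character(1)),
  Word    = vapply(word_nodes, xmlValue, character(1)),
  stringsAsFactors = FALSE
)

print(words_df)
```

Because both columns are derived from the same list of WORD nodes, they cannot differ in length, which should avoid both the recycling problem and the "different lengths" error.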
And then I would still need to get the filenames into one column, as they contain information that is not inside the XML files themselves. This is the least important aspect of my question, though. The actual files I work with are much more complex, which is why the example contains seemingly unnecessary structures like SUBNODE TYPE.
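For the filename column, a rough sketch under the same assumptions (XML package, files shaped like the examples above) could wrap the per-file extraction in a helper and repeat the file name once per word. The helper read_words and the temporary demo directory are hypothetical and exist only to make the example runnable; with real data, xmlfiles would come from the working directory instead.

```r
library(XML)

# Hypothetical helper: one row per WORD, speaker taken from the parent SUBNODE
read_words <- function(path) {
  doc <- xmlParse(path)
  word_nodes <- getNodeSet(doc, "//SUBNODE[@TYPE='WORDS']/WORD")
  data.frame(
    # Repeat the bare file name (no directory, no extension) once per word
    Filename = rep(tools::file_path_sans_ext(basename(path)), length(word_nodes)),
    Speaker  = vapply(word_nodes,
                      function(n) xmlGetAttr(xmlParent(n), "SPEAKER",
                                             default = NA_character_),
                      character(1)),
    Word     = vapply(word_nodes, xmlValue, character(1)),
    stringsAsFactors = FALSE
  )
}

# Demo only: write two shortened sample files into a fresh temporary directory
demo_dir <- tempfile("xml_demo")
dir.create(demo_dir)
writeLines('<NODE><SUBNODE TYPE="WORDS" SPEAKER="person1"><WORD>word1</WORD><WORD>word2</WORD></SUBNODE></NODE>',
           file.path(demo_dir, "file1.xml"))
writeLines('<NODE><SUBNODE TYPE="WORDS" SPEAKER="person3"><WORD>word7</WORD></SUBNODE><SUBNODE TYPE="WORDS" SPEAKER="person4"><WORD>word10</WORD></SUBNODE></NODE>',
           file.path(demo_dir, "file2.xml"))

# pattern is a regular expression; full.names keeps the directory in the path
xmlfiles <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)

# Stack the per-file data frames into one
dat <- do.call(rbind, lapply(xmlfiles, read_words))
print(dat)
```

Using rep() for the Filename column means a file with no matching WORD nodes contributes zero rows instead of causing a length mismatch.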
Thank you for your help!