I have used the XML package to parse both HTML and XML before, and have a rudimentary grasp of xPath. However I've been asked to consider XML data where the important bits are determined by a combination of text and attributes of the elements themselves, as well as those in related nodes. I've never done that. For example
[updated example, slightly more expansive]
<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>
I would like to extract all author names where: 1) the location element has a text value that contains "NY" 2) the author element does NOT contain a "ref" attribute; that is where ref is not present in the author tag
I will ultimately need to concatenate the extracted authors together within a given bookstore, so that my resulting data frame is one row per store. I'd like to preserve the bookstore id as an additional field in my data frame so that I can uniqely reference each store. Since only the first bokstore is in NY, results from this simple example would look something like:
1 Jane Smith John Doe Karl Pearson William Gosset
If another bookstore contained "NY" in its location, it would comprise the second row, and so forth.
Am I asking too much of R to parser under these convoluted conditions?