I am new to R. I have downloaded the XML file containing all BioProjects from NCBI; the file is 1 GB in size. I started with this:
> library(XML)
> setwd("C:/Users/USER/Desktop/")
> xmlfile = xmlParse("bioproject.xml")
> root = xmlRoot(xmlfile)
> xmlName(root)
[1] "PackageSet"
> xmlSize(root)
[1] 357935
So the root node has 357935 children, i.e. the file contains 357935 projects. Here is project 34:
> root[[34]]
<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA44" archive="NCBI" id="44"/>
      </ProjectID>
      <ProjectDescr>
        <Name>Bartonella quintana str. Toulouse</Name>
        <Title>Causes bacillary angiomatosis</Title>
        <Description><P><B><I>Bartonella quintana</I> str. Toulouse</B>. <I>Bartonella quintana</I> str. Toulouse was isolated from human blood in Toulouse, France in 1993. There is evidence of extensive genome reduction in comparison to other <I>Bartonella</I> species which may be associated with the limited host range of <I>Bartonella quintana</I>.</Description>
        <ExternalLink category="Other Databases" label="GOLD">
          <URL>http://genomesonline.org/cgi-bin/GOLD/bin/GOLDCards.cgi?goldstamp=Gc00191</URL>
        </ExternalLink>
        <Publication date="2004-06-24T00:00:00Z" id="15210978" status="ePublished">
          <Reference/>
          <DbType>ePubmed</DbType>
        </Publication>
        <ProjectReleaseDate>2004-06-25T00:00:00Z</ProjectReleaseDate>
        <LocusTagPrefix assembly_id="GCA_000046685" biosample_id="SAMEA3138248">BQ</LocusTagPrefix>
      </ProjectDescr>   
      <ProjectType>
        ...
        ...
      </ProjectType>
    </Project>
    <Submission submitted="2003-03-20">
      ...
      ...
    </Submission>
    <ProjectLinks>
      ...
      ...
    </ProjectLinks>
  </Project>
</Package>
What I need is to obtain the accession in <ProjectID> (in this case, PRJNA44) for every project in the XML file, but only for projects whose <Description> (inside <ProjectDescr>) contains the text "isolated from human blood" (or just "blood", if that makes the script simpler). Alternatively, if it is simpler, instead of the ProjectID I could take the <URL> value inside <ExternalLink> within <ProjectDescr>.
I don't know how (or whether) to use the xpath function (or xpathApply or getNodeSet or xpathSApply). Thank you for the help.
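To make the goal concrete, here is a minimal sketch of the kind of query I have in mind, run against a tiny inline stand-in rather than the real file. The two sample records (the second accession is made up), the variable names xml_text, xp and ids, and the XPath expression itself are my own assumptions based on the excerpt above; I have not verified it on the full 1 GB file, where memory use may be a separate issue.

```r
library(XML)

# Tiny stand-in for bioproject.xml, mirroring the nesting shown in the excerpt.
# The second record and its accession are hypothetical, for illustration only.
xml_text <- '<PackageSet>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJNA44" archive="NCBI" id="44"/></ProjectID>
    <ProjectDescr>
      <Description>The strain was isolated from human blood in Toulouse, France in 1993.</Description>
    </ProjectDescr>
  </Project></Project></Package>
  <Package><Project><Project>
    <ProjectID><ArchiveID accession="PRJ_HYPOTHETICAL" archive="NCBI" id="0"/></ProjectID>
    <ProjectDescr>
      <Description>The strain was isolated from soil.</Description>
    </ProjectDescr>
  </Project></Project></Package>
</PackageSet>'

doc <- xmlParse(xml_text, asText = TRUE)

# Keep only <Package> nodes whose <Description> contains the phrase,
# then step down to the <ArchiveID> element of each matching package.
xp <- paste0(
  "//Package[.//ProjectDescr/Description",
  "[contains(., 'isolated from human blood')]]",
  "//ProjectID/ArchiveID"
)

# Read the accession attribute from every matching <ArchiveID> node.
ids <- xpathSApply(doc, xp, xmlGetAttr, "accession")
print(ids)
```

Note that contains() in XPath 1.0 is case-sensitive, so this would miss "Blood". Searching for just "blood" would only require changing the string literal in xp.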
 
    