I would like to process a paragraph sentence by sentence, but it is in an XML format that does not mark up sentences. My input looks like this:
<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
I wish my input looked more like this:
<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
<s>Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s><s>Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s></p>
so that I could extract each sentence as a whole, like this:
<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s>
<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s>
My test code is:
from lxml import etree
if __name__ == "__main__":
  xml1 = '''<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
'''
  print(xml1)
  root = etree.XML(xml1)
  sentences_info = []
  for sentence in root:
    # I want to do more fun stuff here with the result
    sentence_text = sentence.text
    ref_ids = []
    # iter() walks all descendants, so an <xref> nested inside <sup> is found
    for reference in sentence.iter():
      if 'rid' in reference.attrib:
        ref_ids.append(reference.attrib['rid'])
    sent_par = {'reference_ids': ref_ids, 'text': sentence_text}
    sentences_info.append(sent_par)
    print(sent_par)
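
For illustration, here is a minimal sketch of the kind of transformation I have in mind, not a definitive implementation. It assumes a deliberately naive boundary rule (a `.`, `!` or `?` followed by whitespace, possibly with inline elements such as the `<sup>` citation in between), so abbreviations like "e.g. " would be split incorrectly. The names `wrap_sentences` and `SPLIT_RE` are hypothetical, and the sample paragraph is a shortened stand-in for my real input.

```python
import re
from lxml import etree

# Naive boundary rule (an assumption): ., ! or ? followed by whitespace.
SPLIT_RE = re.compile(r'(?<=[.!?])(\s+)')


def wrap_sentences(p):
    """Regroup the mixed content of paragraph `p` into <s> children, in place."""
    ns = etree.QName(p).namespace
    s_tag = '{%s}s' % ns if ns else 's'

    # Flatten the paragraph into an ordered list of strings and elements.
    items = [p.text] if p.text else []
    for child in list(p):
        items.append(child)
        if child.tail:
            items.append(child.tail)
        child.tail = None  # the tail is now tracked as its own item

    sentences = [[]]
    closed = False  # True when the latest text ended in ., ! or ?
    for item in items:
        if not isinstance(item, str):
            sentences[-1].append(item)  # e.g. a trailing <sup> citation
            continue
        text = item
        lead = re.match(r'\s+', text)
        if closed and lead:
            # Punctuation, then inline elements, then whitespace: a boundary.
            sentences[-1].append(lead.group())
            sentences.append([])
            text = text[lead.end():]
        parts = SPLIT_RE.split(text)  # [piece, whitespace, piece, ...]
        for i in range(0, len(parts), 2):
            if parts[i]:
                sentences[-1].append(parts[i])
            if i + 1 < len(parts):  # a whitespace separator follows
                sentences[-1].append(parts[i + 1])
                sentences.append([])
        closed = text.endswith(('.', '!', '?'))

    # Rebuild the paragraph: one <s> per non-empty group.
    p.text = None
    for sent in sentences:
        if not any(not isinstance(x, str) or x.strip() for x in sent):
            continue  # skip empty groups
        s = etree.SubElement(p, s_tag)
        last = None
        for x in sent:
            if isinstance(x, str):
                if last is None:
                    s.text = (s.text or '') + x
                else:
                    last.tail = (last.tail or '') + x
            else:
                s.append(x)  # lxml moves the element out of <p>
                last = x
    return p


xml = ('<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">'
       'A first step has been taken.'
       '<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> '
       'Several groups now cross-validate the predictions.'
       '<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>')
p = wrap_sentences(etree.XML(xml))
for s in p:
    print(etree.tostring(s, encoding='unicode'))
```

The sketch works on the element tree rather than on the serialized string, so text, tails and nested citations keep their order; a real sentence tokenizer could replace the regular expression if the naive rule turns out to be too crude.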