I have a heavily nested xml that I'm trying to convert to a data frame object.
attached failed attempts bellow.
input : johnny.xml file, contains the following text-
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
  <source/>
  <date/>
  <key/>
  <document>
    <id>2301222206</id>
    <infon key="tt_curatable">no</infon>
    <infon key="tt_version">1</infon>
    <infon key="tt_round">1</infon>
    <passage>
      <offset>0</offset>
      <text>Johnny likes pizza and chocolate, he lives in Italy with Emily.</text>
      <annotation id="1">
        <infon key="type">names</infon>
        <infon key="identifier">first_name</infon>
        <infon key="annotator">annotator_1</infon>
        <infon key="updated_at">2023-01-22T22:12:56Z</infon>
        <location offset="0" length="6"/>
        <text>Johnny</text>
      </annotation>
      <annotation id="3">
        <infon key="type">food</infon>
        <infon key="identifier"></infon>
        <infon key="annotator">annotator_2</infon>
        <infon key="updated_at">2023-01-22T22:13:51Z</infon>
        <location offset="13" length="19"/>
        <text>pizza and chocolate</text>
      </annotation>
      <annotation id="4">
        <infon key="type">location</infon>
        <infon key="identifier">europe</infon>
        <infon key="annotator">annotator_2</infon>
        <infon key="updated_at">2023-01-22T22:14:05Z</infon>
        <location offset="46" length="5"/>
        <text>Italy</text>
      </annotation>
      <annotation id="2">
        <infon key="type">names</infon>
        <infon key="identifier">first_name</infon>
        <infon key="annotator">annotator_1</infon>
        <infon key="updated_at">2023-01-22T22:13:08Z</infon>
        <location offset="57" length="5"/>
        <text>Emily</text>
      </annotation>
    </passage>
  </document>
</collection>
failed attempts:
- with lxml
 
from lxml import objectify
root = objectify.parse('johnny.xml').getroot()
data=[]
for i in range(len(root.getchildren())):
    data.append([child.text for child in root.getchildren()[i].getchildren()])
df = pd.DataFrame(data)
result -
    0   1   2   3   4
0   None    None    None    None    None
1   None    None    None    None    None
2   None    None    None    None    None
3   2301222206  no  1   1   None
- this solution using recursive function (2nd answer) result -
 
id  infon-1 infon-2 infon-3 infon-key-1 infon-key-2 infon-key-3 passage-offset  passage-text    passage-annotation-id-1 ... passage-annotation-location-offset-3    passage-annotation-location-offset-4    passage-annotation-location-length-1    passage-annotation-location-length-2    passage-annotation-location-length-3    passage-annotation-location-length-4    passage-annotation-text-1   passage-annotation-text-2   passage-annotation-text-3   passage-annotation-text-4
0   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   2301222206  no  1   1   tt_curatable    tt_version  tt_round    0   Johnny likes pizza and chocolate, he lives in Italy with Emily. 1   ... 46  57  6   19  5   5   Johnny  pizza and chocolate Italy   Emily
    4 rows × 56 columns
- with same question, first answer using pandas_read_xml library
 
import pandas_read_xml as pdx
p2 = 'Johnny.xml'
df = pdx.read_xml(p2, ['collection'])
df = pdx.fully_flatten(df)
df
result generated 47 rows, again was not what I was looking for.
- also tried with beautifulsoup as suggested here
 
Thank you!
