I have a memory problem when parsing a large XML file.
The file looks like this (just the first few rows):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
  <cmData type="actual">
    <header>
      <log dateTime="2019-02-05T19:00:18" action="created" appInfo="ActualExporter">InternalValues are used</log>
    </header>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M-1" id="366">
      <p name="linkedMrsiteDN">PL/TE-2</p>
      <p name="name">Name of street</p>
      <list name="PiOptions">
        <p>0</p>
        <p>5</p>
        <p>2</p>
        <p>6</p>
        <p>7</p>
        <p>3</p>
        <p>9</p>
        <p>10</p>
      </list>
      <p name="btsName">4251</p>
      <p name="spareInUse">1</p>
    </managedObject>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M10" id="958078">
      <p name="linkedMrsiteDN">PLMN-PLMN/MRSITE-138</p>
      <p name="name">Street 2</p>
      <p name="btsName">748</p>
      <p name="spareInUse">3</p>
    </managedObject>
    <managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M21" id="1482118">
      <p name="name">Stree 3</p>
      <p name="btsName">529</p>
      <p name="spareInUse">4</p>
    </managedObject>
  </cmData>
</raml>
I am using the xml.etree.ElementTree parser, but with a file over 4 GB and 32 GB of RAM on the machine, I run out of memory. The code I'm using:
def parse_xml(data, string_in, string_out):
    """
    :param data: raw XML file that needs to be processed and parsed
    :param string_in: string that should exist in the distinguished name
    :param string_out: string that should not exist in the distinguished name
    string_in and string_out together filter the level of parsing (site or cell)
    :return: dictionary with all unnecessary objects for selected technology
    """
    version_dict = {}
    for child in data:
        for grandchild in child:
            if isinstance(grandchild.get('distName'), str) and string_in in grandchild.get('distName') and string_out not in grandchild.get('distName'):
                inner_dict = {}
                inner_dict.update({'class': grandchild.get('class')})
                inner_dict.update({'version': grandchild.get('version')})
                for grandgrandchild in grandchild:
                    if grandgrandchild.tag == '{raml20.xsd}p':
                        inner_dict.update({grandgrandchild.get('name'): grandgrandchild.text})
                    elif grandgrandchild.tag == '{raml20.xsd}list':
                        p_lista = []
                        for gggchild in grandgrandchild:
                            if gggchild.tag == '{raml20.xsd}p':
                                p_lista.append(gggchild.text)
                            inner_dict.update({grandgrandchild.get('name'): p_lista})
                            if gggchild.tag == '{raml20.xsd}item':
                                for gdchild in gggchild:
                                    inner_dict.update({gdchild.get('name'): gdchild.text})
                # Register the object once, after all of its children have been read
                version_dict.update({grandchild.get('distName'): inner_dict})
    return version_dict
I have tried iterparse with root.clear(), but nothing really helps. I have heard that DOM parsers are the slower ones, but SAX gives me an error:
ValueError: unknown url type: '/development/data/raml20.dtd'
I'm not sure why. If anyone has a suggestion on how to improve the approach and its performance, I would be really thankful. If there is a need for bigger XML samples, I am willing to provide them.
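For reference, a streaming attempt with iterparse can be sketched roughly as follows. This is a minimal illustration under assumptions, not the exact code I ran: the name iter_managed_objects is made up for this example, each managedObject is cleared as soon as its end tag has been seen, and the item entries inside list elements are left out for brevity.

```python
import xml.etree.ElementTree as ET

NS = '{raml20.xsd}'  # namespace taken from the xmlns attribute of <raml>


def iter_managed_objects(source, string_in, string_out):
    """Yield (distName, record) pairs, one managedObject at a time.

    Hypothetical helper for illustration; `source` is a path or a
    binary file object.
    """
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != NS + 'managedObject':
            continue
        dist_name = elem.get('distName')
        if dist_name and string_in in dist_name and string_out not in dist_name:
            record = {'class': elem.get('class'), 'version': elem.get('version')}
            for child in elem:
                if child.tag == NS + 'p':
                    record[child.get('name')] = child.text
                elif child.tag == NS + 'list':
                    # Only direct <p> children; <item> entries are skipped here
                    record[child.get('name')] = [p.text for p in child.findall(NS + 'p')]
            yield dist_name, record
        # Drop the children, text and attributes of the finished element
        # so the parsed tree does not keep every managedObject alive
        elem.clear()
```

It would be consumed with something like `for dist_name, record in iter_managed_objects('/development/file.xml', 'MRBTS', 'LNCEL'):`, where the two filter strings are placeholders. The cleared (now empty) elements stay attached to cmData, so memory still grows slowly with the number of objects.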
Thanks in advance.
EDIT:
Code I tried after the first answer:
import xml.etree.ElementTree as ET
def parse_item(d):
    """Parse one <managedObject> snippet into {distName: {name: text, ...}}."""
    # Wrap the snippet so that it has a single root element
    a = '<root>' + d + '</root>'
    tree = ET.fromstring(a)
    outer_dict_yield = {}
    for elem in tree:
        inner_dict_yield = {}
        for el in elem:
            if isinstance(el.get('name'), str):
                inner_dict_yield.update({el.get('name'): el.text})
        # The original wrote to an undefined name (inner_dict) here
        inner_dict_yield.update({'version': elem.get('version')})
        outer_dict_yield.update({elem.get('distName'): inner_dict_yield})
    return outer_dict_yield
def read_a_line(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data
min_data = ""
inside = False
outer_main = {}
counter = 1
# Use a context manager so the file is closed when the loop finishes
with open('/development/file.xml') as f:
    for line in read_a_line(f):
        if '<managedObject' in line:
            inside = True
        if inside:
            min_data += line
        if '</managedObject' in line:
            inside = False
            a = parse_item(min_data)
            counter = counter + 1
            outer_main.update({counter: a})
            min_data = ''