I have a 1 GB XML file that contains some invalid characters, such as an unescaped '&'. I want to parse it in Python. To do this, I used ElementTree as below:
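import xml.etree.ElementTree as ElementTree  # cElementTree was removed in Python 3.9

def main():
    # Stream the file instead of loading all of it into memory.
    context = ElementTree.iterparse('newscor.xml', events=("start", "end"))
    context = iter(context)
    # The first event is the "start" of the root element.
    event, root = next(context)
    for event, elem in context:
        # elem.text is only guaranteed to be complete at the "end" event.
        if event == "end" and elem.tag == 'group':
            print(elem.text)
            # Drop already-processed children so memory use stays flat.
            root.clear()

main()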
But it raises the following error at the line "for event, elem in context:":
xml.etree.ElementTree.ParseError: not well-formed (invalid token)
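The error can be reproduced on a tiny input, assuming the problem in my file really is a bare '&'. The sample string and the try_parse helper below are made up for illustration and are not taken from newscor.xml:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: a bare '&' inside a <group> element.
BAD_XML = b"<root><group>salt & pepper</group></root>"

def try_parse(data):
    """Return the ParseError raised while streaming data, or None if it parses."""
    try:
        for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
            pass
    except ET.ParseError as exc:
        return exc
    return None

error = try_parse(BAD_XML)
print(error)
```

The same input with the ampersand written as &amp; parses without an error, so the parser itself seems fine and only the unescaped character is the problem.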
To handle this error, I tried to use lxml with recover=True for the parser, as described in this link. However, as far as I can tell, iterparse() in lxml does not take a parser argument.
Therefore, I also tried the SAX approach from this solution, but I don't know where to call the escape method.
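For reference, the escape method I mean is xml.sax.saxutils.escape from the standard library. Here is a small sketch of what it does, with made-up sample strings. It gives the result I want when it only sees the text content, but when applied to a whole line of the file it escapes the tags as well, which is why I am unsure where it belongs:

```python
from xml.sax.saxutils import escape

# Hypothetical samples, not taken from the real file.
text_only = "salt & pepper"
whole_line = "<group>salt & pepper</group>"

escaped_text = escape(text_only)    # only the '&' needs escaping here
escaped_line = escape(whole_line)   # '<' and '>' of the tags are escaped too

print(escaped_text)  # salt &amp; pepper
print(escaped_line)  # &lt;group&gt;salt &amp; pepper&lt;/group&gt;
```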
How can I skip or repair the invalid characters and still parse this large file?
 
    