I have a 20 GB bz2-compressed XML file. The format is like this:
<doc id="1" url="https://www.somepage.com" title="some page">
text text text ....
</doc>
I need to convert it to a TSV file in this format:
id<tab>url<tab>title<tab>processed_texts
What is the most efficient way to do this in Python and in Java, and how do the two differ in memory use and speed? Basically, I want to do this:
read bz2 file
read the xml file element by element
for each element
    retrieve id, url, title and text
    print_to_file(id<tab>url<tab>title<tab>process(text))
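On the Python side, the steps above can be sketched with the standard library only. This is a rough sketch, not a benchmarked solution. It assumes the file is a bare sequence of <doc> elements with no XML declaration and no enclosing root (as in the sample), so it wraps the stream in a synthetic <root>. It also assumes the text is well-formed XML, with characters such as & and < escaped. The names iter_docs, bz2_xml_to_tsv and process are made up for illustration, and process is only a placeholder that collapses whitespace.

```python
import bz2
import xml.etree.ElementTree as ET


def process(text):
    # Placeholder for the real processing step (hypothetical): collapse all
    # whitespace so the text fits on one TSV line.
    return " ".join(text.split())


def iter_docs(src_path, chunk_size=1 << 20):
    """Yield (id, url, title, text) for each <doc>, one at a time."""
    parser = ET.XMLPullParser(events=("start", "end"))
    # Assumption: the input has no single root element, so add a synthetic one.
    parser.feed(b"<root>")
    root = None
    with bz2.open(src_path, "rb") as src:  # decompresses incrementally
        while True:
            chunk = src.read(chunk_size)
            if chunk:
                parser.feed(chunk)
            else:
                parser.feed(b"</root>")
                parser.close()
            for event, elem in parser.read_events():
                if event == "start":
                    if root is None:
                        root = elem  # first element seen is the synthetic root
                elif elem.tag == "doc":
                    yield (elem.get("id", ""), elem.get("url", ""),
                           elem.get("title", ""), elem.text or "")
                    root.clear()  # drop finished elements so memory stays flat
            if not chunk:
                break


def bz2_xml_to_tsv(src_path, dst_path):
    """Write one TSV row per <doc> and return the number of rows written."""
    rows = 0
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for doc_id, url, title, text in iter_docs(src_path):
            dst.write(f"{doc_id}\t{url}\t{title}\t{process(text)}\n")
            rows += 1
    return rows
```

Calling bz2_xml_to_tsv("dump.xml.bz2", "dump.tsv") (file names are examples) should hold only about one decompressed chunk plus the current element in memory at a time, because each finished <doc> is cleared from the tree right after it is written.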
Thanks for your answers in advance.
UPDATE1 (based on @Andreas's suggestions):
XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader xmlReader = factory.createXMLStreamReader(in);
xmlReader.nextTag();
if (!xmlReader.getLocalName().equals("doc")) {
    xmlReader.nextTag();
}
String id      = xmlReader.getAttributeValue(null, "id");
String url     = xmlReader.getAttributeValue(null, "url");
String title   = xmlReader.getAttributeValue(null, "title");
String content = xmlReader.getElementText();
out.println(id + '\t' + content);
The problem is that I only get the first element.
UPDATE2 (I ended up doing it using regex):
if (str.startsWith("<doc")) {
    id = str.split("id")[1].substring(2).split("\"")[0];
    url = str.split("url")[1].substring(2).split("\"")[0];
    title = str.split("title")[1].substring(2).split("\"")[0];
} else if (str.startsWith("</doc")) {
    out.println(id + '\t' + content);
    content = "";
} else {
    content = content + " " + str;
}
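For comparison, here is roughly how the same line-by-line approach might look in Python. It is only a sketch under the same assumptions as the snippet above: every <doc ...> opening tag and every </doc> closing tag sits on its own line. The names ATTR_RE and docs_from_lines are made up for illustration. Attribute values are taken as-is, so entities such as &amp; are not unescaped.

```python
import re

# Matches name="value" pairs on the opening <doc ...> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def docs_from_lines(lines):
    """Yield (id, url, title, text) tuples from an iterable of text lines."""
    attrs, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith("<doc"):
            # Parse all attributes at once; a URL containing "id" or "title"
            # does not confuse this, unlike splitting on the attribute name.
            attrs, parts = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</doc"):
            if attrs is not None:
                yield (attrs.get("id", ""), attrs.get("url", ""),
                       attrs.get("title", ""), " ".join(parts))
            attrs = None
        elif attrs is not None and line:
            # Collect lines in a list and join once at the end, which avoids
            # repeatedly copying a growing string.
            parts.append(line)
```

The lines argument can be any iterable of strings, for example the file object returned by bz2.open(path, "rt", encoding="utf-8"), which yields lines lazily instead of loading the whole file.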