I have about 10,000 XML files with a similar structure that I want to convert to a single CSV file. Each XML file looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
    <S:Body>
        <ns7:GetStopMonitoringServiceResponse xmlns:ns3="http://www.siri.org.uk/siri" xmlns:ns4="http://www.ifopt.org.uk/acsb" xmlns:ns5="http://www.ifopt.org.uk/ifopt" xmlns:ns6="http://datex2.eu/schema/1_0/1_0" xmlns:ns7="http://new.webservice.namespace">
            <Answer>
                <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                <ns3:ProducerRef>ISR Siri Server (141.10)</ns3:ProducerRef>
                <ns3:ResponseMessageIdentifier>276480603</ns3:ResponseMessageIdentifier>
                <ns3:RequestMessageRef>0100700:1351669188:4684</ns3:RequestMessageRef>
                <ns3:Status>true</ns3:Status>
                <ns3:StopMonitoringDelivery version="IL2.71">
                    <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
                    <ns3:Status>true</ns3:Status>
                    <ns3:MonitoredStopVisit>
                        <ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
                        <ns3:ItemIdentifier>-881202701</ns3:ItemIdentifier>
                        <ns3:MonitoringRef>20902</ns3:MonitoringRef>
                        <ns3:MonitoredVehicleJourney>
                            <ns3:LineRef>23925</ns3:LineRef>
                            <ns3:DirectionRef>2</ns3:DirectionRef>
                            <ns3:FramedVehicleJourneyRef>
                                <ns3:DataFrameRef>2019-03-31</ns3:DataFrameRef>
                                <ns3:DatedVehicleJourneyRef>36962685</ns3:DatedVehicleJourneyRef>
                            </ns3:FramedVehicleJourneyRef>
                            <ns3:PublishedLineName>15</ns3:PublishedLineName>
                            <ns3:OperatorRef>15</ns3:OperatorRef>
                            <ns3:DestinationRef>26020</ns3:DestinationRef>
                            <ns3:OriginAimedDepartureTime>2019-03-31T08:35:00.000+03:00</ns3:OriginAimedDepartureTime>
                            <ns3:VehicleLocation>
                                <ns3:Longitude>34.78000259399414</ns3:Longitude>
                                <ns3:Latitude>32.042293548583984</ns3:Latitude>
                            </ns3:VehicleLocation>
                            <ns3:VehicleRef>37629301</ns3:VehicleRef>
                            <ns3:MonitoredCall>
                                <ns3:StopPointRef>20902</ns3:StopPointRef>
                                <ns3:ExpectedArrivalTime>2019-03-31T09:03:00.000+03:00</ns3:ExpectedArrivalTime>
                            </ns3:MonitoredCall>
                        </ns3:MonitoredVehicleJourney>
                    </ns3:MonitoredStopVisit>
                </ns3:StopMonitoringDelivery>
            </Answer>
        </ns7:GetStopMonitoringServiceResponse>
    </S:Body>
</S:Envelope>
The example above shows one MonitoredStopVisit nested tag, but every XML file has about 4,000 of them. A full example XML can be found here.
I want to convert all 10K files into one CSV where each record corresponds to one MonitoredStopVisit tag, so the CSV should look like this:
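Using the values from the single visit shown above (the columns follow the df_cols list in the code further down; AimedArrivalTime is empty here because that tag doesn't appear in this example):

RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime
2019-03-31T09:00:52.000+03:00,20902,23925,2,15,15,26020,2019-03-31T08:35:00.000+03:00,34.78000259399414,32.042293548583984,37629301,20902,2019-03-31T09:03:00.000+03:00,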

Currently this is my architecture (a condensed sketch of the driver follows this list):
- Split the 10K files into 8 chunks (one per core on my PC).
- Each sub-process iterates through its XML files and objectifies the XML.
- The object is then iterated, and for each element I use conditions to include/exclude data, collecting values into an array.
- When the MonitoredStopVisit tag is reached, the array is appended to a pandas DataFrame as a Series.
- When all sub-processes are done, the DataFrames are merged and saved as a CSV.
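Condensed, the driver looks roughly like this (a sketch only: the helper name process_chunk, the glob pattern, and the output file name are placeholders, not my real code):

import glob
import multiprocessing as mp
import pandas as pd

def process_chunk(paths):
    # Each sub-process parses its share of the files and stacks the
    # per-file DataFrames produced by xml_to_df (defined below).
    dfs = []
    for path in paths:
        with open(path, 'rb') as f:
            dfs.append(xml_to_df(f))
    return pd.concat(dfs, ignore_index=True)

if __name__ == '__main__':
    files = sorted(glob.glob('xml_files/*.xml'))  # placeholder location
    n_workers = 8  # one chunk per core on my PC
    chunks = [files[i::n_workers] for i in range(n_workers)]  # round-robin split
    with mp.Pool(n_workers) as pool:
        merged = pd.concat(pool.map(process_chunk, chunks), ignore_index=True)
    merged.to_csv('all_visits.csv', index=False)  # placeholder name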
This is the xml_to_df code:
import pandas as pd
from lxml import objectify

def xml_to_df(xml_file):
    # Parse the whole file into an object tree.
    xml_content = xml_file.read()
    obj = objectify.fromstring(xml_content)
    df_cols = [
        'RecordedAtTime',
        'MonitoringRef',
        'LineRef',
        'DirectionRef',
        'PublishedLineName',
        'OperatorRef',
        'DestinationRef',
        'OriginAimedDepartureTime',
        'Longitude',
        'Latitude',
        'VehicleRef',
        'StopPointRef',
        'ExpectedArrivalTime',
        'AimedArrivalTime',
        ]
    tempdf = pd.DataFrame(columns=df_cols)
    arr_of_vals = [""] * 14
    for i in obj.iter():
        # A new MonitoredStopVisit (or a Status=false element) marks the end of
        # the previous record: flush it if it has a timestamp and coordinates.
        if "MonitoredStopVisit" in i.tag or ("Status" in i.tag and "false" in str(i)):
            if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
                s = pd.Series(arr_of_vals, index=df_cols)
                # Skip exact duplicates that are already in the DataFrame.
                if tempdf[(tempdf == s).all(axis=1)].empty:
                    tempdf = tempdf.append(s, ignore_index=True)
                # Reset even when the record was a duplicate, so stale values
                # don't leak into the next record.
                arr_of_vals = [""] * 14
        elif "RecordedAtTime" in i.tag:
            arr_of_vals[0] = str(i)
        elif "MonitoringRef" in i.tag:
            arr_of_vals[1] = str(i)
        elif "LineRef" in i.tag:
            arr_of_vals[2] = str(i)
        elif "DirectionRef" in i.tag:
            arr_of_vals[3] = str(i)
        elif "PublishedLineName" in i.tag:
            arr_of_vals[4] = str(i)
        elif "OperatorRef" in i.tag:
            arr_of_vals[5] = str(i)
        elif "DestinationRef" in i.tag:
            arr_of_vals[6] = str(i)
        elif "OriginAimedDepartureTime" in i.tag:
            arr_of_vals[7] = str(i)
        elif "Longitude" in i.tag:
            arr_of_vals[8] = str(i)
        elif "Latitude" in i.tag:
            arr_of_vals[9] = str(i)
        elif "VehicleRef" in i.tag:
            arr_of_vals[10] = str(i)
        elif "StopPointRef" in i.tag:
            arr_of_vals[11] = str(i)
        elif "ExpectedArrivalTime" in i.tag:
            arr_of_vals[12] = str(i)
        elif "AimedArrivalTime" in i.tag:
            arr_of_vals[13] = str(i)
    # Flush the last record, which has no following MonitoredStopVisit tag.
    if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
        s = pd.Series(arr_of_vals, index=df_cols)
        if tempdf[(tempdf == s).all(axis=1)].empty:
            tempdf = tempdf.append(s, ignore_index=True)
    return tempdf
The problem is that for 10K files this takes about 10 hours with 8 sub-processes. When checking CPU/memory usage, I can see they are not fully utilized.
Any idea how this can be improved? My next step is threading (a sketch of what I mean is below), but maybe there are other applicable ways. Just as a note, the order of records isn't important; I can sort them later.
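For reference, this is roughly what I mean by threading: inside each worker process, overlap the file reads with parsing using a thread pool (again only a sketch, assuming concurrent.futures; process_chunk_threaded is a hypothetical name, and I haven't measured whether it actually helps):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def process_chunk_threaded(paths, workers=4):
    # Threaded variant of the per-process loop above: each thread opens
    # and parses one file with xml_to_df, and the results are concatenated.
    def parse_one(path):
        with open(path, 'rb') as f:
            return xml_to_df(f)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return pd.concat(ex.map(parse_one, paths), ignore_index=True)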