I'm trying to process a 4.6GB XML file with the following code:
import gc
import os
import xml.etree.ElementTree as ET

import pandas as pd
import psutil

context = ET.iterparse(file_name_data, events=("start", "end"))
in_pandcertificaat = False
pandcertificaat = {}
pandcertificaten = []
number_of_pickles = 0
for index, (event, elem) in enumerate(context):
    if event == "start" and elem.tag == "Pandcertificaat":
        in_pandcertificaat = True
        pandcertificaat = {}  # Initiate empty pandcertificaat.
        continue
    elif event == "end" and elem.tag == "Pandcertificaat":
        in_pandcertificaat = False
        pandcertificaten.append(pandcertificaat)
        continue
    elif in_pandcertificaat:
        pandcertificaat[elem.tag] = elem.text
    else:
        pass

    if index % iteration_interval_for_internal_memory_check == 0:
        print(f"index = {index:.2e}")
        process = psutil.Process(os.getpid())
        internal_memory_usage_in_mb = process.memory_info().rss / (1024 * 1024)
        print(f"Memory usage = {internal_memory_usage_in_mb:.2f} * MB.")
        if internal_memory_usage_in_mb > internal_memory_usage_limit_for_splitting_data_in_mb:
            df = pd.DataFrame(pandcertificaten)
            path_temporary_storage_data_frame = f"{base_path_temporary_storage_data_frame}{number_of_pickles}.{file_name_extension_pickle}"
            df.to_pickle(path_temporary_storage_data_frame)
            print(f"Intermediately saving data frame to {path_temporary_storage_data_frame} to save internal memory.")
            number_of_pickles += 1
            pandcertificaten.clear()
            gc.collect()
As you can see, I try to save RAM by periodically writing the accumulated records to pickle files on disk, but the RAM usage still keeps increasing, even after adding gc.collect() to force garbage collection.
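For completeness, this is how I intend to recombine the intermediate pickles afterwards. A minimal sketch, assuming the `data_frame_pickles/` directory and `.pickle` extension seen in the log output below; `load_all_pickles` is a hypothetical helper name:

```python
import glob

import pandas as pd


def load_all_pickles(pattern="data_frame_pickles/*.pickle"):
    """Read every intermediate pickle matching `pattern` (in sorted
    order) and concatenate them into one DataFrame."""
    frames = [pd.read_pickle(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```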
This is an example of the output I'm getting:
index = 3.70e+07
Memory usage = 2876.80 * MB.
Intermediately saving data frame to data_frame_pickles/26.pickle to save internal memory.
index = 3.80e+07
Memory usage = 2946.93 * MB.
Intermediately saving data frame to data_frame_pickles/27.pickle to save internal memory.
index = 3.90e+07
Memory usage = 3017.31 * MB.
Intermediately saving data frame to data_frame_pickles/28.pickle to save internal memory.
What am I doing wrong?
UPDATE 2023-03-17, 14:37.
The problem just got weirder: if I comment out everything inside the for loop, the RAM usage still keeps increasing over time, so I believe the problem lies in iterparse itself. The out-of-RAM problem occurs with both lxml and xml.etree.ElementTree. I have not tried the XMLPullParser yet, as suggested by @Hermann12.
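If I understand the suggestion correctly, the XMLPullParser approach would look roughly like the sketch below: feed the file in chunks, pull completed elements, and clear each record after use so the partially built tree does not keep growing. The tag name comes from my snippet above; the chunk size and the `iter_records` helper name are my own assumptions:

```python
import xml.etree.ElementTree as ET


def iter_records(path, tag="Pandcertificaat", chunk_size=64 * 1024):
    """Yield one dict per completed <tag> element, streaming the file
    in chunks so the whole document is never held in memory at once."""
    parser = ET.XMLPullParser(events=("end",))
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            parser.feed(chunk)
            for event, elem in parser.read_events():
                if elem.tag == tag:
                    # Collect the direct children of the record element.
                    yield {child.tag: child.text for child in elem}
                    # Drop the element's contents so memory can be reclaimed.
                    elem.clear()
```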