I need to scan two large txt files (both about 100 GB, 1 billion rows, several columns) and extract a certain column from each (writing it to a new file). The files look like this:
ID*DATE*provider
1111*201101*1234
1234*201402*5678
3214*201003*9012
...
My Python script is
N100 = 10000000   ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(f):   ## <== MemoryError
            c0, c1, c2 = line.split("*")
            f2.write(c2 + "\n")
            if ind % N100 == 0:
                print(perc, "%")
                perc += 1
The above script runs fine on one file but gets stuck on the other at 62%. The error message is a MemoryError on the line for ind, line in enumerate(f):. I tried several times on different servers with different amounts of RAM, and the error is the same, always at 62%. I monitored the RAM for hours and watched it climb to 28 GB (out of 32 GB total) at 62%. So my guess is that the file contains a line that is somehow extremely long (maybe not terminated with \n?), and Python gets stuck trying to read it all into RAM.
So my question is: before I go back to my data provider, what can I do to detect the problem line and somehow get around it / skip reading it as one huge line? I'd appreciate any suggestions!
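For reference, the kind of check I had in mind is something like the sketch below: scan the suspect file in fixed-size binary chunks and record where the last \n occurs, so memory stays bounded no matter how long the "line" is. The file name "myFile2.txt" and the chunk size here are just placeholders I made up, not part of my real script:

CHUNK = 64 * 1024 * 1024   ## read 64 MB at a time so RAM stays bounded
with open("myFile2.txt", "rb") as f:   ## the problem file
    offset = 0          ## start offset of the current chunk
    last_newline = -1   ## absolute offset of the last b"\n" seen so far
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        pos = chunk.rfind(b"\n")
        if pos != -1:
            last_newline = offset + pos
        offset += len(chunk)
print("file size:", offset)
print("last newline at byte:", last_newline)
## if last_newline is far from the end of the file, everything after it
## is what Python is trying to read as one huge line

Is this a reasonable approach, or is there a better/standard way?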
EDIT:
The file, starting from the "error line", might have everything run together, with a different line separator than \n. If that's the case, can I detect the actual line separator and keep extracting the columns I want, rather than throwing that part away? A rough sketch of what I am hoping is possible is below. Thanks!
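This sketch assumes the separator turns out to be some single byte; SEP, START, and the file names are placeholders I made up (START would come from the scan above), not part of my real script:

START = 66_000_000_000    ## byte offset where normal lines stop (made-up value)
CHUNK = 64 * 1024 * 1024

with open("myFile2.txt", "rb") as f:
    f.seek(START)
    sample = f.read(1024 * 1024)   ## 1 MB sample of the messy region
counts = {b: sample.count(b) for b in (b"\r", b"\x00", b"\x1e", b"|")}
print(counts)                      ## eyeball which byte looks like a separator
SEP = max(counts, key=counts.get)  ## or set it by hand after inspection

with open("myFile2.txt", "rb") as f, open("myFile2_c2.txt", "ab") as out:
    f.seek(START)
    tail = b""                     ## partial record carried across chunks
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        records = (tail + chunk).split(SEP)
        tail = records.pop()       ## last piece may be cut off mid-record
        for rec in records:
            parts = rec.split(b"*")
            if len(parts) == 3:    ## skip malformed records instead of crashing
                out.write(parts[2] + b"\n")
    if tail:
        parts = tail.split(b"*")
        if len(parts) == 3:
            out.write(parts[2] + b"\n")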