CSV_1.csv has the structure:
ABC
DEF
GHI
JKL
MNO
PQR
CSV_2.csv has the structure:
XYZ
DEF
ABC
CSV_2.csv is a lot smaller than CSV_1.csv, and many of the rows in CSV_2.csv also appear in CSV_1.csv. I want to find out whether there are rows that exist in CSV_2.csv but not in CSV_1.csv.
These files are not sorted.
The bigger CSV has close to 10 million rows; the smaller one has around 7 million rows.
How would I go about doing this? I tried Python, but taking each row from CSV_2.csv and comparing it against the 10 million rows in CSV_1.csv takes a lot of time.
Here is what I tried in Python:
    with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()

    with open('update.csv', 'a') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)
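The slow part of the attempt above is most likely `line not in fileone`, which scans a list of about 10 million entries for every row of the smaller file. A minimal sketch of a set-based variant follows, assuming the larger file fits in memory as a set of strings and that rows only count as equal when they match exactly. The function name `rows_only_in_second` is made up for illustration.

```python
def rows_only_in_second(first_path, second_path, out_path):
    """Write rows of second_path that never occur in first_path; return how many were written."""
    with open(first_path, 'r') as first:
        # Build a set once so each later lookup is a hash probe, not a list scan.
        # Line endings are stripped so a missing final newline cannot cause a false mismatch.
        seen = {line.rstrip('\r\n') for line in first}

    missing = 0
    with open(second_path, 'r') as second, open(out_path, 'w') as out:
        for line in second:
            row = line.rstrip('\r\n')
            if row not in seen:
                out.write(row + '\n')
                missing += 1
    return missing
```

It would be called as `rows_only_in_second('old.csv', 'new.csv', 'update.csv')`. Unlike the attempt above, this sketch opens the output with `'w'` rather than `'a'`, so repeated runs overwrite the result instead of appending to it.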
awk comes to mind. What would the exact code be for awk?