I am running code that has always worked for me. This time I ran it on two .csv files: "data" (24 MB) and "data1" (475 MB). "data" has 3 columns of about 680,000 values each, whereas "data1" has 3 columns of about 33,000,000 values each. When I run the code, it prints just "Killed: 9" after about 5 minutes of processing. If this is a memory problem, how can I solve it? Any suggestion is welcome!
This is the code:
import csv
import sys
import numpy as np
from collections import OrderedDict  # to preserve key order
from numpy import genfromtxt

# Load both files completely into memory as byte-string arrays
my_data = genfromtxt('data.csv', dtype='S',
                     delimiter=',', skip_header=1)
my_data1 = genfromtxt('data1.csv', dtype='S',
                      delimiter=',', skip_header=1)

d = OrderedDict((rows[2], rows[1]) for rows in my_data)  # position -> rs ID ("data")
d1 = dict((rows[0], rows[1]) for rows in my_data1)       # position -> V4 ("data1")
dset = set(d)                      # keys of d
d1set = set(d1)                    # keys of d1
d_match = dset.intersection(d1)    # positions present in both files

sys.stdout = open("rs_pos_ref_alt.csv", "w")
for row in my_data:
    if row[2] in d_match:
        print [row[1], row[2]]
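To check whether it really is memory, I thought I could print the peak memory use of the process right after the two genfromtxt calls; a minimal sketch using the standard-library resource module (Unix only; ru_maxrss is reported in bytes on macOS and in kilobytes on Linux):

import resource

# Peak resident set size of this process so far
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak memory so far: %d" % peak)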
The header of "data" is:
    dbSNP RS ID Physical Position
0   rs4147951   66943738
1   rs2022235   14326088
2   rs6425720   31709555
3   rs12997193  106584554
4   rs9933410   82323721
5   rs7142489   35532970
The header of "data1" is:
    V2  V4  V5
10468   TC  T
10491   CC  C
10518   TG  T
10532   AG  A
10582   TG  T
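If the cause is that genfromtxt cannot hold the 33-million-row file as one in-memory array, would streaming "data1" with the csv module be the right way to solve it? A rough, untested sketch of what I mean, using the same column indices as my code above (it writes proper CSV rows instead of printing Python lists, and only the small file is kept in memory):

import csv

# Keep the small file ("data", ~680,000 rows) in memory, as before
with open('data.csv') as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header line
    small_rows = list(reader)

positions = set(row[2] for row in small_rows)   # column 2 = Physical Position

# Stream the 475 MB file line by line instead of loading it with genfromtxt,
# remembering only the positions that also occur in "data"
matched = set()
with open('data1.csv') as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header line
    for row in reader:
        if row[0] in positions:           # column 0 = V2 (position)
            matched.add(row[0])

# Write the same [rs ID, position] pairs as before
with open("rs_pos_ref_alt.csv", "w") as out:
    writer = csv.writer(out)
    for row in small_rows:
        if row[2] in matched:
            writer.writerow([row[1], row[2]])

The final loop still goes over the rows of "data", so the matches should come out in the same order as in my original code.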