- I have a file of 300 million lines (inputFile), each with 2 columns separated by a tab.
- I also have a list of 1000 unique items (vals).
I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile whose first column occurs in vals. A few items in vals do not occur in the file; these have to be saved in a new list. I can use up to 20 threads to speed up this process.
What is the fastest way to achieve this?
My best attempt so far:
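import subprocess

newDict = {}
foundVals = set()
# Build one alternation in which every value is anchored to the start of the line
pattern = "\\|".join("^" + str(val) + "[[:space:]]" for val in vals)
cmd = "grep \"" + pattern + "\" " + self.inputFile
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
# Iterate over the text-mode pipe line by line until grep closes it
for line in p.stdout:
    info = line.split()
    foundVals.add(info[0])
    newDict[info[0]] = info[1]
p.wait()
# grep output is text, so compare the string form of each value
notFound = [x for x in vals if str(x) not in foundVals]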
Example inputFile:
2       9913
3       9913
4       9646
...
594592886       32630
594592888       32630
594592890       32630
vals:
[1,2,594592888]
Wanted dictionary:
{2:9913,594592888:32630}
And in notFound:
[1]
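
For comparison, the same task can be sketched in plain Python without grep: put vals in a set and make a single pass over the file. This is only a minimal sketch, not a benchmarked answer. The function name filter_pairs is hypothetical, and it assumes both columns hold integers, as in the example above.

```python
def filter_pairs(path, vals):
    """Single pass over a tab-separated file.

    Hypothetical helper: assumes both columns hold integers, as in the example.
    Returns (found, not_found).
    """
    # Store the wanted keys as bytes so each line can be tested without decoding it
    wanted = {str(v).encode() for v in vals}
    found = {}
    with open(path, "rb") as handle:
        for line in handle:
            # Split on the first tab only; sep is empty if the line has no tab
            key, sep, value = line.partition(b"\t")
            if sep and key in wanted:
                found[int(key)] = int(value.strip())
    # Preserve the order of vals for the items that never appeared
    not_found = [v for v in vals if v not in found]
    return found, not_found
```

Called as filter_pairs(inputFile, vals) on the example data, this should give {2: 9913, 594592888: 32630} and [1]. Set membership is a constant-time lookup on average, so the cost is dominated by reading the 300 million lines; whether this beats grep would have to be measured.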
 
     
     
     
    