Python newbie and stackoverflow posting newbie.
My goal is to create a python script which will take two files entered at the command line and drop columns from the first file if column headers are present within the second file, and write the output to a new file.
I've tried several approaches to this, and currently I am attempting to use Pandas DataFrame.drop
On a very small test set, I can achieve the removal of columns by manually specifying headers in a string (thanks to Delete column from pandas DataFrame), but can't figure out how to import a list of column headers from a file and format them correctly for DataFrame.drop.
I have two files One large: quarter of a million rows and up to 21,000 columns The columns relate to samples, and the rows relate to genetic markers.
I also have a smaller file containing up to a 1000 sample IDs, which correspond to the column headers in the large file. These relate to columns that I wish to drop from the large file.
I have attempted many things (creating lists, creating labels), one example below, but failed.
I would be grateful if anyone could point me in the right direction.
large file
    Name     Chr     Position     8077686010_R04C02.GType     8077686010_R04C02.X     8077686010_R04C02.Y     8131566005_R01C02.GType     8131566005_R01C02.X     8131566005_R01C02.Y
exm-rs1000026     21     38934599     NC     0.0144234     1.112413     NC     0.01250324     1.084685
exm-rs1000053     2     12790328     NC     0.04906762     1.495594     NC     0.07344548     1.552252
exm-rs1000110     9     117908721     NC     0.02433169     1.314785     NC     0.05954991     1.356415
exm-rs1000113     5     150240076     NC     0.015468     0.793373     NC     0.02498361     0.8621324
exm-rs1000158     20     36599904     NC     0.01016421     0.7593179     NC     0.4537758     0.5095596
exm-rs1000192     16     6747139     NC     0.01774782     0.8661015     NC     0.01103768     0.9004255
exm-rs1000203     14     40896108     NC     0.7707067     0.006222768     NC     0.7400684     0.003768863
smallerfile
8077686010_R04C02.GType
8077686010_R04C02.X
8077686010_R04C02.Y
outfile
   Name     Chr     Position     8131566005_R01C02.GType     8131566005_R01C02.X     8131566005_R01C02.Y
exm-rs1000026     21     38934599     NC     0.01250324     1.084685
exm-rs1000053     2     12790328     NC     0.07344548     1.552252
exm-rs1000110     9     117908721     NC     0.05954991     1.356415
exm-rs1000113     5     150240076     NC     0.02498361     0.8621324
exm-rs1000158     20     36599904     NC     0.4537758     0.5095596
exm-rs1000192     16     6747139     NC     0.01103768     0.9004255
exm-rs1000203     14     40896108     NC     0.7400684     0.003768863
Working code
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
new_data = largefile.drop(['8077686010_R04C02.GType','8077686010_R04C02.X','8077686010_R04C02.Y',], axis=1)
new_data.to_csv(outfile,sep="\t")
Failing code - one of many
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
dropcols = open("smallerfile",'r').read().split('\n')
new_data = largefile.drop(dropcols, axis=1)
new_data.to_csv(outfile,sep="\t")
List generated
['8131566005_R01C02.GType', '8131566005_R01C02.X', '8131566005_R01C02.Y', '8131566013_R02C01.GType', '8131566013_R02C01.X', '8131566013_R02C01.Y', '']
Output
Traceback (most recent call last):
File "my.py", line 59, in <module>
new_data = largefile.drop(dropcolslst, axis=1)
File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 174, in drop
new_axis = axis.drop(labels)
File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 881, in drop
raise ValueError('labels %s not contained in axis' % labels[mask])
ValueError: labels ["] not contained in axis
 
     
     
    