I am working on a series of tab-delimited files which have a slightly odd structure. They are created with the bam-readcount package and contain sequence data and variant calls for each position in a short read of DNA sequence.
At some positions there are no variant calls; at others there can be many. The number of tab-separated columns in each row depends on the number of variant calls made (each variant occupies a new column). For example:
234    A    3bp_del    4bp_ins
235    G
236    G.   15bp_ins   3bp_del    5bp_del
The difficulty arises when parsing the file with pandas using:
import pandas as pd
df = pd.read_csv(FILE, sep='\t')
This returns an error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5
The error occurs because pandas determines the number of columns it expects from the number of columns in the first row. I have a clumsy workaround, which prepends a header row with multiple column names to the file before parsing, but it always adds the same number of names, so it still fails whenever a row has more calls than the header allows for. Example:
Pos    Ref  Call1      Call2       Call3
234    A    3bp_del    4bp_ins
235    G
236    G.   15bp_ins   3bp_del    5bp_del
I'm looking for a way to count the number of tabs in the row with the greatest number of columns, so that I can write a script to add that many column headers as the first line of each file before parsing.
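Something along these lines is what I have in mind, as a rough sketch rather than a tested solution. It scans the file once to find the widest row, then passes generated column names to pandas instead of rewriting the file. The helper names `max_field_count` and `read_ragged` are my own, and the `Pos`/`Ref`/`CallN` column names are just the ones from my workaround above. It assumes the file has no header row and that no field contains an embedded tab.

```python
import pandas as pd


def max_field_count(lines, sep="\t"):
    """Return the largest number of sep-delimited fields on any line."""
    # Fields on a line = separators + 1; strip only the line ending so
    # trailing tabs (empty trailing fields) are still counted.
    return max((line.rstrip("\r\n").count(sep) + 1 for line in lines), default=0)


def read_ragged(path, sep="\t"):
    """Read a headerless, ragged tab-delimited file into a DataFrame."""
    # First pass: find the widest row in the file.
    with open(path) as handle:
        n_cols = max_field_count(handle, sep)

    # Two fixed columns, then one "CallN" column per remaining field.
    names = ["Pos", "Ref"] + [f"Call{i}" for i in range(1, n_cols - 1)]

    # Second pass: with explicit names, shorter rows are padded with NaN.
    return pd.read_csv(path, sep=sep, header=None, names=names)
```

Calling `df = read_ragged(FILE)` on the three-row example above should give columns `Pos`, `Ref`, `Call1`, `Call2`, `Call3`, with missing calls filled in as `NaN`. The file is read twice, which seems acceptable for files of this size.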
 
    