grouping the data in a text file using python

Question

I have a tab separated text file like this small example:

small example:

chr15   60644581    60644675    ANXA2   ENST00000332680.4   annotated
chr15   60643922    60644018    ANXA2   ENST00000332680.4   annotated
chr15   60643391    60643450    ANXA2   ENST00000332680.4   annotated
chr15   60641273    60641396    ANXA2   ENST00000332680.4   annotated
chr15   60639831    60639888    ANXA2   ENST00000332680.4   annotated
chr14   25443870    25444024    STXBP6  ENST00000550887.1   annotated
chr14   25326232    25326363    STXBP6  ENST00000550887.1   annotated
chr14   25325141    25325307    STXBP6  ENST00000550887.1   annotated
chr14   25288242    25288400    STXBP6  ENST00000550887.1   annotated

I want to make another tab separated file with 6 columns but less rows. in fact I want to summarize the rows. rows with the same 4th column must be summarized in one row in fact at first I need to group the rows based on 4th column. the new row has the same 1st, 4th, 5th and 6th column but its 2nd column comes from the first row of each group and the 3rd column comes from the last row of each group. in the small example we have 2 groups based on 4th column so for the first group we would have this row:

chr15   60644581    60639888    ANXA2   ENST00000332680.4   annotated

and the expected output for the small example would look like this:

expected output:

chr15   60644581    60639888    ANXA2   ENST00000332680.4   annotated
chr14   25443870    25288400    STXBP6  ENST00000550887.1   annotated

I am trying to do that in python using the following command but it does not return the expected output. do you know how to fix it?

import pandas as pd
p = pd.read_csv("myfile.txt", sep='\t')
data = p.groupby('4th column')
mylist = []
for i in range(len(data)):
    new_line = 'data[0], data[1], data[2], data[3], data[4], data[5]'
    mylist.append(new_line)
    df = pd.DataFrame(mylist)
    df.to_csv('outfile.txt', sep='\t', index=None, header=False)

looks like you need to initialize your `pd.DataFrame` before the loop and write to csv also after the loop - cuz you are overwriting your .csv file on each iteration — Chiefir, Dec 28 '18 at 13:48

score 2 · Answer 1 · answered Dec 28 '18 at 14:14

You do not specify what should be the grouped values of 1st, 5th and 6th columns so I will assume that any value will do, and will use first as with the 2nd column.

What you are looking for is the aggregation after groupby which would give the proper values. Pass a dictionary of {column name: aggregation function name} pairs to agg, especially {**"2": "first"**, **"3": "last"**}.

import pandas as pd
p = pd.read_csv("myfile.txt", names=["1", "2", "3", "4", "5", "6"], header=None, sep='\t')

df = p.groupby("4").agg({"1": "first", "2": "first", "3": "last", "5": "first", "6": "first"})

df.reset_index(inplace=True)
df = df.reindex(sorted(df.columns), axis=1)
df.to_csv('outfile.txt', sep='\t', index=False, header=None)

grouping the data in a text file using python

1 Answers1

Linked