I have a dataframe like this:
peakID cytoband start end length 10.388_116 10.193_156 10.401_184 10.214_385
0 Amp_2q37.3_chr2:237990001-242193529 2q37.3 237990001 242193529 4203528 1 0 0 0
1 Del_2q37.3_chr2:226990001-242193529 2q37.3 226990001 242193529 15203528 -1 0 0 0
Notice how peakID is different, but cytoband is not. I need to unpivot this table (using a function from pyjanitor) without keeping peakID. Currently I do:
import pandas as pd
import pyjanitor
from natsort import natsort_keygen
table = (
table
.drop(columns="peakID")
.pivot_longer(index=["cytoband", "start", "end", "length"],
names_to="sample", values_to="state")
.sort_values(["cytoband", "sample"], key=natsort_keygen())
.remove_columns(["length", "start", "end"])
.set_index("cytoband")
)
And the end result looks like this:
table.loc["2q37.3", :]
Out[36]:
sample state
cytoband
2q37.3 10.193_156 0
2q37.3 10.193_156 0
2q37.3 10.214_385 0
2q37.3 10.214_385 0
2q37.3 10.388_116 1
2q37.3 10.388_116 -1
2q37.3 10.401_184 0
2q37.3 10.401_184 0
The problem lies in the fact that if cytoband is duplicated in different peakIDs, the resulting table will have the two records (state) for each sample mixed up (as they don't have the relevant unique ID anymore).
The idea would be to suffix the duplicate records across distinct peakIDs (e.g. "2q37.3_A", "2q37.3_B", but I'm not sure on how to do that with groupby or pandas in general as I need information from more than one group.
What's the cleanest solution to do this? Existing solutions (or this one) don't really fit.