Reshaping dataframes in pandas based on column labels

Question

What is the best way to reshape the following dataframe in pandas? This DataFrame df has x,y values for each sample (s1 and s2 in this case) and looks like this:

In [23]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [24]: df
Out[24]: 
       s1_x      s1_y      s2_x      s2_y
0  0.913462  0.525590 -0.377640  0.700720
1  0.723288 -0.691715  0.127153  0.180836
2  0.181631 -1.090529 -1.392552  1.530669
3  0.997414 -1.486094  1.207012  0.376120
4 -0.319841  0.195289 -1.034683  0.286073
5  1.085154 -0.619635  0.396867  0.623482
6  1.867816 -0.928101 -0.491929 -0.955295
7  0.920658 -1.132057  1.701582 -0.110299
8 -0.241853 -0.129702 -0.809852  0.014802
9 -0.019523 -0.578930  0.803688 -0.881875

s1_x and s1_y are the x/y values for sample 1, s2_x, s2_y are the sample values for sample 2, etc. How can this be reshaped into a DataFrame containing only x, y columns but that contains an additional column sample that says for each row in the DataFrame whether it's from s1 or s2? E.g.

          x         y      sample
0  0.913462  0.525590          s1
1  0.723288 -0.691715          s1
2  0.181631 -1.090529          s1
3  0.997414 -1.486094          s1
...
5  0.396867  0.623482          s2
...

This is useful for plotting things with Rpy2 later on, since many R plotting features can make use of this grouping variable, so that's my motivation for reshaping the dataframe.

I think the answer given by Chang She doesn't translate to dataframes that have a unique index, like this one:

In [636]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10), "names": range(10)})
In [637]: df
Out[637]: 
   names      s1_x      s1_y      s2_x      s2_y
0      0  0.672298  0.415366  1.034770  0.556209
1      1  0.067087 -0.851028  0.053608 -0.276461
2      2 -0.674174 -0.099015  0.864148 -0.067240
3      3  0.542996 -0.813018  2.283530  2.793727
4      4  0.216633 -0.091870 -0.746411 -0.421852
5      5  0.141301 -1.537721 -0.371601 -1.594634
6      6  1.267148 -0.833120  0.369516 -0.671627
7      7 -0.231163 -0.557398  1.123155  0.865140
8      8  1.790570 -0.428563  0.668987  0.632409
9      9 -0.820315 -0.894855  0.673247 -1.195831
In [638]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [639]: df.stack(0).reset_index(1)
Out[639]: 
  level_1         x         y
0      s1  0.672298  0.415366
0      s2  1.034770  0.556209
1      s1  0.067087 -0.851028
1      s2  0.053608 -0.276461
2      s1 -0.674174 -0.099015
2      s2  0.864148 -0.067240
3      s1  0.542996 -0.813018
3      s2  2.283530  2.793727
4      s1  0.216633 -0.091870
4      s2 -0.746411 -0.421852
5      s1  0.141301 -1.537721
5      s2 -0.371601 -1.594634
6      s1  1.267148 -0.833120
6      s2  0.369516 -0.671627
7      s1 -0.231163 -0.557398
7      s2  1.123155  0.865140
8      s1  1.790570 -0.428563
8      s2  0.668987  0.632409
9      s1 -0.820315 -0.894855
9      s2  0.673247 -1.195831

The transformation worked but in the process the column "names" was lost. How can I keep the "names" column in the df while still doing the melting transformation on the columns that have _ in their names? The "names" column just assigns a unique name to each row in the dataframe. It's numeric here for example but in my data they are string identifiers.

thanks.

Chang She · Accepted Answer · 2013-02-27T22:12:56.040

I'm assuming you already have the DataFrame. In which case you can just turn the columns into a MultiIndex and use stack then reset_index. Note that you'll then have to rename and reorder the columns and sort by sample to get exactly what you posted in the question:

In [4]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})

In [5]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [6]: df.stack(0).reset_index(1)
Out[6]: 
  level_1         x         y
0      s1  0.897994 -0.278357
0      s2 -0.008126 -1.701865
1      s1 -1.354633 -0.890960
1      s2 -0.773428  0.003501
2      s1 -1.499422 -1.518993
2      s2  0.240226  1.773427
3      s1 -1.090921  0.847064
3      s2 -1.061303  1.557871
4      s1 -1.697340 -0.160952
4      s2 -0.930642  0.182060
5      s1 -0.356076 -0.661811
5      s2  0.539875 -1.033523
6      s1 -0.687861 -1.450762
6      s2  0.700193  0.658959
7      s1 -0.130422 -0.826465
7      s2 -0.423473 -1.281856
8      s1  0.306983  0.433856
8      s2  0.097279 -0.256159
9      s1  0.498057  0.147243
9      s2  1.312578  0.111837

You can save the MultiIndex conversion if you can just create the DataFrame with a MultiIndex instead.

Edit: use merge to join original ids back in

In [59]: df
Out[59]: 
   names      s1_x      s1_y      s2_x      s2_y
0      0  0.732099  0.018387  0.299856  0.737142
1      1  0.914755 -0.798159 -0.732868 -1.279311
2      2 -1.063558  0.161779 -0.115751 -0.251157
3      3 -1.185501  0.095147 -1.343139 -0.003084
4      4  0.622400 -0.299726  0.198710 -0.383060
5      5  0.179318  0.066029 -0.635507  1.366786
6      6 -0.820099  0.066067  1.113402  0.002872
7      7  0.711627 -0.182925  1.391194 -2.788434
8      8 -1.124092  1.303375  0.202691 -0.225993
9      9 -0.179026  0.847466 -1.480708 -0.497067

In [60]: id = df.ix[:, ['names']]

In [61]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [62]: pandas.merge(df.stack(0).reset_index(1), id, left_index=True, right_index=True)
Out[62]: 
  level_1         x         y  names
0      s1  0.732099  0.018387      0
0      s2  0.299856  0.737142      0
1      s1  0.914755 -0.798159      1
1      s2 -0.732868 -1.279311      1
2      s1 -1.063558  0.161779      2
2      s2 -0.115751 -0.251157      2
3      s1 -1.185501  0.095147      3
3      s2 -1.343139 -0.003084      3
4      s1  0.622400 -0.299726      4
4      s2  0.198710 -0.383060      4
5      s1  0.179318  0.066029      5
5      s2 -0.635507  1.366786      5
6      s1 -0.820099  0.066067      6
6      s2  1.113402  0.002872      6
7      s1  0.711627 -0.182925      7
7      s2  1.391194 -2.788434      7
8      s1 -1.124092  1.303375      8
8      s2  0.202691 -0.225993      8
9      s1 -0.179026  0.847466      9
9      s2 -1.480708 -0.497067      9

Alternatively:

    In [64]: df
Out[64]: 
   names      s1_x      s1_y      s2_x      s2_y
0      0  0.744742 -1.123403  0.212736  0.005440
1      1  0.465075 -0.673491  1.467156 -0.176298
2      2 -1.111566  0.168043 -0.102142 -1.072461
3      3  1.226537 -1.147357 -1.583762 -1.236582
4      4  1.137675  0.224422  0.738988  1.528416
5      5 -0.237014 -1.110303 -0.770221  1.389714
6      6 -0.659213  2.305374 -0.326253  1.416778
7      7  1.524214 -0.395451 -1.884197  0.524606
8      8  0.375112 -0.622555  0.295336  0.927208
9      9  1.168386 -0.291899 -1.462098  0.250889

In [65]: df = df.set_index('names')

In [66]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [67]: df.stack(0).reset_index(1)
Out[67]: 
      level_1         x         y
names                            
0          s1  0.744742 -1.123403
0          s2  0.212736  0.005440
1          s1  0.465075 -0.673491
1          s2  1.467156 -0.176298
2          s1 -1.111566  0.168043
2          s2 -0.102142 -1.072461
3          s1  1.226537 -1.147357
3          s2 -1.583762 -1.236582
4          s1  1.137675  0.224422
4          s2  0.738988  1.528416
5          s1 -0.237014 -1.110303
5          s2 -0.770221  1.389714
6          s1 -0.659213  2.305374
6          s2 -0.326253  1.416778
7          s1  1.524214 -0.395451
7          s2 -1.884197  0.524606
8          s1  0.375112 -0.622555
8          s2  0.295336  0.927208
9          s1  1.168386 -0.291899
9          s2 -1.462098  0.250889

Could you explain the `reset_index` call and what it is doing here? Also, is it possible to get rid of the `level_1` index? I also don't see how I can access the labels columns for each row using this — , Feb 17 '13 at 03:50
After the call to stack, "s1 s2 s1 s2 ..." is part of the row labels (index). reset_index simply makes it a column instead. After this it's still a DataFrame so index/columns access is still the same. — Chang She, Feb 17 '13 at 14:40
I modified my question to reflect a problem with this solution for uniquely indexed dataframes. Do you have an idea of how this solution could be adapted to those? thank you — , Feb 27 '13 at 05:26
If that's the case, use merge or set the name to be the index to begin with. I'm updating the solution with this — Chang She, Feb 27 '13 at 22:06

Reshaping dataframes in pandas based on column labels

1 Answers1

Linked