Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?)

Question

My work environment mainly use PySpark, but doing some Googling, it is very complex to transpose in PySpark. I would like to keep it in PySpark but if it's much easier to do it in Pandas, I will convert the Spark dataframe to a Pandas dataframe. The dataset isn't so big where performance is an issue I would think.

I would like to transform a dataframe with multiple columns into rows:

Input:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Record          Hospital       Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4  
     1         Red Cross        1234 Street 429  Effective  Effective     Normal  Effective    
     2  Alberta Hospital   553 Alberta Road 441   Effecive     Normal     Normal  Effective
     3  General Hospital  994 Random Street 923     Normal  Effective     Normal  Effective

Output:

    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

Upon looking at PySpark examples, it is complicated: PySpark Dataframe melt columns into rows

And looking at Pandas example, it looks much easier. But there are many different Stack Overflow answers with some saying to use pivot, melt, stack, unstack, and more that it ends up being confusing.

So if anyone has an easy way to do this in PySpark, I am all ears. If not, I will happily take Pandas answers.

Thank you very much for your help!

IIUC, you can use `explode` in pyspark see https://stackoverflow.com/questions/55378047/pyspark-dataframe-melt-columns-into-rows — Umar.H, Oct 03 '20 at 00:53
Hello, I have removed the images and edited your question, but in the future please see: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — David Erickson, Oct 03 '20 at 01:28

Andy L. · Answer 1 · 2020-10-03T02:28:21.207

Here is pandas using stack

df_final =  (df.set_index(['Record', 'Hospital', 'Hospital Address'])
               .stack(dropna=False)
               .rename('Value')
               .reset_index()
               .rename({'level_3': 'Name'},axis=1)
               .assign(Record=lambda x: x.index+1))

Out[120]:
    Record          Hospital       Hospital Address       Name       Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

Hi Andy, can you explain to me why you use stack in this case instead of pivot/melt? — Anonymous, Oct 03 '20 at 01:55
@Anonymous: `pivot` turn values to index and columns. Yours is turning columns to values, so we can't use `pivot`. `Melt` is a possible candidate. However, `melt` will melt columns to this order `Medicine_1, Medicine_1, Medicine_1, Medicine_2, Medicine_2, Medicine_2,...`. You need a `sort_values` to make it `Medicine_1, Medicine_2, Medicine_3, Medicine_4, Medicine_1, Medicine_2, Medicine_3, Medicine_4...` while `stack` returns right away `Medicine_1, Medicine_2, Medicine_3, Medicine_1, Medicine_2, Medicine_3,...`. That is the reason I chose `stack` — Andy L., Oct 03 '20 at 02:15
@Anonymous: if you want sequential values in `Record` column, just chain additional `assign` to turn it to a sequence as in my updated answer — Andy L., Oct 03 '20 at 02:23

David Erickson · Accepted Answer · 2020-10-03T01:48:41.593

You can also use .melt and specify the id_vars. Everything else will be consider value_vars. The number of value_vars columns you have will multiply the number of rows in your dataframe by that number, stacking all of the column information from the four columns into one column and will duplicate the id_var columns into your desired format:

Dataframe setup:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Code:

df = (df.melt(id_vars=['Record','Hospital', 'Hospital Address'],
              var_name='Name',
              value_name='Value')
     .sort_values('Record')
     .reset_index(drop=True))
df['Record'] = df.index+1
df
Out[1]: 
    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

Hi David, can you explain to me why you chose to use melt here in this scenario as opposed to stack/pivot? Thank you. — Anonymous, Oct 03 '20 at 01:55
@Anonymous you can get to the same solution with multiple methods as you can see with Andy's `stack` answer. I just think `melt` can be a little bit cleaner, but Andy's stack answer works as well. It's 3 operation instead of 5, so a little bit more concise and perhaps efficient. — David Erickson, Oct 03 '20 at 01:57

Equinox · Answer 3 · 2020-10-06T08:31:28.940

It is fairly simple/easy to do it with pyspark too using stack.

# create sample data 
import pandas as pd
from pyspark.sql.functions import expr
panda_df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
df = spark.createDataFrame(panda_df)

# calculate
df.select("Hospital","Hospital Address", 
          expr("stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, \
          'Medicine_3', Medicine_3,'Medicine_4',Medicine_4) as (MedicinName, Effectiveness)")
         ).where("Effectiveness is not null").show()

Dynamic query generation in case of lot of columns:

The main idea here is to create the stack(x,a,b,c) dynamically. We can leverage python string formatting to make the dynamic sring.

index_cols= ["Hospital","Hospital Address"]
drop_cols = ['Record']
# Select all columns which needs to be pivoted down
pivot_cols = [c  for c in df.columns if c not in index_cols+drop_cols ]
# Create a dynamic stackexpr in this case we are generating stack(4,'{0}',{0},'{1}',{1}...)
# " '{0}',{0},'{1}',{1}".format('Medicine1','Medicine2') = "'Medicine1',Medicine1,'Medicine2',Medicine2"
# which is similiar to what we have previously
stackexpr = "stack("+str(len(pivot_cols))+","+",".join(["'{"+str(i)+"}',{"+str(i)+"}" for i in range(len(pivot_cols))]) +")"
df.selectExpr(*index_cols,stackexpr.format(*pivot_cols) ).show()

Output:

+----------------+--------------------+-----------+-------------+
|        Hospital|    Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
|       Red Cross|     1234 Street 429| Medicine_1|    Effective|
|       Red Cross|     1234 Street 429| Medicine_2|    Effective|
|       Red Cross|     1234 Street 429| Medicine_3|       Normal|
|       Red Cross|     1234 Street 429| Medicine_4|    Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1|     Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4|    Effective|
|General Hospital|994 Random Street...| Medicine_1|       Normal|
|General Hospital|994 Random Street...| Medicine_2|    Effective|
|General Hospital|994 Random Street...| Medicine_3|       Normal|
|General Hospital|994 Random Street...| Medicine_4|    Effective|
+----------------+--------------------+-----------+-------------+

Hey Venky, thanks for letting me know about the stack function in PySpark. In my actual dataframe, there's like 20+ medicines. It will take a long time to do 'Medicine_1', Medicine_1. Possible to skip alias and do a list instead? — Anonymous, Oct 05 '20 at 21:13
@Anonymous I have updated my answer to handle large number of columns and to generate data dynamicaaly. Would have been much cleaner/easier if stack was a pyspark function and not a sql function. — Equinox, Oct 06 '20 at 08:32

Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?)

3 Answers3