How to flatten each row of a pandas dataframe?

Question

I have a pandas dataframe:

                             state  action  reward  absorb
0   [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]     0.0     0.0   False
1   [0.0, 0.0, 4.0, 4.0, 5.0, 0.0]     3.0     1.0   False
2   [0.0, 0.0, 0.0, 2.0, 0.0, 1.0]     5.0     1.0   False

...

and I would like to convert this dataframe into

    s1  s2  s3  s4  s5  s6  action  reward
0  1.0 2.0 0.0 0.0 0.0 0.0     0.0     0.0
1  0.0 0.0 4.0 4.0 5.0 0.0     3.0     1.0

...

where I decompose my first column into several columns. How would I do that easily?

Thank you!

Use `df[['s1', 's2','s3', 's4','s5', 's6']] = df['state'].apply(pd.Series)`. — harvpan, Aug 07 '18 at 21:23
That's not quite so easy. Are all the lists the same length? What do you expect in their place if not? — roganjosh, Aug 07 '18 at 21:23

sacuL · Accepted Answer · 2018-08-07T23:09:27.137

To avoid using apply (which could be slow for a large dataframe):

new_df = pd.concat([df[['action', 'reward', 'absorb']],
                    pd.DataFrame(df.state.tolist(),
                                 columns = [f's{i}' for i in range(1,7)])],
                   axis=1)

>>> new_df
   action  reward  absorb   s1   s2   s3   s4   s5   s6
0     0.0     0.0   False  1.0  2.0  0.0  0.0  0.0  0.0
1     3.0     1.0   False  0.0  0.0  4.0  4.0  5.0  0.0
2     5.0     1.0   False  0.0  0.0  0.0  2.0  0.0  1.0
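For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same idea. The sample frame is rebuilt by hand from the three rows shown in the question (an assumption, since the full data is elided), and `states` is just an illustrative name. Passing `index=df.index` is an addition to the snippet above: `pd.concat(..., axis=1)` aligns on the index, so this keeps rows matched up if `df` does not have the default 0..n-1 index.

```python
import pandas as pd

# Sample data copied by hand from the three rows shown in the question.
df = pd.DataFrame({
    "state": [[1.0, 2.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 4.0, 5.0, 0.0],
              [0.0, 0.0, 0.0, 2.0, 0.0, 1.0]],
    "action": [0.0, 3.0, 5.0],
    "reward": [0.0, 1.0, 1.0],
    "absorb": [False, False, False],
})

# Expand each list into its own row of columns. index=df.index keeps the
# new frame aligned with df even when df has a non-default index.
states = pd.DataFrame(df["state"].tolist(), index=df.index,
                      columns=[f"s{i}" for i in range(1, 7)])

# Put the s-columns first to match the layout requested in the question.
new_df = pd.concat([states, df[["action", "reward"]]], axis=1)
print(new_df)
```

This assumes every list in `state` has exactly six elements, as in the question; with lists of different lengths the fixed `columns=` list would no longer fit.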

Benchmarks:

On a moderately sized dataframe, this gives a large speedup over `apply`. I've also added vectorized solutions by @piRSquared (see comments) for comparison.

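import numpy as np
import pandas as pd

# Create a dataframe of 1000 rows by resampling the original `state` lists
df = pd.DataFrame({'state': np.random.choice(df.state.values, size = 1000),
                   'action': np.random.randint(0,10,1000),
                   'reward': np.random.randint(0,10,1000),
                   # draw 1000 booleans; the original call, np.random.choice([True, False, 1000]),
                   # drew a single value, which is why `absorb` is a constant 1 in the head() output
                   'absorb': np.random.choice([True, False], 1000)})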

>>> df.head()
   absorb  action  reward                           state
0       1       6       8  [0.0, 0.0, 0.0, 2.0, 0.0, 1.0]
1       1       3       2  [0.0, 0.0, 4.0, 4.0, 5.0, 0.0]
2       1       8       3  [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]
3       1       4       2  [0.0, 0.0, 0.0, 2.0, 0.0, 1.0]
4       1       6       3  [0.0, 0.0, 4.0, 4.0, 5.0, 0.0]

def concat_method(df1 = df.copy()):
    return pd.concat([df1[['action', 'reward', 'absorb']],
                    pd.DataFrame(df1.state.tolist(),
                                 columns = [f's{i}' for i in range(1,7)])],
                   axis=1)


def apply_method(df1 = df.copy()):
    df1[['s1', 's2','s3', 's4','s5', 's6']] = df1['state'].apply(pd.Series)
    return df1

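def piR_method(df1 = df.copy()):
    # start enumerate at 1 so the new columns are named s1..s6
    return df1.assign(**dict((f"s{i}", z) for i, z in enumerate(zip(*df1.state), 1))).drop(columns='state')

def piR_method2(df1 = df.copy()):
    # drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
    return df1.drop(columns='state').join(pd.DataFrame(df1.state.tolist(), df1.index).rename(columns=lambda x: f"s{x + 1}"))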

def pir3(df=df):
    mask = df.columns.values != 'state'
    vals = df.values
    state = vals[:, np.flatnonzero(~mask)[0]].tolist()
    other = vals[:, mask]
    newv = np.column_stack([other, state])
    cols = df.columns.values[mask].tolist()
    sss = [f"s{i}" for i in range(1, max(map(len, state)) + 1)]

    return pd.DataFrame(newv, df.index, cols + sss)


import timeit

>>> timeit.timeit(concat_method, number = 100) / 100
0.0020290906500304118
>>> timeit.timeit(apply_method, number = 100) / 100
0.19950388665980426
>>> timeit.timeit(piR_method, number = 100) / 100
0.003522267839871347
>>> timeit.timeit(piR_method2, number = 100) / 100
0.002374379680259153
>>> timeit.timeit(pir3, number = 100)  # total for all 100 runs; unlike the calls above, not divided by 100
0.17464107400155626
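If you want to reproduce the comparison yourself, the following is a rough, self-contained sketch rather than the exact setup used above. The synthetic data, the `bench` frame, and the helper names `via_concat` and `via_apply` are all assumptions made for illustration. It first checks that the two approaches produce the same frame, then times each one; the numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic benchmark data (an assumption, not the data from the question).
rng = np.random.default_rng(0)
n = 1000
bench = pd.DataFrame({
    "state": rng.random((n, 6)).tolist(),   # each cell holds a 6-element list
    "action": rng.integers(0, 10, n),
    "reward": rng.integers(0, 10, n),
    "absorb": rng.choice([True, False], n),
})

cols = [f"s{i}" for i in range(1, 7)]

def via_concat(d):
    # Build the state columns in one shot from a list of lists.
    states = pd.DataFrame(d["state"].tolist(), index=d.index, columns=cols)
    return pd.concat([d.drop(columns="state"), states], axis=1)

def via_apply(d):
    # Row-by-row expansion: one pd.Series is constructed per row.
    out = d.drop(columns="state")
    out[cols] = d["state"].apply(pd.Series)
    return out

# Both approaches should yield identical frames before timing them.
pd.testing.assert_frame_equal(via_concat(bench), via_apply(bench))

# Average seconds per call; fewer repetitions for the slower apply version.
t_concat = timeit.timeit(lambda: via_concat(bench), number=20) / 20
t_apply = timeit.timeit(lambda: via_apply(bench), number=5) / 5
print(f"concat: {t_concat:.5f} s per call")
print(f"apply:  {t_apply:.5f} s per call")
```

Passing the frame through a `lambda` avoids the default-argument trick used above, so each function always sees the current `bench`.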

Umm, I can't test but I'm thinking `concat` is slower than apply — roganjosh, Aug 07 '18 at 21:31
@roganjosh Unlikely, as `apply` is basically iterating row by row. See benchmarks in my edited answer. — sacuL, Aug 07 '18 at 21:40
That surprises me. Have an upvote (and thanks for timing it) — roganjosh, Aug 07 '18 at 21:45
Try this one `df.assign(**dict((f"s{i}", z) for i, z in enumerate(zip(*df.state), 1))).drop('state', 1)` — piRSquared, Aug 07 '18 at 21:47
Or `df.drop('state', 1).join(pd.DataFrame(df.state.tolist(), df.index).rename(columns=lambda x: f"s{x + 1}"))` This is just a variant of yours. — piRSquared, Aug 07 '18 at 21:51
They're fast! Especially that second method (see updated benchmarks). `concat` beats it, but by a hair, it's a pretty negligible difference — sacuL, Aug 07 '18 at 21:53
