Compare multiple pandas columns (1st and 2nd, then 3rd and 4th, and so on) with vectorization (preferred) or another method

Question

This code compares var1 and var2 against a set of conditions and creates Result1 from the matching choices (this code works well):

# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition

import pandas as pd
import numpy as np

# Creating one column from two columns. We assume that in every row there is one NaN and one value, and that value fills the new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


# Each condition joined with a logical operator (&, |, etc.) must be wrapped in parentheses.
conditions = [
    (df["var1"].notna()) & (df['var2'].notna()),
    (pd.isna(df["var1"])) & (pd.isna(df["var2"])),
    (df["var1"].notna()) & (pd.isna(df["var2"])),
    (pd.isna(df["var1"])) & (df['var2'].notna())]

choices = ["Both values", np.nan, df["var1"], df["var2"]]

df['Result1'] = np.select(conditions, choices, default=np.nan)

df looks as it should:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         |

Now I want to compare multiple pairs of pandas columns (in my example var1 and var2, then var3 and var4, then var5 and var6) and, based on the same conditions and choices, create a corresponding result column for each pair (in my example Result1, Result2, Result3). I thought the best way would be to use vectorization (for better performance). The df I want to get should look like:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     | Result2     | Result3     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values | nan         | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values | Both values | nan         |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           | nan         | Both values |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           | Both values | nan         |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         | Both values | Both values |

I tried this:

import pandas as pd
import numpy as np

# Creating one column from two columns. We assume that in every row there is one NaN and one value, and that value fills the new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

# Each condition joined with a logical operator (&, |, etc.) must be wrapped in parentheses.
conditions = [
    (df[col1].notna()) & (df[col2].notna()),
    (pd.isna(df[col1])) & (pd.isna(df[col2])),
    (df[col1].notna()) & (pd.isna(df[col2])),
    (pd.isna(df[col1])) & (df[col2].notna())]

choices = ["Both values", np.nan, df[col1], df[col2]]

df[colR] = np.select(conditions, choices, default=np.nan)

But it gave me an error:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Question: How can I achieve my goal with vectorization (preferred for performance) or another method?

Always share the entire error message. – AMC Feb 08 '20 at 21:59

ALollz · Accepted Answer · 2020-02-06T20:54:19.560

2

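The issue is that pandas DataFrames force alignment on their labels (index and columns), but df[col1] and df[col2] have no overlapping columns, so each combined condition ends up with the union of the six columns rather than three, and its shape no longer matches the choices.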
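A minimal sketch of that alignment, using a small hypothetical two-row frame rather than the question's data: combining two boolean DataFrames with `&` matches on column labels, so with disjoint column sets the result carries every column from both sides, whereas the underlying NumPy arrays are combined purely by position.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, for illustration only
demo = pd.DataFrame({
    "var1": ["a", np.nan],
    "var2": [1, np.nan],
    "var3": [np.nan, "x"],
    "var4": [np.nan, 4],
})
left = demo[["var1", "var3"]].notna()
right = demo[["var2", "var4"]].notna()

# Label-based: the two frames share no column names, so pandas
# aligns them onto the union of the columns before applying &
aligned = left & right
print(aligned.shape)

# Position-based: plain arrays are combined element by element
positional = left.to_numpy() & right.to_numpy()
print(positional.shape)
print(positional)
```

The first shape has four columns and the second has two, which is the same mismatch `np.select` runs into when the conditions are built from DataFrames and the choices have only three columns.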

In this case, you really want to work with the underlying numpy arrays. Also, because .isnull() is the opposite of .notnull(), you can simplify this a lot. We'll use concat to add the new columns back.

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

  var1  var2 var3  var4 var5  var6      Result1      Result2      Result3
0    a   1.0  NaN   NaN    a   1.0  Both values          NaN  Both values
1    b   2.0    x   4.0  NaN   NaN  Both values  Both values          NaN
2    c   NaN  NaN   NaN    b   2.0            c          NaN  Both values
3  NaN   4.0    y   5.0  NaN   NaN            4  Both values          NaN
4  NaN   NaN    z   6.0    c   3.0          NaN  Both values  Both values
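
For reference, here is a self-contained sketch of the same positional idea that can be run as-is. It is a variation, not the code above: it swaps `np.select` for `np.where` plus a boolean-mask assignment on object-dtype arrays, an assumption made here so that strings, numbers, and NaN can share one array without depending on NumPy's dtype promotion rules. The helper name `combine_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "var1": ["a", "b", "c", np.nan, np.nan],
    "var2": [1, 2, np.nan, 4, np.nan],
    "var3": [np.nan, "x", np.nan, "y", "z"],
    "var4": [np.nan, 4, np.nan, 5, 6],
    "var5": ["a", np.nan, "b", np.nan, "c"],
    "var6": [1, np.nan, 2, np.nan, 3],
})


def combine_pairs(frame, left_cols, right_cols, result_cols):
    """Hypothetical helper: compare left_cols[i] with right_cols[i] by position."""
    # Object arrays can hold strings, floats and NaN side by side
    left = frame[left_cols].to_numpy(dtype=object)
    right = frame[right_cols].to_numpy(dtype=object)
    left_missing = frame[left_cols].isna().to_numpy()
    right_missing = frame[right_cols].isna().to_numpy()

    # Left missing -> take the right value (NaN if that is missing too);
    # otherwise take the left value
    result = np.where(left_missing, right, left)
    # Both present -> overwrite with the label
    result[~left_missing & ~right_missing] = "Both values"
    return pd.DataFrame(result, columns=result_cols, index=frame.index)


results = combine_pairs(
    df,
    ["var1", "var3", "var5"],
    ["var2", "var4", "var6"],
    ["Result1", "Result2", "Result3"],
)
out = pd.concat([df, results], axis=1)
print(out)
```

Because the numeric columns are floats, a value taken from var2, var4, or var6 is carried over as a float in this sketch.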

edited Feb 06 '20 at 20:54

answered Feb 06 '20 at 20:02

ALollz


Thank you. Can you please explain why the `null`-related methods should be used instead of the `NA` ones? – vasili111 Feb 07 '20 at 14:18
@vasili111 `DataFrame.isna` is an alias of `DataFrame.isnull`, so they do exactly the same thing. I personally prefer the `isnull` wording because I think it's more precise in describing what's happening. `pandas` also has the top-level function `pd.isnull(array-like-object)`, which can operate on array-like objects. You could use that too, i.e. `pd.isnull(Series)`, but people generally use the DataFrame/Series methods instead, e.g. `Series.isnull()` – ALollz Feb 07 '20 at 15:12
Do you have any thoughts on why your code can give a nan string when using `np.nan` and a missing value when using `pd.NA`? More here: https://stackoverflow.com/questions/60570118/difference-between-nan-and-nan-and-error-that-maybe-caused-by-that – vasili111 Mar 06 '20 at 20:23
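
A quick way to check the alias point from ALollz's comment above, sketched on a hypothetical two-element Series: both spellings, whether as methods or as top-level functions, return the same boolean mask.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan])  # hypothetical sample data

# Method forms give identical masks
assert s.isna().equals(s.isnull())
assert s.notna().equals(s.notnull())

# Top-level function forms accept array-likes as well as Series
assert pd.isna(s).equals(pd.isnull(s))
mask = pd.isnull(np.array([1.0, np.nan]))
print(mask)  # [False  True]
```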
