I am trying to see if any values change in my dataframe, from row to row. One column in particular has a lot of None values.
If I check for inequality between None values, element-wise, it gives the answer I expect (that is, None != None returns False), but not when I do that same comparison on the column.
Here's a minimal working example:
>>> example_data = [None]*3 + ['a','b','b'] + [None]*4
>>> df = pd.DataFrame(example_data, columns = ['col1'])
>>> print df
col1
0 None
1 None
2 None
3 a
4 b
5 b
6 None
7 None
8 None
9 None
To see if the value changed from one line to the next, I'm shifting everything and comparing the values:
>>> did_it_change = (df != df.shift(1))
>>> print did_it_change
col1
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
So long as None is not involved in the comparison, it's accurate (e.g. row 5 changed the value to b from its previous value of b, so it did not change).
If I check a single element, it works as expected:
>>> df.loc[1,'col1'] != df.loc[2,'col1']
False
But that's not the answer that appears in the full column: did_it_change.loc[1,:] is True.
I do get the expected outcome if I use "" instead of None, but replacing None with empty string like in this question seems arbitrary and superfluous.
This is the outcome I'd expect.
>>> example_data = [""]*3 + ['a','b','b'] + [""]*4
>>> df = pd.DataFrame(example_data, columns = ['col1'])
>>> did_it_change = (df != df.shift(1))
>>> print did_it_change
col1
0 True
1 False
2 False
3 True
4 True
5 False
6 True
7 False
8 False
9 False
What's going on here? Why does the comparison with None give different answers if I'm calling it on a dataframe column instead of its elements?