How to replace 'any strings' with nan in pandas DataFrame using a boolean mask?

Question

I have a 227x4 DataFrame with country names and numerical values to clean (wrangle ?).

Here's an abstraction of the DataFrame:

import pandas as pd
import random
import string
import numpy as np
pdn = pd.DataFrame(["".join([random.choice(string.ascii_letters) for i in range(3)]) for j in range (6)], columns =['Country Name'])
measures = pd.DataFrame(np.random.random_integers(10,size=(6,2)), columns=['Measure1','Measure2'])
df = pdn.merge(measures, how= 'inner', left_index=True, right_index =True)

df.iloc[4,1] = 'str'
df.iloc[1,2] = 'stuff'
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4

How do I replace string values with np.nan in all columns without touching the country names?

I tried using a boolean mask:

mask = df.loc[:,measures.columns].applymap(lambda x: isinstance(x, (int, float))).values
print(mask)

[[ True  True]
 [ True False]
 [ True  True]
 [ True  True]
 [False  True]
 [ True  True]]

# I thought the following would replace by default false with np.nan in place, but it didn't
df.loc[:,measures.columns].where(mask, inplace=True)
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4


# this give a good output, unfortunately it's missing the country names
print(df.loc[:,measures.columns].where(mask))

  Measure1 Measure2
0        6        3
1        3      NaN
2        7        2
3        7        8
4      NaN        3
5        7        4

I have looked at several questions related to mine ([1], [2], [3], [4], [5], [6], [7], [8]), but could not find one that answered my concern.

"A meta-question, Is it normal that it takes me more than 3 hours to formulate a question here (including research) ?" – Yes. The success of [so] and the entire Stack Exchange network is predicated on the high quality of its content, both questions and answers. You can't throw together a high quality question in a couple of minutes. Personally, I'd put the required effort more on the order of days than hours. I certainly have spent an entire day or more on an answer, and I expect the asker to expend at least an order of magnitude more effort, since he's the one getting the benefit. — Jörg W Mittag, Oct 29 '17 at 17:57
@JörgWMittag I was just counting the time put to write the question after I gave up trying on my own. If I had to count that it would be in days indeed. I'll make a question in meta when I have a few more hours in front of me. I was feeling dumb taking so much time to ask my question. But I feel better now and the quality of the answer is a proof that it was well worth the effort. Thank you! — Malik Koné, Oct 29 '17 at 18:09

score 13 · Accepted Answer · answered Oct 29 '17 at 14:38

13

Assign only columns of interest:

cols = ['Measure1','Measure2']
mask = df[cols].applymap(lambda x: isinstance(x, (int, float)))

df[cols] = df[cols].where(mask)
print (df)
  Country Name Measure1 Measure2
0          uFv        7        8
1          vCr        5      NaN
2          qPp        2        6
3          QIC       10       10
4          Suy      NaN        8
5          eFS        6        4

A meta-question, Is it normal that it takes me more than 3 hours to formulate a question here (including research) ?

In my opinion yes, create good question is really hard.

answered Oct 29 '17 at 14:38

jezrael

822,522
95
1,334
1,252

I like you asnwer, but why `df2= df.loc[:,measures.columns].where(mask, inplace=True)` does not do the replacement ? While `df.loc[:,measures.columns].where(mask)` prints out correctly. – Malik Koné Oct 29 '17 at 14:45
Because `inplace` always return `None`, so `df2` is None – jezrael Oct 29 '17 at 14:47
I've edited the question.. I don't understand why df.loc[:,measures.columns].where(mask, inplace=True) does not modify df ? – Malik Koné Oct 29 '17 at 15:11
1

I think there is problem with assign to copy of `df`, same problem like `fillna` in [this](https://stackoverflow.com/a/43642196/2901002). If change your code to `df[measures.columns].where(mask)` get warning. – jezrael Oct 29 '17 at 15:11

score 12 · Answer 2 · edited Oct 29 '17 at 21:30

12

Use numeric with errors coerce i.e

cols = ['Measure1','Measure2']
df[cols] = df[cols].apply(pd.to_numeric,errors='coerce')

 Country Name  Measure1  Measure2
0          PuB       7.0       6.0
1          JHq       2.0       NaN
2          opE       4.0       3.0
3          pxl       3.0       6.0
4          ouP       NaN       4.0
5          qZR       4.0       6.0

edited Oct 29 '17 at 21:30

leo

8,106
7
48
80

answered Oct 29 '17 at 14:56

Bharath M Shetty

30,075
6
57
108

2

I think we can get rid of `lambda` in this case: `df[cols] = df[cols].apply(pd.to_numeric, errors='corece')` – MaxU - stand with Ukraine Oct 29 '17 at 15:15
1

@Bharathshetty, your answer is too good (if that is possible). I will indeed coerce the string to numeric values but this was not clear to me when I formulated the question. My focus was on how to use the boolean mask and why the inplace did not work. – Malik Koné Oct 29 '17 at 18:21
1

@Bharathshetty I think one should read `errors=coerce` instead of `errors=corece` – Malik Koné Oct 29 '17 at 18:29
That was a small typo. Sorry for that – Bharath M Shetty Oct 30 '17 at 06:05

MaxU - stand with Ukraine · Answer 3 · 2017-10-29T15:08:42.513

9

cols = ['Measure1','Measure2']
df[cols] = df[cols].applymap(lambda x: x if not isinstance(x, str) else np.nan)

or

df[cols] = df[cols].applymap(lambda x: np.nan if isinstance(x, str) else x)

Result:

In [22]: df
Out[22]:
  Country Name  Measure1  Measure2
0          nBl      10.0       9.0
1          Ayp       8.0       NaN
2          diz       4.0       1.0
3          aad       7.0       3.0
4          JYI       NaN      10.0
5          BJO       9.0       8.0

edited Oct 29 '17 at 15:08

answered Oct 29 '17 at 14:49

MaxU - stand with Ukraine

205,989
36
386
419

But why the negation `x if not isinstance(x, str)` instead of `x if isinstance(int,float) else `np.nan` ? – Malik Koné Oct 29 '17 at 15:04
1

That will replace all numbers with nan if you dont need negation then `x: np.nan if isinstance(x, str) else x` – Bharath M Shetty Oct 29 '17 at 15:06
I don't want to replace number.. I want to replace non numbers with nan – Malik Koné Oct 29 '17 at 15:07
@MalikKoné, I think you want to use [Bharath shetty's solution](https://stackoverflow.com/a/47001627/5741205) – MaxU - stand with Ukraine Oct 29 '17 at 15:09
All three answers are very interesting for me... My focus is on understanding I don't have to optimization of physical resources yet. :o) – Malik Koné Oct 29 '17 at 15:15
@MalikKoné, if your goal is to clean up numeric columns - i.e. replace all values to numeric dtypes, and replacing those that can't be converted to numeric dtypes to NaN's, then Bharath shetty's solution is the most idiomatic approach. If you want to replace cells of specific dtypes with NaN's then you can choose between jezrael's and mine solution... – MaxU - stand with Ukraine Oct 29 '17 at 15:17

How to replace 'any strings' with nan in pandas DataFrame using a boolean mask?

3 Answers3

Linked