Filter pandas DataFrame by substring criteria

Question

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.

Something like this idiom:

re.search(pattern, cell_in_question)

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.

score 1344 · Answer 1 · edited Nov 27 '22 at 20:31

Vectorized string methods (i.e. Series.str) let you do the following:

df[df['A'].str.contains("hello")]

This is available in pandas 0.8.1 and up.
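
As a minimal sketch of the line above (the sample data is made up for illustration; only the column name 'A' comes from the question), the call returns a boolean mask that is then used for row selection:

```python
import pandas as pd

# Hypothetical sample data; the column name 'A' follows the question.
df = pd.DataFrame({'A': ['hello world', 'goodbye', 'say hello', 'Hello']})

# str.contains returns a boolean Series, one value per row.
# The match is case-sensitive by default, so 'Hello' is not selected.
mask = df['A'].str.contains('hello')

# Boolean indexing keeps only the rows where the mask is True.
result = df[mask]
print(result)
```

Note that the pattern is treated as a regular expression by default, which matters once the search string contains characters such as `.` or `+`.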

edited Nov 27 '22 at 20:31 by wjandrea

answered Jul 17 '12 at 21:52 by Garrett

7

How do we go about "Hello" and "Britain" if I want to find them with "OR" condition. – LonelySoul Jun 27 '13 at 16:41
129

Since str.* methods treat the input pattern as a regular expression, you can use `df[df['A'].str.contains("Hello|Britain")]` – Garrett Jun 27 '13 at 19:20
15

Is it possible to convert `.str.contains` to use [`.query()` api](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html#pandas.DataFrame.query)? – zyxue Mar 01 '17 at 17:25
5

@zyxue [Select rows by partial string query with pandas](https://stackoverflow.com/q/44933071/395857) – Franck Dernoncourt Jul 05 '17 at 18:01
9

`df[df['value'].astype(str).str.contains('1234.+')]` for filtering out non-string-type columns. – François Leblanc Feb 13 '18 at 20:22
What about "AND" condition - how do you look for multiple strings on the same time, so not just `"hello"`, but `["hello", "hey", "hi"]`? – NeStack Nov 30 '18 at 17:32
2

to "AND" substrings when the order of substrings is important/known, you could use `df[df.A.str.contains("STR1.*STR2")]`. if order is unimportant/unknown, `df[df.A.str.contains("STR1") & df.A.str.contains("STR2")]` – Garrett Nov 30 '18 at 19:35
1

@NeStack I've added more information about multiple substring searches [here](https://stackoverflow.com/a/55335207/4909087). – cs95 Mar 25 '19 at 20:20
2

If there are nulls in the column, one must also include the flag to ignore these (if desired): `df[df['A'].str.contains("hello", na=False)]` – defraggled Sep 29 '20 at 02:48
how to apply if else e.g if contains ;dothis else do this – G.ONE Jan 20 '21 at 12:46
As for now (pandas `1.3.4`), this throws an error `cannot index with vector containing NA / NaN values`, only the above solution posted by @sharon works. – Ibrahim.H Aug 09 '22 at 12:56
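
The "OR" and "AND" variants raised in the comments above could be sketched roughly as follows. The sample data and the `terms` list are assumptions for illustration, and `na=False` is included so that missing values do not break the boolean indexing:

```python
import re
import pandas as pd

# Hypothetical sample data, including one missing value.
df = pd.DataFrame({'A': ['hello world', 'world hello', 'hello', None]})
terms = ['hello', 'world']  # hypothetical search terms

# OR: join the escaped terms with the regex alternation operator.
or_mask = df['A'].str.contains('|'.join(map(re.escape, terms)), na=False)

# AND (any order): combine one literal-substring mask per term.
and_mask = pd.Series(True, index=df.index)
for term in terms:
    and_mask &= df['A'].str.contains(term, regex=False, na=False)

print(df[or_mask])
print(df[and_mask])
```

Building the "AND" mask term by term avoids having to encode every possible ordering of the substrings in a single regex.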

score 374 · Answer 2 · edited Nov 20 '21 at 15:58

I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:

df[df["A"].str.contains("Hello|Britain")]

and got an error:

cannot index with vector containing NA / NaN values

but it worked perfectly when an "==True" condition was added, like this:

df[df['A'].str.contains("Hello|Britain")==True]
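
A small sketch of why the comparison helps, using made-up data: a missing value yields NaN in the mask, and comparing with True turns that NaN into False. Passing `na=False` to `str.contains` reaches the same mask more directly.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (object dtype set explicitly).
s = pd.Series(['Hello there', np.nan, 'Great Britain', 'France'], dtype=object)

# Without na=..., the missing value produces NaN in the result.
raw = s.str.contains('Hello|Britain')

# NaN == True evaluates to False, so the comparison yields a clean boolean mask.
by_compare = raw == True

# The na=False flag fills missing results with False directly.
by_flag = s.str.contains('Hello|Britain', na=False)

print(s[by_flag])
```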

edited Nov 20 '21 at 15:58

answered Nov 10 '14 at 17:05 by sharon

17

`df[df['A'].astype(str).str.contains("Hello|Britain")]` worked as well – Nagabhushan S N Feb 05 '20 at 15:05
2

Another solution would be: `df[df["A"].str.contains("Hello|Britain") == True]` – Allan Sep 15 '21 at 19:36

score 301 · Answer 3 · edited Jul 25 '22 at 17:54

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern', na=False)

...and would like to know more about what methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)

Friendly disclaimer: this post is long.

Basic Substring Search

# setup
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.

Here is an example of regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes regex search is not required, so specify regex=False to disable it.

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
   
      col
0     foo
1  foobar

Performance wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don't need it.
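
Besides speed, the two modes can also give different answers when the search string contains a regex metacharacter. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: only the first value contains the literal text 'a.b'.
s = pd.Series(['a.b', 'axb', 'ab'])

# As a regex, '.' matches any single character, so 'axb' matches too.
as_regex = s.str.contains('a.b')

# With regex=False the pattern is a plain substring.
as_literal = s.str.contains('a.b', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```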

Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:

# By default (`axis=0`), `apply` passes each column to the lambda.
df.apply(lambda col: col.str.contains('foo|bar', na=False))

       A      B
0   True   True
1   True  False
2  False   True
3   True  False
4  False  False
5  False  False

All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).

If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
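
To turn per-column masks into a row filter, one option is to reduce them with `any(axis=1)`. This is a sketch under assumptions: the DataFrame and the explicit column list `cols` are invented for illustration, and the string columns are named by hand rather than discovered with select_dtypes.

```python
import pandas as pd

# Hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame({
    'A': ['foo', 'x', None],
    'B': ['y', 'z', 'bar'],
    'n': [1, 2, 3],
})
cols = ['A', 'B']  # string columns to search (chosen by hand here)

# One boolean column per searched column.
per_column = df[cols].apply(lambda col: col.str.contains('foo|bar', na=False))

# Keep a row if any of the searched columns matched.
row_mask = per_column.any(axis=1)
result = df[row_mask]
print(result)
```

Replacing `any` with `all` would instead require a match in every searched column.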

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...

. ^ $ * + ? { } [ ] \ | ( )

Then, you'll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so they're treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'
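
A small sketch of what can go wrong without escaping, using made-up terms that contain `+` and `.`:

```python
import re
import pandas as pd

# Hypothetical data and terms containing regex metacharacters.
s = pd.Series(['a+b', 'aab', 'cxd', 'c.d'])
terms = ['a+b', 'c.d']

# Unescaped: '+' acts as a quantifier and '.' as a wildcard.
unescaped = s.str.contains('|'.join(terms))

# Escaped: each term is matched literally.
escaped = s.str.contains('|'.join(map(re.escape, terms)))

print(unescaped.tolist())
print(escaped.tolist())
```

The unescaped pattern misses the literal 'a+b' and picks up 'aab' and 'cxd', which is usually not what was intended.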

Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).

For example,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

versus

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

Where p looks like this,

p
# '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

Use the in operator inside a list comp,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

Use re.compile (to cache your regex) + Pattern.search inside a list comp,

p = re.compile(regex_pattern)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

Use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

More Options for Partial String Matching: `np.char.find`, `np.vectorize`, `DataFrame.query`.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.

Recommended Usage Precedence

(First) str.contains, for its simplicity and ease of handling NaNs and mixed data
List comprehensions, for their performance (especially if your data is purely strings)
np.vectorize
(Last) df.query

Could you edit in the correct method to use when searching for a string in two or more columns? Basically: `any(needle in haystack for needling in ['foo', 'bar'] and haystack in (df['col'], df['col2']))` and variations I tried all choke (it complains about `any()` and rightly so... But the doc is blissfully unclear as to how to do such a query. — Denis de Bernardy, Jul 16 '19 at 11:37
@DenisdeBernardy `df[['col1', 'col2']].apply(lambda x: x.str.contains('foo|bar')).any(axis=1)` — cs95, Jul 28 '19 at 06:30
@cs95 [Extracting rows with substring containing whitespace after + in pandas df](https://stackoverflow.com/questions/57238715/extracting-rows-with-substring-containing-whitespace-after-in-pandas-df) It was answered soon, but you might want to have a look at it. — , Jul 28 '19 at 07:41
@ankiiiiiii Looks like you missed the part of my answer where I mentioned regex metacharacters: "Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters". — cs95, Jul 28 '19 at 07:53
@cs95 at that time, I thought only of space as the problem and searched this page and documentation page for the same. While writing the question, special char came to my mind and I got quick answers before I could search for it. Great answer of yours! +1 — , Jul 28 '19 at 07:55
Very helpful! Why do you need to insert `r` if regex = True by default? eg here: `df1[df1['col'].str.contains(r'foo(?!$)')]` — 00schneider, Aug 13 '19 at 10:54
@00schneider r in this case is used to indicate a raw string literal. These make it easier to write regular expression strings. https://stackoverflow.com/q/2081640/ — cs95, Aug 13 '19 at 13:11
@MurtazaHaji you can .apply() any of these solutions across multiple columns. For more info on how to apply an operation column wise take a look at https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code — cs95, Jun 05 '20 at 17:48
@MurtazaHaji Made an edit to my answer that shows you how to do that (it's in the "Basic Substring Search" section). — cs95, Jul 08 '20 at 07:24
I tried the list comprehension, but in my case it was 6x slower than `str.contains(my_string, regex=False)` — arno_v, Aug 03 '21 at 07:13
@arno_v That's good to hear, looks like pandas performance is improving! — cs95, Aug 05 '21 at 09:53
Extremely helpful !! Especially the 'import re' features are game changers. Chapeau! — Lorenzo Bassetti, Sep 07 '21 at 18:35
Such a great answer - thank you! For me the issue was not even realizing that the parens in my pattern were interpreted as a regex, and the warning didn't make that very clear. So I got the warning changed so it now says "**This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.", See https://github.com/pandas-dev/pandas/issues/44811 — nealmcb, Dec 08 '21 at 17:43
This answer has a lot of good information but it is hard to find. Including the recommended code snippets for each case in the introduction would help. For example, including `str.contains('pattern',na=False)` in the intro would be helpful. One of the two questions above this provides a different solution for ONLY this case -- so clearly this case is one of the most important to mention. I would be happy to do the edit if you like. — Josiah Yoder, Jun 29 '22 at 15:42
@JosiahYoder yes please I invite your edits as I am very inactive to do justice to post maintenance these days :( — cs95, Jul 23 '22 at 10:56

score 66 · Answer 4 · edited Feb 05 '18 at 23:08

If anyone wonders how to solve a related problem: "Select column by partial string"

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows whose index label contains a partial string, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
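
A minimal sketch of both calls on a made-up frame. Note that `filter` looks only at the labels (column names or index labels), never at the cell values:

```python
import pandas as pd

# Hypothetical frame: one column name and one index label contain 'hello'.
df = pd.DataFrame(
    {'hello_count': [1, 2], 'other': [3, 4]},
    index=['say hello', 'goodbye'],
)

# Columns whose name contains 'hello'.
by_column = df.filter(like='hello')

# Rows whose index label contains 'hello'.
by_index = df.filter(like='hello', axis=0)

print(by_column)
print(by_index)
```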

edited Feb 05 '18 at 23:08 by ayhan

answered Oct 12 '16 at 21:04 by Philipp Schwarz

8

This can be distilled to: `df.loc[:, df.columns.str.contains('a')]` – elPastor Jun 17 '17 at 21:53
21

which can be further distilled to `df.filter(like='a')` – Ted Petrou Oct 25 '17 at 02:57
this should be an own question + answer, already 50 people searched for it... – PV8 Jan 09 '20 at 09:35
2

@PV8 question already exists: https://stackoverflow.com/questions/31551412/how-to-select-dataframe-columns-based-on-partial-matching. But when I search on google for "pandas Select column by partial string", this thread appears first – Philipp Schwarz Jan 09 '20 at 09:37

score 29 · Answer 5 · answered Apr 10 '14 at 15:36

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]

answered Apr 10 '14 at 15:36 by Christian

5

You can just df[df.index.to_series().str.contains('LLChit')] – Yury Bayda May 08 '15 at 21:27
1

to be even more concise, `to_series` is not needed: `df[df.index.str.contains('Hello|Britain')]` – tdy Nov 02 '21 at 12:05

score 24 · Answer 6 · answered Apr 29 '20 at 17:31

Should you need to do a case insensitive search for a string in a pandas dataframe column:

df[df['A'].str.contains("hello", case=False)]
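
As a rough sketch with made-up data, `case=False` can be combined with `na=False` so that missing values do not end up in the mask:

```python
import pandas as pd

# Hypothetical data with mixed casing and one missing value.
s = pd.Series(['Hello world', 'HELLO', None, 'goodbye'])

# case=False ignores letter case; na=False treats missing values as non-matches.
mask = s.str.contains('hello', case=False, na=False)

print(s[mask])
```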

answered Apr 29 '20 at 17:31 by cardamom

score 22 · Answer 7 · answered Nov 10 '14 at 19:26

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
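
An alternative sketch that avoids `apply` is a list comprehension over the two columns zipped together. This is not part of the answer above; the extra third row is made up to show the difference between "contains" and "starts with":

```python
import pandas as pd

# Data from the answer plus one hypothetical extra row.
df = pd.DataFrame(
    [['hello', 'hello world'], ['abcd', 'defg'], ['world', 'hello world']],
    columns=['a', 'b'],
)

# True where the value in 'a' occurs anywhere in 'b'.
contains = [a in b for a, b in zip(df['a'], df['b'])]

# True only where 'b' begins with the value in 'a'.
starts = [b.startswith(a) for a, b in zip(df['a'], df['b'])]

print(df[contains])
print(df[starts])
```

Both lists can be used directly for boolean indexing. The sketch assumes both columns hold only strings.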

answered Nov 10 '14 at 19:26 by Mike

How do I modify above to say that x['a'] exists only in beginning of x['b']? – ComplexData Oct 18 '16 at 20:23
1

apply is a bad idea here in terms of performance and memory. See [this answer](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code). – cs95 Mar 25 '19 at 10:27

score 16 · Answer 8 · answered May 29 '21 at 08:16

You can try casting the values to strings first:

df[df['A'].astype(str).str.contains("Hello|Britain")]

answered May 29 '21 at 08:16

1

Thank you a lot, your answer helped me a lot as I was struggling to filter a dataframe via a column where the data was of bool type. Your solution helped me do the filter I needed. +1 for you. – Hasan Patel Jun 10 '21 at 20:11

score 9 · Answer 9 · edited Mar 30 '21 at 12:45

Suppose the DataFrame df has a column named "ENTITY". To keep only the rows whose "ENTITY" value does not contain "DM", build a boolean mask and negate it:

mask = df['ENTITY'].str.contains('DM')

df = df.loc[~(mask)].copy(deep=True)
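
If the column may contain missing values, negating the raw mask can fail, so a sketch with `na=False` might look like this. The sample data is made up; only the column name "ENTITY" and the search string come from the answer:

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'ENTITY': ['DM-1', 'AB-2', None, 'XDM']})

# na=False makes the mask purely boolean, so it can be negated safely.
mask = df['ENTITY'].str.contains('DM', na=False)

# Rows that do NOT contain 'DM'; the row with the missing value is kept.
filtered = df.loc[~mask].copy()
print(filtered)
```

Whether rows with missing values should survive the filter depends on the use case; passing `na=True` instead would drop them.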

edited Mar 30 '21 at 12:45 by Niels Henkens

answered Mar 30 '21 at 12:06 by Angeline Kingsteena

score 6 · Answer 10 · answered Jul 06 '12 at 17:08

Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.

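import re
from pandas import DataFrame, concat

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = DataFrame()
    # Originally written with Series.iteritems(), which was removed in pandas 2.0;
    # items() is the equivalent call.
    for idx, record in df[colName].items():
        # Collect every row whose value equals a matching record.
        if re.search(regex, record):
            newdf = concat([df[df[colName] == record], newdf], ignore_index=True)

    return newdf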

answered Jul 06 '12 at 17:08 by euforia

3

Should be 2x to 3x faster if you compile regex before loop: regex = re.compile(regex) and then if regex.search(record) – MarkokraM Apr 10 '14 at 13:56
1

@MarkokraM https://docs.python.org/3.6/library/re.html#re.compile says that the most recent regexs are cached for you, so you don't need to compile yourself. – Teepeemm Jun 20 '18 at 19:36
Do not use iteritems to iterate over a DataFrame. It ranks last in terms of pandorability and performance – cs95 Mar 25 '19 at 10:26
iterating over dataframes defeats the entire purpose of pandas. Use Garrett's solution instead – dhruvm Jul 22 '20 at 02:12

score 5 · Answer 11 · answered Nov 20 '19 at 13:22

Using contains didn't work well for my string with special characters. Find worked though.

df[df['A'].str.find("hello") != -1]
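
A likely reason is that contains treats the pattern as a regular expression by default, so characters such as `+` are not matched literally. As a sketch with made-up data, passing `regex=False` gives the same result as the find approach:

```python
import pandas as pd

# Hypothetical data; the needle contains the regex metacharacter '+'.
s = pd.Series(['I like c++', 'plain c', 'c+'])
needle = 'c++'

# str.find returns the position of the substring, or -1 if absent.
by_find = s.str.find(needle) != -1

# regex=False makes str.contains do a plain substring search as well.
by_literal = s.str.contains(needle, regex=False)

print(s[by_literal])
```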

answered Nov 20 '19 at 13:22 by Katu

score 5 · Answer 12 · answered Feb 16 '21 at 09:41

A more generalised example - if looking for parts of a word OR specific words in a string:

df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

Specific parts of sentence or word:

searchfor = '.*cat.*hat.*|.*the.*dog.*'

Create a column showing the affected rows (you can always filter on it as necessary):

df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

    col1             col2           TrueFalse
0   cat andhat       1000.0         True
1   hat              2000000.0      False
2   the small dog    1000.0         True
3   fog              330000.0       False
4   pet              330000.0       False

score 4 · Answer 13 · answered Feb 20 '20 at 13:06

Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning. This method is relatively slow, albeit convenient.

score 3 · Answer 14 · answered Feb 22 '22 at 20:23

Somewhat similar to @cs95's answer, but here you don't need to specify an engine:

df.query('A.str.contains("hello").values')

answered Feb 22 '22 at 20:23 by rachwa

score 2 · Answer 15 · answered Jun 01 '19 at 14:56

Earlier answers already cover what was asked, but I would like to show the most general way:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This lets you get the column you are looking for, however its name is written.

(Obviously, you have to write the proper regex for each case.)

answered Jun 01 '19 at 14:56 by xpeiro

1

This filters on the column *headers*. It isn't general, it's incorrect. – cs95 Jun 23 '19 at 05:18
@MicheldeRuiter that's still incorrect, that'd filter on index labels instead! – cs95 Dec 30 '19 at 18:35

score 2 · Answer 16 · edited Jul 19 '21 at 08:28

My 2c worth:

I did the following:

sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = \
    np.where(sale_method['Sale Method'].isin(['PRIVATE']),
             'private',
             np.where(sale_method['Sale Method']
                      .str.contains('AUCTION'),
                      'auction',
                      'other'
             )
    )

score 1 · Answer 17 · edited Oct 10 '22 at 06:12

df[df['A'].str.contains("hello", case=False)]

edited Oct 10 '22 at 06:12 by buddemat

answered Oct 04 '22 at 11:41 by usman Abbasi

2

Please consider adding an explanation to your code how it works and how it answers the OP's question. – buddemat Oct 10 '22 at 06:12
