Yes, you can! It is going to be a little bit messy so let me construct in a few steps:
First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):
- We will use
re.findall(pattern, string) to find all the occurrences of pattern in string
- The regex pattern will basically say "any of the words":
- for the "any" we use the
| (or) operator
- for constructing words we need to use the parenthesis to group together... However, parenthesis
(word) create a group that keeps track, so we could later call reuse these groups, since we are not interested we can create a non-capturing group by adding ?: as follows: (?:word)
import re
re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']
Now, we have to construct the pattern each time:
- Split into words
- Wrap each word inside a non-capturing group
(?:)
- Join all of these groups by
|
re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')
One minor thing, since the last row's vendor is empty and you seem to want no matches (technically, the empty string matches with everything) we have to add a minor check. So we can rewrite your function to be:
def check_subset_regex(vendor, employee):
if vendor == '':
return []
pattern = '|'.join(['(?:'+x+')' for x in vendor.lower().split(' ')])
return re.findall(pattern, employee)
And then we can apply the same way:
df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)
One final comment is that your solution matches partial words, so employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not give this as a match, should you want to do this the regex would become a little more complicated.
EDIT: Removing punctuation marks from vendors
You could either add a new column as you did with clean_employee, or simply add the removal to the function, as so (you will need to import string to get the string.punctuation, or just add in there a string with all the symbols you want to substitute):
def check_subset_regex(vendor, employee):
if vendor == '':
return []
clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
pattern = '|'.join(['(?:'+x+')' for x in clean_vnd.lower().split(' ')])
return re.findall(pattern, employee)
In the spirit of teaching to fish :), in regex the [] denote any of these characters... So [abc] would be the same to a|b|c.
So the re.sub line will substitute any occurrence of the string.punctuation (which evaluates to !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) characters by a '' (removing them).
EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:
def check_subset_regex(vendor, employee):
if vendor == '':
return []
clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
pattern = '|'.join(['(?:'+x+'[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
return re.findall(pattern, employee)
In this case we are using:
- ^ as the first character inside a [] (called character class), denotes any character except for those specified in the character class, e.g. [^abc] would match anything that is not a or b or c (so d, or a white space, or @)
- and the ?, which means the previous symbol is optional...
So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.