I have three pandas DataFrames (`train_original`, `train_augmented`, `test`), about 700k rows in total. I would like to remove all city names from a list, `common_cities`, from the name columns. But the tqdm progress bar in my notebook cell suggests it would take about 24 hours to run the replacement for the full list of 33,000 cities.

DataFrame example (`train_original`):
| id | name_1 | name_2 | 
|---|---|---|
| 0 | sun blinds decoration paris inc. | indl de cuautitlan sa cv | 
| 1 | eih ltd. dongguan wei shi | plastic new york product co., ltd. | 
| 2 | jsh ltd. (hk) mexico city | arab shipbuilding seoul and repair yard madrid c | 
`common_cities` list example:

    common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
Expected output:
| id | name_1 | name_2 | 
|---|---|---|
| 0 | sun blinds decoration inc. | indl de sa cv | 
| 1 | eih ltd. wei shi | plastic product co., ltd. | 
| 2 | jsh ltd. (hk) | arab shipbuilding and repair yard c | 
My solution worked well with a small list of filter words, but with a large list the performance is poor:

```python
%%time
import re
from tqdm import tqdm

# One full-DataFrame pass per city: ~33k regex scans over ~700k rows each
for city in tqdm(common_cities):
    pattern = re.compile(fr'\b{re.escape(city)}\b')  # escape dots etc. in city names
    train_original.replace(pattern, '', inplace=True)
    train_augmented.replace(pattern, '', inplace=True)
    test.replace(pattern, '', inplace=True)
```
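One direction I'm considering (not yet benchmarked): compile all the cities into a single alternation pattern and do one vectorized `str.replace` per column, so each column is scanned once instead of once per city. A minimal sketch using the small example list from above; the trailing whitespace cleanup is my own assumption, based on the expected output having no double spaces:

```python
import re
import pandas as pd

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

# One combined pattern; sort longest-first so multi-word names like
# 'mexico city' are tried before any shorter names they might contain.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(map(re.escape, sorted(common_cities, key=len, reverse=True)))
    + r')\b'
)

df = pd.DataFrame({
    'name_1': ['sun blinds decoration paris inc.', 'jsh ltd. (hk) mexico city'],
    'name_2': ['indl de cuautitlan sa cv',
               'arab shipbuilding seoul and repair yard madrid c'],
})

for col in ['name_1', 'name_2']:
    df[col] = (
        df[col]
        .str.replace(pattern, '', regex=True)   # single pass over the column
        .str.replace(r'\s+', ' ', regex=True)   # collapse leftover double spaces
        .str.strip()
    )
```

Would this scale sensibly to 33k alternatives, or does the giant alternation defeat the purpose?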
P.S.: I presume splitting each string and filtering tokens with a list comprehension is not a good fit here, because a city name can span more than one word.
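To illustrate what I mean, a toy example:

```python
# Token-by-token filtering misses multi-word cities: 'mexico' and 'city'
# are separate tokens, so neither equals 'mexico city' on its own.
name = 'jsh ltd. (hk) mexico city'
cities = {'mexico city'}
filtered = ' '.join(w for w in name.split() if w not in cities)
# filtered is unchanged: 'jsh ltd. (hk) mexico city'
```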
Any suggestions or ideas for making fast replacements on pandas DataFrames in situations like this?