- I have a list of strings containing 50 million search queries (each query is 1 to 500+ words long).
- I also have a list of strings containing 500 words and phrases. I need to return the indices of the search queries (1) that contain any word or phrase from (2).
The goal is to keep only the queries related to a certain topic (movies) and then use NLP to cluster the filtered queries (stemming -> tf_idf -> pca -> kmeans).
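For context, the clustering step I have in mind is roughly the sketch below. movie_queries stands in for the filtered query list, and the component and cluster counts are placeholders; I use TruncatedSVD in place of PCA, since scikit-learn's PCA does not accept the sparse tf-idf matrix.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

stemmer = PorterStemmer()

def stem_query(text):
    # Stem each whitespace-separated token.
    return ' '.join(stemmer.stem(tok) for tok in text.split())

docs = [stem_query(q) for q in movie_queries]  # movie_queries = filtered queries (placeholder)
X = TfidfVectorizer().fit_transform(docs)      # sparse tf-idf matrix
X_red = TruncatedSVD(n_components=100).fit_transform(X)  # PCA-style reduction that works on sparse input
labels = KMeans(n_clusters=20).fit_predict(X_red)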
I tried filtering the queries with nested loops, but it would take more than 10 hours to finish:
filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    next(f)  # skip the 'query\ttimestamp' header row
    for i, line in enumerate(f):
        # rsplit on the last tab so queries that themselves contain tabs still parse
        query, timestamp = line.rstrip('\n').rsplit('\t', 1)
        for word in key_words:
            if word in query:
                filtered.append(i)
                break  # don't append the same index twice
I looked into solutions that use a single regex alternation (word1|word2|...|wordN), but the examples I found combine all of the text into one large string, and I cannot do that because I need the index of each matching query to filter out the irrelevant ones.
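Applying the compiled pattern per line instead would keep the indices; a minimal sketch of that is below (calling re.escape on each phrase is my own addition, to keep special characters literal). I doubt a 500-branch alternation will be fast enough over 50 million lines, though.

import re

pattern = re.compile('|'.join(re.escape(w) for w in key_words))

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    next(f)  # skip the header row
    for i, line in enumerate(f):
        if pattern.search(line):
            filtered.append(i)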
UPDATE: examples of logs and keywords
search_logs.txt
'query  timestamp\n'
'the dark knight    2019-02-17 19:05:12\n'
'how to do a barrel roll    2019-02-17 19:05:13\n'
'watch movies   2019-02-17 19:05:13\n'
'porn   2019-02-17 19:05:13\n'
'news   2019-02-17 19:05:14\n'
'rami malek 2019-02-17 19:05:14\n'
'Traceback (most recent call last): File "t.py" 2019-02-17 19:05:15\n'
.......... # millions of other search queries
key_words = [
    'movie',
    'movies',
    'cinema',
    'oscar',
    'oscars',
    'george lucas',
    'ben affleck',
    'netflix',
    .... # hundreds of other words and phrases
]