I have two list objects: wiki_text and corpus. wiki_text is made up of small phrases and corpus is made up of long sentences.
wiki_text = ['never ending song of love - ns.jpg',
 'ecclesiological society',
 "1955-56 michigan wolverines men's basketball team",
 'sphinx strix',
 'petlas',
 '1966 mlb draft',
 ...]
corpus = ['Substantial progress has been made in the last twenty years',
          'Patients are at risk for prostate cancer.',...]
My goal is to filter wiki_text down to the elements that appear as substrings of elements in corpus. For example, if 'ecclesiological society' occurs inside some sentence in corpus, it should be kept in the final result. The final result should therefore be a subset of the original wiki_text. This is the code I used before:
def wiki_filter(wiki_text, corpus):
    result = []
    for phrase in wiki_text:
        # Scan sentences until one contains this phrase, then stop early
        for sentence in corpus:
            if phrase in sentence:
                result.append(phrase)
                break
    return result
However, given the lengths of wiki_text and corpus (each over 10 million elements), this nested loop is O(len(wiki_text) * len(corpus)) substring checks in the worst case, and the function takes many hours to run. Is there a better way to solve this problem?
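For reference, one direction I have seen suggested is multi-pattern matching (e.g. Aho-Corasick, available as the pyahocorasick package), which scans the corpus once for all phrases simultaneously. Below is a dependency-free sketch of a related idea: pre-index every word-aligned sub-phrase of the corpus into a set, then test each wiki_text entry with an O(1) lookup. Note the assumptions: `wiki_filter_fast` and `build_phrase_set` are names I made up, matching is case-insensitive and only on whole-word boundaries (unlike the raw `in` check above), and the phrase set can use a lot of memory for a 10-million-sentence corpus.

```python
def build_phrase_set(corpus, max_words):
    """Collect every contiguous run of up to max_words words from each
    sentence, lowercased, into a set for O(1) membership tests."""
    phrases = set()
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words)):
            # All spans starting at word i, at most max_words long
            for j in range(i + 1, min(i + max_words, len(words)) + 1):
                phrases.add(' '.join(words[i:j]))
    return phrases

def wiki_filter_fast(wiki_text, corpus):
    """Keep the wiki_text entries that occur (word-aligned, case-insensitive)
    inside some corpus sentence."""
    # Longest phrase bounds how many word-spans we need to index
    max_words = max((len(p.split()) for p in wiki_text), default=0)
    phrases = build_phrase_set(corpus, max_words)
    return [p for p in wiki_text if p.lower() in phrases]
```

This trades memory for time: the cost becomes roughly O(total corpus words * max phrase length) to build the index, plus one lookup per wiki_text entry, instead of re-scanning the whole corpus for every phrase.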