Inspired by this answer, I'm trying to find cosine similarity between a trained trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector and not a new query
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
 
     
     
     
    