I have a Spark DataFrame with two columns: `id` and `hash_vector`.
`id` is the document id, and `hash_vector` is a SparseVector of word counts for that document (of size 30000). There are ~100000 rows (one per document) in the DataFrame.
Now, I want to find similarities between every pair of documents. For this, I want to compute cosine similarities from the `hash_vector` column. I may also want to try other similarity measures like the Jaccard index. What would be a good way of doing this? I am using PySpark. I have a few ideas:
- I can use `columnSimilarities()` to find the pairwise cosine similarities (a rough sketch of this is after the list). But I read that it is more efficient for a corpus with size_of_vocabulary >> number_of_documents, which is not the case here.
- I can loop through the DataFrame rows, and for the i-th row, add the i-th row's vector as a new column `new_column` to the DataFrame, then write a `udf` which finds the similarity (cosine or Jaccard) between the two columns `hash_vector` and `new_column`. But I read that looping through rows defeats the whole purpose of using Spark. (A cross-join variant of this udf idea, which avoids the explicit loop, is sketched after the list.)
- Lastly, I store only similarities above a certain threshold. Since I have a lot of documents, I can expect the matrix of similarities to be quite sparse.
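For the first idea, this is roughly what I had in mind. It is only a sketch: it assumes `id` is a 0-based integer index and that `hash_vector` is an mllib-style `SparseVector` (an `ml` vector column would need converting first), and the transpose step is there because `columnSimilarities()` compares columns, so the documents have to end up as columns:

```python
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Assumes `id` is a 0-based integer index and `hash_vector` is an
# mllib SparseVector (ml vectors would need converting first).
mat = IndexedRowMatrix(
    df.rdd.map(lambda row: IndexedRow(row.id, row.hash_vector))
)

# columnSimilarities() compares *columns*, so transpose the matrix
# so that each document becomes a column.
transposed = mat.toCoordinateMatrix().transpose().toRowMatrix()

# Returns a CoordinateMatrix of cosine similarities; with a nonzero
# threshold, DIMSUM sampling may drop pairs below that threshold.
sims = transposed.columnSimilarities(threshold=0.1)

# CoordinateMatrix entries are (i, j, similarity) triples.
sims_df = sims.entries.map(lambda e: (e.i, e.j, e.value)) \
                      .toDF(["doc_i", "doc_j", "cosine_sim"])
```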
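For the second and third ideas, I was thinking a self-join plus a `udf` could replace the explicit row loop, and the threshold filter would keep the stored result sparse. Again only a sketch: the 0.8 threshold is a placeholder, and it assumes `id` values can be compared with `<` and that the vectors expose `dot()` and `norm()`:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def cosine_sim(v1, v2):
    # Cosine similarity of two SparseVectors (both ml and mllib vectors
    # expose dot() and norm()).
    denom = float(v1.norm(2)) * float(v2.norm(2))
    return float(v1.dot(v2)) / denom if denom else 0.0

cosine_udf = F.udf(cosine_sim, DoubleType())

# Self-join on id_a < id_b so every unordered pair appears exactly once.
pairs = (
    df.alias("a")
      .join(df.alias("b"), F.col("a.id") < F.col("b.id"))
      .select(
          F.col("a.id").alias("id_a"),
          F.col("b.id").alias("id_b"),
          cosine_udf(F.col("a.hash_vector"), F.col("b.hash_vector")).alias("sim"),
      )
)

# Keep only pairs above a threshold (0.8 is just a placeholder) so the
# stored similarity "matrix" stays sparse.
similar_pairs = pairs.filter(F.col("sim") > 0.8)
```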
I know this is a broad question. But I am interested in knowing how an expert would think of going about this. I appreciate any direction.