I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X. Each row is a sample vector in a 2,000,000 dimensional space.
I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances function with a subset of vectors in X which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist function.
I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist as it requires a dense matrix X which does not again fit in memory. I thought about looping through different chunks of X and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?
Any help is much much appreciated. Thanks in advance.
Below is a reproducible example (of course for demonstration purposes X is much smaller):
from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.A, 'cosine')
As you see dist2 is in the condensed form and is a 499500 dimensional vector. But dist1 is in the symmetric squareform and is a 1000x1000 matrix.