I have a vocabulary of about 50,000 terms and a corpus of about 20,000 documents in a Pandas DataFrame like this:
import pandas as pd
vocab = {"movie", "good", "very"}
corpus = pd.DataFrame({
    "ID": [100, 200, 300],
    "Text": ["It's a good movie", "Wow it's very good", "Bad movie"]
})
The following code produces a SciPy CSR matrix in only about 5 seconds:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(use_idf=False, ngram_range=(1, 2), vocabulary=vocab)
vec.transform(corpus["Text"])
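For reference, a quick check of the transform output (a minimal sketch run on the toy corpus above; on the real data the matrix is roughly 20,000 × 50,000):

tfidf = vec.transform(corpus["Text"])
print(tfidf.shape)  # (3, 3) for the toy corpus; about (20000, 50000) for the full data
print(tfidf.nnz)    # number of stored non-zero entries, a small fraction of shape[0] * shape[1]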
However, converting the CSR matrix to a Pandas SparseDataFrame is so slow that I have to abort it:
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]))
dtm["ID"] = corpus["ID"]
Attempted Solutions
I tried appending .tocoo() to vec.transform(corpus["Text"]), but it makes no difference in speed. Appending .toarray() is no good either, since it raises
ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size
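That error is not surprising; a quick back-of-the-envelope calculation (a sketch, assuming float64 values and the full 20,000 × 50,000 shape) puts the dense array at several gigabytes:

# rough size of a fully dense version of the document-term matrix (sketch)
n_docs, n_terms = 20000, 50000
dense_bytes = n_docs * n_terms * 8      # 8 bytes per float64 entry
print(dense_bytes / float(1024 ** 3))   # roughly 7.5 GiB before any other overhead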
I also tried SparseSeries as suggested at stackoverflow.com/q/17818783 but it resulted in a MemoryError:
tfidf = vec.transform(corpus["Text"])
dtm = pd.SparseDataFrame([pd.SparseSeries(tfidf[k].toarray().ravel())
                          for k in range(tfidf.shape[0])])
The MemoryError cannot be worked around by changing the list comprehension to a generator expression, because the latter raises UnboundLocalError: local variable 'mgr' referenced before assignment
I need a SparseDataFrame because I want to join/merge it with another DataFrame on the ID column. Is there a faster way to do this?
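For context, this is the kind of join I am after (a sketch; other is a made-up DataFrame keyed by the same IDs, just to illustrate the intent):

# hypothetical metadata frame keyed by the same IDs (illustration only)
other = pd.DataFrame({"ID": [100, 200, 300], "Label": ["pos", "neg", "neg"]})

# what I ultimately want to do once dtm exists, keeping it sparse:
# dtm.merge(other, on="ID")
# or: dtm.join(other.set_index("ID"), on="ID")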