PySpark vs sklearn TFIDF

Question

I'm new to PySpark. I was playing around with the tfidf. Just wanted to check if they are giving same results. But they are not same. Here's what I did.

# create the PySpark dataframe
sentenceData = sqlContext.createDataFrame((
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

# tokenize
tokenizer = Tokenizer().setInputCol("sentence").setOutputCol("words")
wordsData = tokenizer.transform(sentenceData)

# vectorize
vectorizer = CountVectorizer(inputCol='words', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)

# calculate scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)

# dense the current response variable
def to_dense(in_vec):
    return DenseVector(in_vec.toArray())
to_dense_udf = udf(lambda x: to_dense(x), VectorUDT())

# create dense vector
wordsData = wordsData.withColumn("tfidf_features_dense", to_dense_udf('tfidf_features'))

I converted the PySpark df to pandas

wordsData_pandas = wordsData.toPandas()

and, then just calculating using sklearn's tfidf as following

def dummy_fun(doc):
    return doc

# create sklearn tfidf
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)  

# transform and get idf scores
feature_matrix = tfidf.fit_transform(wordsData_pandas.words)

# create sklearn dtm matrix
sklearn_tfifdf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())

# create PySpark dtm matrix
spark_tfidf = pd.DataFrame([np.array(i) for i in wordsData_pandas.tfidf_features_dense], columns=vectorizer.vocabulary)

But unfortunately, I'm getting this for PySpark

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>i</th>      <th>are</th>      <th>logistic</th>      <th>case</th>      <th>spark</th>      <th>hi</th>      <th>about</th>      <th>neat</th>      <th>could</th>      <th>regression</th>      <th>wish</th>      <th>use</th>      <th>heard</th>      <th>classes</th>      <th>java</th>      <th>models</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0.287682</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>    </tr>    <tr>      <th>1</th>      <td>0.287682</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>    </tr>    <tr>      <th>2</th>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>    </tr>  </tbody></table>

and this for sklearn,

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>i</th>      <th>are</th>      <th>logistic</th>      <th>case</th>      <th>spark</th>      <th>hi</th>      <th>about</th>      <th>neat</th>      <th>could</th>      <th>regression</th>      <th>wish</th>      <th>use</th>      <th>heard</th>      <th>classes</th>      <th>java</th>      <th>models</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0.355432</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.467351</td>      <td>0.467351</td>      <td>0.467351</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.467351</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>    </tr>    <tr>      <th>1</th>      <td>0.296520</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.389888</td>      <td>0.000000</td>    </tr>    <tr>      <th>2</th>      <td>0.000000</td>      <td>0.447214</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.447214</td>    </tr>  </tbody></table>

I did try out the use_idf, smooth_idf paramters. But none seem to make both same. What am I missing? Any help is appreciated. Thanks in advance.

HakunaMaData · Accepted Answer · 2018-12-29T07:12:57.553

11

That's because the IDFs are calculated a little differently between the two.

From sklearn's documentation:

Compare to pyspark's documentation:

Besides the addition of the 1 in the IDF the sklearn TF-IDF uses the l2 norm which pyspark doesn't

TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

edited Dec 29 '18 at 07:12

answered Dec 28 '18 at 21:27

HakunaMaData

1,281
12
26

@HakunaMaData, we can get the same result with sklearn, if we set norm to None in Sklearn. – Venkatachalam Jan 06 '19 at 13:30

Venkatachalam · Answer 2 · 2019-01-06T13:29:25.850

Both Python and Pyspark implementation of tfidf scores are the same. Refer the same Sklearn document but on following line,

The key difference between them is that Sklearn uses l2 norm by default, which is not the case with Pyspark. If we set the norm to None, we will get the same result in sklearn as well.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

corpus = ["I heard about Spark","I wish Java could use case classes","Logistic regression models are neat"]
corpus = [sent.lower().split() for sent in corpus]

def dummy_fun(doc):
    return doc

tfidfVectorizer=TfidfVectorizer(norm=None,analyzer='word',
                                tokenizer=dummy_fun,preprocessor=dummy_fun,token_pattern=None)

tf=tfidfVectorizer.fit_transform(corpus)
tf_df=pd.DataFrame(tf.toarray(),columns= tfidfVectorizer.get_feature_names())
tf_df

Refer to my answer here to understand how norm works with tf-idf vectorizer.

PySpark vs sklearn TFIDF

2 Answers2

Linked