What hashing function does Spark use for HashingTF and how do I duplicate it?

Question

Spark MLlib has a HashingTF() function that computes document term frequencies based on a hashed value of each of the terms.

1) What function does it use to do the hashing?

2) How can I achieve the same hashed value from Python?

3) If I want to compute the hashed output for a given single input, without computing the term frequency, how can I do this?

I am not sure if I understand what you mean by "hashed output for a given single input, without computing the term frequency" here. Do you mean something like computing hash for `set(document)`? — zero323, Jul 21 '15 at 13:48
Yes, given a string S, I'd like a quick way to find the hashed(S) value without having to instantiate and use the HashingTF() function in Spark. — gallamine, Jul 21 '15 at 13:49

zero323 · Accepted Answer · 2015-07-21T14:02:12.233

6

If you're in doubt, it is usually good to check the source. The bucket for a given term is determined as follows:

def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures

As you can see, it is just the plain old built-in hash modulo the number of buckets.
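To get the bucket for a single term without instantiating HashingTF, the same logic can be mirrored in plain Python. This is only a minimal sketch, assuming the indexOf shown above; `bucket_index` and `num_features` are illustrative names, and the default of 1 << 20 buckets is an assumption. Note that on Python 3 string hashes are randomized per process unless PYTHONHASHSEED is fixed, so the values are only comparable between processes that share the same hash seed.

```python
def bucket_index(term, num_features=1 << 20):
    """Hypothetical stand-in for HashingTF.indexOf as quoted above."""
    # Built-in hash of the term, reduced modulo the number of buckets.
    # Python's % always returns a non-negative result for a positive modulus.
    return hash(term) % num_features


# Bucket for a single string, no Spark required.
index = bucket_index("a")
```

The result always falls in the range 0 to num_features - 1, which is what lets it serve as a vector index.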

The final output is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):

def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())

If you want to ignore frequencies, you can use set(document) as the input, but I doubt there is much to gain here. To create the set you'll have to compute the hash of each element anyway.
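The counting step can also be sketched without Spark, returning a plain dict of bucket counts instead of a sparse vector. This is an illustration under the assumption that the transform shown above is the whole story; `term_frequencies` and `num_features` are hypothetical names, not part of the Spark API.

```python
def term_frequencies(document, num_features=1 << 20):
    """Count how many terms of `document` fall into each hash bucket."""
    freq = {}
    for term in document:
        i = hash(term) % num_features  # same rule as indexOf above
        freq[i] = freq.get(i, 0) + 1.0
    return freq


doc = ["spark", "hash", "spark"]
counts = term_frequencies(doc)         # repeated terms add up
presence = term_frequencies(set(doc))  # each distinct term counted once
```

Passing set(doc) drops the repeats, but two different terms can still land in the same bucket, so a count above 1.0 is possible even then.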

edited Jul 21 '15 at 14:02

answered Jul 21 '15 at 13:47

zero323


Thanks Zero323. I guess I was under the impression that HashingTF was implemented in Java. Thanks! – gallamine Jul 21 '15 at 13:51
Most of the functions in MLlib are implemented natively and operate on native data structures. For example, `Vectors` are just wrappers for `numpy.ndarray`. – zero323 Jul 21 '15 at 13:56
Interesting answer. I would like to find which term corresponds to a given hash (I'm running TF-IDF and then want to find the most important terms). It returns (hash, tfidf) tuples; any idea how I could get (term, tfidf)? – pltrdy Mar 23 '16 at 17:34
@pltrdy You cannot, or at least not in a general case. If you want a reversible transformation take a look at the count vectorizer. See my answer to http://stackoverflow.com/a/32286619/1560062 – zero323 Mar 23 '16 at 17:41

Josh Baker · Answer 2 · 2015-11-14T00:18:45.247

0

It seems to me that there is something else going on under the hood beyond what the source quoted by zero323 shows. I found that hashing and then taking the modulus, as the source code does, didn't give me the same indices that HashingTF generates. At least for single characters, what I had to do was convert the character to its ASCII code, like so (Python 2.7):

index = ord('a') # 97

This corresponds to what HashingTF outputs for the index. If I did the same thing HashingTF appears to do, which is:

index = hash('a') % 1<<20 # 897504

I would very clearly get the wrong index.

edited Nov 14 '15 at 00:18

answered Nov 13 '15 at 22:20

Josh Baker


[Operator precedence](https://docs.python.org/3/reference/expressions.html#operator-precedence): `assert HashingTF(numFeatures=1 << 20).indexOf("a") == (hash("a") % (1 << 20))`. – user7337271 Dec 28 '16 at 04:12
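The precedence point in the comment above can be checked without Spark. In Python, `%` binds more tightly than `<<`, so the unparenthesized expression takes the remainder modulo 1 first and only then shifts. The snippet below is a small illustration with an arbitrary integer `x` standing in for a hash value (an assumption for the sake of a reproducible example).

```python
x = 12345678  # arbitrary stand-in for hash(term)

# Parsed as (x % 1) << 20: any integer modulo 1 is 0, and 0 shifted is still 0.
unparenthesized = x % 1 << 20

# What the indexOf logic intends: x modulo 2**20 buckets.
parenthesized = x % (1 << 20)
```

With the parentheses in place, the result is a proper bucket index between 0 and 2**20 - 1.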
