I have two jobs that do exactly the same thing.
One is written in Hive and the other in Spark. The only difference in their output is a column containing a hashed string: calling hash() in Hive and in Spark produces different values for the same input.
I understand that the two engines use different hash implementations, but I was wondering whether (and how) Spark could be configured to produce the same results as Hive.
Is it possible to identify the hashing function each engine uses (e.g. Murmur3) and then use the same one in both?
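For reference, from what I can tell Spark's built-in hash() is a 32-bit Murmur3 hash (with a fixed seed), while Hive's hash() on strings follows the Java String.hashCode() polynomial applied to the UTF-8 bytes. A quick way to see the mismatch (the local SparkSession and the literal 'hello' are just for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hash, lit}

val spark = SparkSession.builder.master("local[*]").appName("hash-check").getOrCreate()

// Java's "hello".hashCode is 99162322; Hive's hash('hello') returns the same value.
println("java hashCode: " + "hello".hashCode)   // 99162322

// Spark's hash() gives a different (Murmur3-based) value for the same string.
spark.range(1).select(hash(lit("hello"))).show()
```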
Or perhaps it's possible to create a Spark UDF that produces the same result as Hive's hash() function?
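From what I've read, something like the following sketch might work for string columns (hiveHash and the sample data are my own names, and the 31-times-hash-plus-byte assumption should be verified against actual Hive output before relying on it):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[*]").appName("hive-hash-udf").getOrCreate()
import spark.implicits._

// Sketch of a UDF mimicking Hive's hash() on strings: the Java
// String.hashCode() polynomial (h = 31*h + b) over the UTF-8 bytes (signed),
// with hash(NULL) = 0 as in Hive. Note this hashes bytes, not UTF-16 chars,
// so for non-ASCII input it follows Hive rather than java.lang.String.hashCode.
val hiveHash = udf { s: String =>
  if (s == null) 0
  else s.getBytes("UTF-8").foldLeft(0)((h, b) => 31 * h + b)
}

val df = Seq("hello", "world").toDF("col")
df.select($"col", hiveHash($"col").as("hive_hash")).show()
// hive_hash for "hello" should be 99162322, matching Hive's hash('hello').
```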