Is it possible to join two (Pair)RDDs (or Datasets/DataFrames) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates, or various string-distance measures such as Levenshtein?
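To make the criteria concrete, here is a sketch (plain Java, no Spark; the field names and thresholds are just examples I made up) of the kind of match function I have in mind:

```java
import java.util.function.BiPredicate;

public class FuzzyMatch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Example criterion: names within edit distance 2 AND amounts within +/-10.
    // Both the threshold values and the fields are hypothetical.
    static boolean matches(String name1, double amount1,
                           String name2, double amount2) {
        return levenshtein(name1, name2) <= 2
            && Math.abs(amount1 - amount2) <= 10.0;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));          // 3
        System.out.println(matches("Jonathan", 100.0, "Jonathon", 105.0)); // true
    }
}
```

The question is essentially: how do I get a predicate like `matches` applied as the join condition between two RDDs/Datasets, without resorting to a full Cartesian product?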
For "grouping" within an RDD to get a PairRDD, one can implement a PairFunction, but nothing similar seems to be possible when joining two RDDs/Datasets. I am thinking of something like:
rdd1.join(rdd2, myCustomJoinFunction);
I was thinking about implementing the custom logic in hashCode() and equals() but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into RDD.cogroup() but have not figured out how I could use it to implement this.
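To illustrate what I mean by making "similar" data wind up in the same bucket, here is the rough idea I have been toying with (plain Java sketch, names made up): map each value to a coarse bucket key, emit each record under its own bucket and the neighbouring ones, and then do a normal equi-join on the key, applying the fuzzy predicate within each bucket:

```java
import java.util.Arrays;
import java.util.List;

public class Bucketing {

    // Width of a numeric bucket; chosen so that values within 10 of each
    // other can only land in the same or an adjacent bucket.
    static final double WIDTH = 10.0;

    // Coarse bucket key for a numeric value.
    static long bucket(double value) {
        return (long) Math.floor(value / WIDTH);
    }

    // Emit the record's own bucket plus both neighbours, so that any pair
    // of values within WIDTH of each other shares at least one key.
    static List<Long> candidateBuckets(double value) {
        long b = bucket(value);
        return Arrays.asList(b - 1, b, b + 1);
    }

    public static void main(String[] args) {
        System.out.println(bucket(104.0));           // 10
        System.out.println(candidateBuckets(104.0)); // [9, 10, 11]
    }
}
```

In a real Spark job I imagine this would mean flatMapping each record to `(bucketKey, record)` pairs on both sides, then joining (or cogrouping) on the key and filtering with the fuzzy predicate, with some de-duplication for pairs that share more than one bucket. But I am not sure whether this is the idiomatic way, which is why I am asking.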
I just came across elasticsearch-hadoop. Does anyone know if that library could be used to do something like this?
I am using Apache Spark 2.0.0. I am implementing in Java but an answer in Scala would also be very helpful.
PS: This is my first Stack Overflow question, so bear with me if I have made some newbie mistake :).