I have two large datasets,
val dataA: Dataset[TypeA] and
val dataB: Dataset[TypeB], where both TypeA and TypeB extend Serializable.
I want to join the two datasets on different columns, so that TypeA.ColumnA == TypeB.ColumnB. Spark offers the method joinWith on a Dataset, which I think will do this properly, but it is undocumented and marked as "experimental".
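For reference, this is roughly how I would expect to use joinWith. The case classes, field names, and sample values here are made-up stand-ins for my real TypeA/TypeB:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-ins for my real TypeA and TypeB
case class TypeA(columnA: Long, payloadA: String)
case class TypeB(columnB: Long, payloadB: String)

object JoinWithSketch {
  def run(): Array[(TypeA, TypeB)] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("joinWith-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataA: Dataset[TypeA] = Seq(TypeA(1L, "a1"), TypeA(2L, "a2")).toDS()
    val dataB: Dataset[TypeB] = Seq(TypeB(1L, "b1"), TypeB(3L, "b3")).toDS()

    // joinWith keeps both sides fully typed: the result is Dataset[(TypeA, TypeB)],
    // unlike a plain join, which would give an untyped DataFrame of Rows.
    val joined: Dataset[(TypeA, TypeB)] =
      dataA.joinWith(dataB, dataA("columnA") === dataB("columnB"), "inner")

    val rows = joined.collect()
    spark.stop()
    rows
  }
}
```

The appeal of this approach is that I keep the typed objects on both sides instead of falling back to Row.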
The other approach I have looked at is to convert to PairRDDs instead of Datasets, and join them on a common key (as described in this Stack Overflow post: how to join two datasets by key in scala spark).
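And this is the PairRDD variant I had in mind, again with made-up stand-in types, keying each RDD on its join column and using the PairRDD join:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-ins for my real TypeA and TypeB
case class TypeA(columnA: Long, payloadA: String)
case class TypeB(columnB: Long, payloadB: String)

object PairRddJoinSketch {
  def run(): Array[(TypeA, TypeB)] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pairrdd-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataA = Seq(TypeA(1L, "a1"), TypeA(2L, "a2")).toDS()
    val dataB = Seq(TypeB(1L, "b1"), TypeB(3L, "b3")).toDS()

    // Drop down to RDDs keyed on the join columns, then join as PairRDDs.
    val keyedA = dataA.rdd.keyBy(_.columnA) // RDD[(Long, TypeA)]
    val keyedB = dataB.rdd.keyBy(_.columnB) // RDD[(Long, TypeB)]

    // join matches on the key; .values drops the key, leaving RDD[(TypeA, TypeB)]
    val joined = keyedA.join(keyedB).values

    val rows = joined.collect()
    spark.stop()
    rows
  }
}
```

My hesitation here is that dropping from Dataset to RDD gives up Catalyst/Tungsten optimizations, which is part of why I am asking which approach is preferred.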
Is there a clearly better approach to joining two Datasets, or is using either joinWith or PairRDDs the best way?