I have two DataFrames A and B:
Ahas columns(id, info1, info2)with about 200 Million rowsBonly has the columnidwith 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
if B was very small I know I would something along the lines of
A.filter($("id") isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness and I'm afraid will cause unnecessary shuffles.
What is the optimal method to achieve that task?

