My question is not a duplicate of 'Joining Spark Dataframes with "isin" operator'. My question is about "not in", not "is in"; it is different!
I have two Datasets:
userProfileDataset: Dataset[UserProfile]
jobModelsDataset: Dataset[JobModel]
Case class UserProfile is defined as
case class UserProfile(userId: Int, visitedJobIds: Array[Int])
and case class JobModel is defined as
case class JobModel(JobId: Int, Model: Map[String, Double])
I have also made two objects (UserProfileFieldNames and JobModelFieldNames) that contain the field names of these two case classes.
My objective is, for each user in userProfileDataset, to find the JobModel.JobIds that are NOT contained in that user's UserProfile.visitedJobIds.
How can I do this?
I've thought about using a crossJoin and then filtering, as sketched below. It may work, but is there a more direct or efficient way?
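For reference, the crossJoin-then-filter version I have in mind looks roughly like this (just a sketch, assuming Spark 2.1+ where Dataset.crossJoin is available; array_contains is the array analogue of contains):

import org.apache.spark.sql.functions.{array_contains, col}

// Pair every user with every job, then keep only the jobs
// that do not appear in that user's visitedJobIds array.
val notVisited = userProfileDataset
  .crossJoin(jobModelsDataset)
  .filter(!array_contains(col("visitedJobIds"), col("JobId")))

My worry is that this materializes the full Cartesian product of users and jobs before filtering.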
I have tried the following approaches, but none of them worked:
val result = userProfileDataset.joinWith(
  jobModelsDataset,
  !userProfileDataset.col(UserProfileFieldNames.visitedJobIds)
    .contains(jobModelsDataset.col(JobModelFieldNames.jobId)),
  "left_outer"
)
It leads to:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'contains(_1.visitedJobIds, CAST(_2.JobId AS STRING))' due to data type mismatch: argument 1 requires string type, however, '_1.visitedJobIds' is of array type.;;
Could it be because the contains method can only be used to test whether one string contains another string?
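If so, the array analogue seems to be functions.array_contains, and I would guess the condition should be expressed like this (a sketch; array_contains takes Any for the value argument and, as far as I can tell, passes a Column through unchanged, but I have not verified this on my Spark version):

import org.apache.spark.sql.functions.array_contains

// Keep the job rows whose JobId is NOT in the user's visitedJobIds array
val result = userProfileDataset.joinWith(
  jobModelsDataset,
  !array_contains(
    userProfileDataset.col(UserProfileFieldNames.visitedJobIds),
    jobModelsDataset.col(JobModelFieldNames.jobId)
  ),
  "left_outer"
)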
The following condition also didn't work:
!jobModelsDataset.col(JobModelFieldNames.jobId)
.isin(userProfileDataset.col(UserProfileFieldNames.visitedJobIds))
It leads to:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '(_2.JobId IN (_1.visitedJobIds))' due to data type mismatch: Arguments must be same type but were: IntegerType != ArrayType(IntegerType,false);; 'Join LeftOuter, NOT _2#74.JobId IN (_1#73.visitedJobIds)
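So isin apparently expects a sequence of literal values, not a single array-typed column. An alternative I am considering is to explode visitedJobIds into one row per visited job and use an anti-join instead (a sketch; the column names follow my case classes, and collect_list regroups the result to one row per user):

import org.apache.spark.sql.functions.{col, collect_list, explode}

// One row per (userId, visited JobId)
val visitedPairs = userProfileDataset
  .select(col("userId"), explode(col("visitedJobIds")).as("JobId"))

// All (userId, JobId) pairs, minus the visited ones
val unvisitedPairs = userProfileDataset.select(col("userId"))
  .crossJoin(jobModelsDataset.select(col("JobId")))
  .join(visitedPairs, Seq("userId", "JobId"), "left_anti")

// Regroup to one row per user
val result = unvisitedPairs
  .groupBy("userId")
  .agg(collect_list("JobId").as("unvisitedJobIds"))

Would this be expected to perform better than the joinWith variant, or do both boil down to the same nested-loop join?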