I have a DataFrame df1:

id     transactions
1      [1, 3, 3, 3, 2, 5]
2      [1, 2]

root
 |-- id: integer (nullable = true)
 |-- transactions: array (nullable = false)
 |    |-- element: string (containsNull = true)
and a DataFrame df2:

items          cost
[1, 3, 3, 5]   2
[1, 5]         1

root
 |-- items: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- cost: integer (nullable = true)
I have to check whether the items are contained in the transactions (with duplicates counted) and, if so, sum up the costs. For example, [1,3,3,5] in [1,3,3,3,5] is True, while [1,3,3,5] in [1,2] is False, and so on.
The result should be:

id     transactions      score
1      [1, 3, 3, 3, 5]   3
2      [1, 2]            null
I tried explode-and-join (inner, left_semi) approaches, but they all fail because of the duplicates. As discussed in Check all the elements of an array present in another array pyspark, issubset() and array_intersect() won't work either.

I came across Python - verifying if one list is a subset of the other and found that the following solves the problem efficiently:
>>> from collections import Counter
>>> not Counter([1,3,3,3,5]) - Counter([1,3,3,4,5])
False
>>> not Counter([1,3,3,3,5]) - Counter([1,3,3,5])
False
>>> not Counter([1,3,3,5]) - Counter([1,3,3,3,5])
True
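For clarity, the Counter trick can be wrapped in a plain Python helper (a minimal sketch; the name is_multiset_subset is mine):

```python
from collections import Counter

def is_multiset_subset(needle, haystack):
    # Counter subtraction keeps only positive counts, so the difference
    # is empty exactly when every element of `needle` occurs in
    # `haystack` at least as many times as it does in `needle`.
    return not (Counter(needle) - Counter(haystack))

print(is_multiset_subset([1, 3, 3, 5], [1, 3, 3, 3, 5]))  # True
print(is_multiset_subset([1, 3, 3, 3, 5], [1, 3, 3, 5]))  # False
```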
I tried the following:

from collections import Counter
from pyspark.sql.functions import col, sum as sum_, udf, when

@udf("boolean")
def contains_all(x, y):
    if x is not None and y is not None:
        return not (lambda y: dict(Counter(y))) - (lambda x: dict(Counter(x)))

(df1
 .crossJoin(df2).groupBy("id", "transactions")
 .agg(sum_(when(
     contains_all("transactions", "items"), col("cost")
 )).alias("score"))
 .show())

but it throws an error:

File "", line 39, in contains_all
TypeError: unsupported operand type(s) for -: 'function' and 'function'
Is there any other way to achieve this?
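For reference, the TypeError comes from subtracting the two lambda objects themselves instead of calling Counter on the column values. A hedged sketch of a corrected UDF body (plain Python here; in Spark it would be decorated with @udf("boolean") and used in the same crossJoin/groupBy pipeline shown above):

```python
from collections import Counter

def contains_all(x, y):
    # x: the transactions array, y: the items array.
    # Call Counter directly on the values rather than building lambdas;
    # the multiset difference is empty iff every item in y (with its
    # multiplicity) also appears in x.
    if x is not None and y is not None:
        return not (Counter(y) - Counter(x))
    return False

# In Spark this would be registered along the lines of
#   contains_all_udf = udf(contains_all, "boolean")
# and plugged into the crossJoin/groupBy/agg above.
print(contains_all([1, 3, 3, 3, 5], [1, 3, 3, 5]))  # True
print(contains_all([1, 2], [1, 3, 3, 5]))           # False
```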