I am working with a tbl_spark in sparklyr.
I have a spark Dataframe with two list-type columns, and I would like to output two things:
- The intersection of both lists (as a list)
- The number of elements in the intersection
My input data looks something like the following (using the mtcars dataset) where "sc" is my spark connection:
library(dplyr)      
library(sparklyr)
## Load mtcars into spark with connection "sc"
mtcars_spark <- copy_to(sc, mtcars)
## Wrangle mtcars to get list columns using ft_regex_tokenizer()
tbl_with_lists <- mtcars_spark %>%
  mutate(mpg_rounded = round(mpg, -1)) %>%
  group_by(mpg_rounded) %>%
    summarize(
      cyl_all = paste(collect_set(as.character(cyl)), sep = ", "),
      gear_all = paste(collect_set(as.character(gear)), sep = ", ")
    ) %>%
  ungroup() %>%
  ft_regex_tokenizer("cyl_all", "cyl_list", pattern = "[,]\\s*") %>%
  ft_regex_tokenizer("gear_all", "gear_list", pattern = "[,]\\s*")
tbl_with_lists
## # Source: spark<?> [?? x 5]
##   mpg_rounded cyl_all       gear_all      cyl_list   gear_list 
##         <dbl> <chr>         <chr>         <list>     <list>    
## 1          10 8.0           3.0           <list [1]> <list [1]>
## 2          30 4.0           5.0, 4.0      <list [1]> <list [2]>
## 3          20 8.0, 6.0, 4.0 5.0, 3.0, 4.0 <list [3]> <list [3]>
I haven't had much success with finding out how to do this. Any ideas?
