I have these 2 Spark tables:
simx
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and
simy
y0: num 1.00 2.00 3.00 ...
In both tables, each column has the same number of values. Tables x and y are stored in the handles simX_tbl and simY_tbl, respectively. The actual data is quite large and may reach 40 GB.
I want to calculate the correlation coefficient of each column in simx with simy (something like cor(x0, y0, 'pearson')).
I searched everywhere and I don't think there's a ready-to-use cor function, so I'm thinking about using the correlation formula itself (as mentioned here).
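To make clear which formula I mean, here is a minimal sketch in plain base R (not Spark). It computes the Pearson coefficient from sums only, which is why per-column sum() expressions looked like a reasonable starting point. The function name pearson_from_sums and the simulated vectors are made up for illustration.

```r
# Pearson correlation computed from plain sums, so no centered copy of the data is needed.
# `pearson_from_sums` is a hypothetical helper name, not part of any package.
pearson_from_sums <- function(x, y) {
  n   <- length(x)
  sx  <- sum(x)        # sum of x
  sy  <- sum(y)        # sum of y
  sxx <- sum(x^2)      # sum of squares of x
  syy <- sum(y^2)      # sum of squares of y
  sxy <- sum(x * y)    # sum of cross products
  (n * sxy - sx * sy) /
    (sqrt(n * sxx - sx^2) * sqrt(n * syy - sy^2))
}

# Small simulated example (illustrative data only)
set.seed(1)
x0 <- rnorm(100)
y0 <- 0.5 * x0 + rnorm(100)

# Compare against the built-in implementation
all.equal(pearson_from_sums(x0, y0), cor(x0, y0, method = "pearson"))
```

So for every pair (x column, y0) only the row count and five sums are needed: sum(x), sum(y), sum(x*x), sum(y*y) and sum(x*y).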
Based on a good explanation in my previous question, I think using mutate_all or mutate_each is not very efficient and gives a C stack error for larger data, so I'm considering using invoke instead to call Spark functions directly.
So far I have managed to get this far:
exprs <- as.list(paste0("sum(", colnames(simX_tbl), ")"))
corr_result <- simX_tbl %>%
  spark_dataframe() %>%
  invoke("selectExpr", exprs) %>%
  invoke("toDF", as.list(colnames(simX_tbl))) %>%
  sdf_register("corr_result")
This calculates the sum of each column in simx. But then I realized that I also need to bring the simy table into the calculation, and I don't know how to make the two tables interact (for example, accessing simy while manipulating simx).
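To illustrate what I am aiming for: if y0 were available in the same table as the x columns (which is exactly the part I have not solved), the expression list could be extended to cover every sum the formula needs. This is only a sketch that builds the expression strings in base R; the column names and the aliases (n, s_, ss_, sp_) are hypothetical.

```r
# Hypothetical column names; in my case the x names would come from colnames(simX_tbl).
x_cols <- paste0("x", 0:2)
y_col  <- "y0"

# One Spark SQL aggregate expression per quantity needed by the correlation formula.
# Assumes a single table that contains both the x columns and y0.
exprs <- as.list(c(
  "count(*) AS n",                                           # number of rows
  sprintf("sum(%s) AS s_%s", y_col, y_col),                  # sum(y)
  sprintf("sum(%s * %s) AS ss_%s", y_col, y_col, y_col),     # sum(y*y)
  sprintf("sum(%s) AS s_%s", x_cols, x_cols),                # sum(x) per column
  sprintf("sum(%s * %s) AS ss_%s", x_cols, x_cols, x_cols),  # sum(x*x) per column
  sprintf("sum(%s * %s) AS sp_%s", x_cols, y_col, x_cols)    # sum(x*y) per column
))
```

Such a list could be passed to invoke("selectExpr", exprs) in the same way as above, but only once both tables live in one Spark DataFrame.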
Is there a better way to calculate the correlation? Or, failing that, how can I access another Spark table while working on this one?
My Spark version is 1.6.0
EDIT:
I tried using the combine function from dplyr:
xy_df <- simX_tbl %>%
  as.data.frame %>%
  combine(as.data.frame(simY_tbl)) %>%
  # convert both tables to data frames, then combine.
  # The result is a list, so it needs to be converted to a data frame again
  as.data.frame
xydata <- copy_to(sc, xy_df, "xydata") # copy the data frame into a Spark table
But I'm not sure if this is a good solution because:
- It needs to load the data into a data frame inside R, which I consider impractical for large data
- When trying to head the handle xydata, the column name becomes a concatenation of all values:
xydata %>% head
Source: query [6 x 790]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
c_1_67027262134984_2_44919662134984_1_85728542134984_1_49317262134984_
1 1.670273
2 2.449197
3 1.857285
4 1.493173
5 1.576857
6 -5.672155