I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to use invoke with selectExpr to subtract each column's mean from that column, something like:
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))
and the same should apply to all the other columns.
But when I run it, I get this error:
Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I understand this happens because, under standard SQL rules, a plain column (y0) cannot appear alongside an aggregate function (mean) without a GROUP BY clause. How do I add the GROUP BY when using the invoke method?
Previously, I managed to achieve this another way:
- Calculate the mean of each column with summarize_all
- Collect the means into R
- Apply these means using invoke and selectExpr
as explained in this answer, but now I'm trying to speed up the execution a bit by keeping all operations inside Spark itself, without retrieving anything into R.
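For reference, the workaround looks roughly like the sketch below. It is only an approximation of what I ran: the `_centered` suffix and the registered table name `xy_centered` are placeholders chosen for illustration, and it assumes the column names need no quoting inside a SQL expression.

```r
library(sparklyr)
library(dplyr)

# Step 1 and 2: compute the mean of every column in Spark,
# then collect the single-row result into R.
col_means <- xy_df %>%
  summarize_all(mean) %>%
  collect()

# Build one SQL expression per column, e.g. "(x0 - 2.5) AS x0_centered".
# The "_centered" suffix is a placeholder name, not anything Spark requires.
exprs <- sprintf(
  "(%s - %s) AS %s_centered",
  names(col_means),
  unlist(col_means),
  names(col_means)
)

# Step 3: apply the collected means inside Spark with selectExpr.
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", as.list(exprs)) %>%
  sdf_register("xy_centered")  # placeholder table name
```

The round trip through `collect()` is the part I would like to avoid.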
My Spark version is 1.6.0.