I have the following Spark DataFrame:
agent_product_sale <- data.frame(
  agent       = c('a','b','c','d','d','c','a','b','c','a','b'),
  product     = c('P1','P1','P3','P4','P1','P1','P2','P2','P2','P4','P3'),
  sale_amount = c(1000,1000,3000,4000,1000,1000,2000,2000,2000,4000,3000))
RDD_aps <- createDataFrame(agent_product_sale)  # SparkR 2.x; in 1.x: createDataFrame(sqlContext, agent_product_sale)
   agent product sale_amount
1      a      P1        1000
2      b      P1        1000
3      c      P3        3000
4      d      P4        4000
5      d      P1        1000
6      c      P1        1000
7      a      P2        2000
8      b      P2        2000
9      c      P2        2000
10     a      P4        4000
11     b      P3        3000
I need to group the Spark DataFrame by agent and, for each agent, find the product with the highest sale_amount:
      agent  most_expensive
      a           P4        
      b           P3                
      c           P3        
      d           P4        
I used the following code, but it returns the maximum sale_amount for each agent rather than the corresponding product:
schema <- structType(structField("agent", "string"),
                     structField("max_sale_amount", "double"))
result <- gapply(
  RDD_aps,
  c("agent"),
  function(key, x) {
    # returns only the maximum amount per agent, not the product behind it
    data.frame(key, max(x$sale_amount), stringsAsFactors = FALSE)
  },
  schema)
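
What I think I need instead is to select, within each group, the row whose sale_amount is largest and return its product. A minimal sketch of that idea, assuming gapply hands each group to the function as a local R data.frame (the schema2 name is mine):

schema2 <- structType(structField("agent", "string"),
                      structField("most_expensive", "string"))
result <- gapply(
  RDD_aps,
  c("agent"),
  function(key, x) {
    # which.max gives the index of the largest sale_amount in this group,
    # so we return the product at that index instead of the amount itself
    data.frame(key, x$product[which.max(x$sale_amount)],
               stringsAsFactors = FALSE)
  },
  schema2)
head(result)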