I have a Dataset on which I call groupBy("_1","_2","_3","_4").agg(max("_5").as("time"), collect_list("_6").as("value")). This returns the data grouped by the four columns, the max of the time column, and a collect_list holding all the _6 values for that group, e.g. [5, 1]. What I actually want in the value column is only the _6 entry from the row whose _5 equals max("_5") for that group, not every value in the group.
Code below:
val data = Seq(
  ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
  ("thing1", 1, 1, "Temperature", 1551502200000L, "1")
)

import org.apache.spark.sql.functions._
import spark.implicits._

val testDS = spark.sparkContext.parallelize(data).toDS()
testDS.groupBy("_1", "_2", "_3", "_4")
  .agg(max("_5").as("time"), collect_list("_6").as("value"))
  .show()
Output:
 |  _1    |  _2  |  _3  |  _4        |  time          |  value  |
 | thing1 |  1   |  1   | Temperature|  1551502200000 | [5, 1]  |
Required output:
 |  _1    |  _2  |  _3  |  _4        |  time          |  value  |
 | thing1 |  1   |  1   | Temperature|  1551502200000 | 1       |
I don't want 5 in the value column, since it does not meet the max("time") criterion; I only need 1 there, because it comes from the row that matches both the grouped columns and max("time"). How can I achieve this?
Thank you.
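For reference, one approach I have been exploring (a sketch, not verified at scale) is to aggregate a struct whose first field is the time column. Spark compares structs field by field, so max(struct(_5, _6)) keeps the pair from the row with the maximum _5, and the matching _6 can then be pulled back out:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
  ("thing1", 1, 1, "Temperature", 1551502200000L, "1")
)
val testDS = spark.sparkContext.parallelize(data).toDS()

// max over a struct is ordered by its first field (_5 here),
// so the aggregated struct carries the _6 value of the max-time row
testDS
  .groupBy("_1", "_2", "_3", "_4")
  .agg(max(struct($"_5", $"_6")).as("m"))
  .select($"_1", $"_2", $"_3", $"_4",
          $"m._5".as("time"), $"m._6".as("value"))
  .show()
```

An equivalent alternative would be a Window partitioned by the four grouping columns, ordered by _5 descending, keeping only row_number = 1.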