Sorry for a newbie question.
I have log files that contain the fields userId, event, and timestamp, but no sessionId. My aim is to create a sessionId for each record based on the timestamp and a pre-defined TIMEOUT value: a user's event should start a new session when the gap since that user's previous event exceeds TIMEOUT.
For example, with TIMEOUT = 10 and this sample DataFrame:
scala> eventSequence.show(false)
  +----------+------------+----------+
  |userId    |event       |timestamp |
  +----------+------------+----------+
  |U1        |A           |1         |
  |U2        |B           |2         |
  |U1        |C           |5         |
  |U3        |A           |8         |
  |U1        |D           |20        |
  |U2        |B           |23        |
  +----------+------------+----------+
The desired output is:
  +----------+------------+----------+----------+
  |userId    |event       |timestamp |sessionId |
  +----------+------------+----------+----------+
  |U1        |A           |1         |S1        |
  |U2        |B           |2         |S2        |
  |U1        |C           |5         |S1        |
  |U3        |A           |8         |S3        |
  |U1        |D           |20        |S4        |
  |U2        |B           |23        |S5        |
  +----------+------------+----------+----------+
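To pin down the rule I have in mind, here is a minimal sketch in plain Scala collections (not Spark, so it does not solve my actual problem). The names `Event`, `assignSessions`, and `Timeout` are made up for illustration, and I am assuming a new session starts only when the gap is strictly greater than TIMEOUT, with session numbers handed out in order of each session's first event.

```scala
// Plain Scala collections only; this illustrates the rule, not a Spark solution.
case class Event(userId: String, event: String, timestamp: Long)

// Assumed boundary: a gap strictly greater than Timeout starts a new session.
val Timeout = 10L

def assignSessions(events: Seq[Event], timeout: Long): Seq[(Event, String)] = {
  // Per user: timestamp of the previous event and the session it belonged to.
  val last = scala.collection.mutable.Map.empty[String, (Long, String)]
  var counter = 0
  events.sortBy(_.timestamp).map { e =>
    val sessionId = last.get(e.userId) match {
      // Same user, gap within the timeout: stay in the current session.
      case Some((prevTs, prevId)) if e.timestamp - prevTs <= timeout => prevId
      // First event for this user, or gap too large: open a new session.
      case _ =>
        counter += 1
        s"S$counter"
    }
    last(e.userId) = (e.timestamp, sessionId)
    e -> sessionId
  }
}

val sample = Seq(
  Event("U1", "A", 1), Event("U2", "B", 2), Event("U1", "C", 5),
  Event("U3", "A", 8), Event("U1", "D", 20), Event("U2", "B", 23)
)

assignSessions(sample, Timeout).foreach { case (e, s) =>
  println(s"${e.userId} ${e.event} ${e.timestamp} $s")
}
```

This keeps mutable state and walks the rows one by one, which is exactly the part I do not know how to express over a distributed DataFrame.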
I found a solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), but I have not been able to work out how to do the same in Spark.
Thanks for any suggestions on this problem.