Let's imagine we have the following DataFrame:
port | flag | timestamp
---------------------------------------
20  | S    | 2009-04-24T17:13:14+00:00
30  | R    | 2009-04-24T17:14:14+00:00
32  | S    | 2009-04-24T17:15:14+00:00
21  | R    | 2009-04-24T17:16:14+00:00
54  | R    | 2009-04-24T17:17:14+00:00
24  | R    | 2009-04-24T17:18:14+00:00
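For reference, a minimal sketch that recreates this sample data (column names and the string timestamps are taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; timestamps are ISO-8601 strings.
df = spark.createDataFrame(
    [
        (20, "S", "2009-04-24T17:13:14+00:00"),
        (30, "R", "2009-04-24T17:14:14+00:00"),
        (32, "S", "2009-04-24T17:15:14+00:00"),
        (21, "R", "2009-04-24T17:16:14+00:00"),
        (54, "R", "2009-04-24T17:17:14+00:00"),
        (24, "R", "2009-04-24T17:18:14+00:00"),
    ],
    ["port", "flag", "timestamp"],
)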
I would like to calculate the number of distinct (port, flag) values over a sliding 3-hour window in PySpark.
The result should look something like:
port | flag | timestamp | distinct_port_flag_overs_3h
---------------------------------------
20   | S    | 2009-04-24T17:13:14+00:00 | 1
30   | R    | 2009-04-24T17:14:14+00:00 | 1
32   | S    | 2009-04-24T17:15:14+00:00 | 2
21   | R    | 2009-04-24T17:16:14+00:00 | 2
54   | R    | 2009-04-24T17:17:14+00:00 | 2
24   | R    | 2009-04-24T17:18:14+00:00 | 3
The SQL query would look like:
SELECT
    COUNT(DISTINCT port) OVER my_window AS distinct_port_flag_overs_3h
FROM my_table
WINDOW my_window AS (
    PARTITION BY flag
    ORDER BY CAST(timestamp AS timestamp)
    RANGE BETWEEN INTERVAL 3 HOUR PRECEDING AND CURRENT ROW
)
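Spark does not accept COUNT(DISTINCT ...) over a window frame, so as a starting point here is a sketch of one possible workaround that mirrors the window above: collect the port values of the frame into a set and take its size. This is only a sketch, not a definitive solution; the ts_seconds helper column is an assumption introduced here because rangeBetween needs a numeric ordering column.

from pyspark.sql import Window
from pyspark.sql import functions as F

# rangeBetween needs a numeric ordering column, so convert the timestamp
# string to Unix seconds first (ts_seconds is a helper column of this sketch).
df = df.withColumn("ts_seconds", F.col("timestamp").cast("timestamp").cast("long"))

# Same window as in the SQL: partition by flag, order by time,
# frame = the last 3 hours up to the current row.
w = (
    Window.partitionBy("flag")
    .orderBy("ts_seconds")
    .rangeBetween(-3 * 3600, Window.currentRow)
)

# COUNT(DISTINCT port) is not allowed over a window, so collect the distinct
# ports of the frame and count them instead.
df = df.withColumn(
    "distinct_port_flag_overs_3h",
    F.size(F.collect_set("port").over(w)),
)
df.show(truncate=False)

Since collect_set keeps only distinct values inside the frame, size(collect_set(...)) can stand in for COUNT(DISTINCT) here.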
I found this topic, which solves the problem, but only when counting distinct elements over a single field.
Does anyone have an idea of how to achieve that with:
- Python 3.7
- PySpark 2.4.4