I am streaming a lot of data (200k+ events per 3-second batch) from Kafka using the `KafkaUtils` PySpark implementation.
I receive live data with:

- a `sessionID`
- an `ip`
- a `state`
What I am doing for now with a basic Spark/Redis implementation is the following:
Spark job:

- aggregate the data by `sessionID`: `rdd_combined = rdd.map(map_data).combineByKey(lambda x: frozenset([x]), lambda s, v: s | frozenset([v]), lambda s1, s2: s1 | s2)`
- build a set of the different `state` values (which can be 1, 2, 3...)
- keep the `ip` information so it can later be turned into a lon/lat
- check if the `sessionID` is already in Redis; if yes, update it, otherwise write it to Redis (a simplified sketch of the whole job follows this list)
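To make the flow concrete, here is a simplified sketch of that job. The topic name, broker address, JSON payload and Redis key layout are only illustrative, not my exact setup:

```python
import json
import time

import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="session-aggregation")
ssc = StreamingContext(sc, 3)  # 3-second batches, as in the current setup

# Kafka messages arrive as (key, value) pairs; the value is assumed to be a
# JSON string holding sessionID, ip and state.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

def map_data(message):
    data = json.loads(message[1])
    return data["sessionID"], (data["state"], data["ip"])

def write_partition(partition):
    # one Redis connection per partition instead of one per record
    r = redis.StrictRedis(host="localhost", port=6379)
    for session_id, values in partition:
        states = {state for state, _ in values}
        ip = next(iter(values))[1]
        key = "session:{}".format(session_id)
        if r.exists(key):
            # known session: just merge the new states into the existing set
            r.sadd(key + ":states", *states)
        else:
            # new session: store states, ip and a first-seen timestamp
            r.sadd(key + ":states", *states)
            r.hset(key, "ip", ip)
            r.hset(key, "ts", time.time())

def process(rdd):
    combined = (rdd.map(map_data)
                   .combineByKey(lambda x: frozenset([x]),
                                 lambda s, v: s | frozenset([v]),
                                 lambda s1, s2: s1 | s2))
    combined.foreachPartition(write_partition)

stream.foreachRDD(process)
ssc.start()
ssc.awaitTermination()
```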
Then I run a small Python script, against Redis only, that checks whether there is a 1 in the `state` set (rough sketch after this list):

- if yes, the event is published to a channel (say `channel_1`) and deleted from Redis;
- if not, we check/update a timestamp; if `NOW() - timestamp > 10 min`, the data is published to `channel_2`, otherwise we do nothing.
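Roughly like this, again assuming the illustrative key layout from the job above (a `session:<id>` hash plus a `session:<id>:states` set):

```python
import time

import redis

TEN_MINUTES = 600  # seconds

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

for states_key in r.scan_iter("session:*:states"):
    session_key = states_key[:-len(":states")]
    states = r.smembers(states_key)

    if "1" in states:
        # state 1 was seen: publish the session and drop it from Redis
        r.publish("channel_1", session_key)
        r.delete(states_key, session_key)
    else:
        # no state 1: fall back to the timestamp check
        ts = float(r.hget(session_key, "ts"))
        if time.time() - ts > TEN_MINUTES:
            r.publish("channel_2", session_key)
```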
Question:

I keep wondering what the best implementation would be to do most of this work in Spark.
- using `window` + an aggregation, or `reduceByKeyAndWindow`: my fear is that if I use a window of 10 minutes and redo the computation every 3 seconds over almost the same data, it is not very efficient.
- using `updateStateByKey` seems interesting, but the data is never deleted, which could become problematic. Also, how could I check that we are past the 10 minutes? (A very rough sketch of what I imagine follows this list.)
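For illustration only, this is roughly what I imagine for `updateStateByKey`, reusing `map_data` and `stream` from the first sketch. If I understand the API correctly, returning `None` from the update function drops the key from the state, but I am not sure this is the right way to handle the publishing or the 10-minute check:

```python
import time

TEN_MINUTES = 600  # seconds

def update_session(new_values, session_state):
    # session_state is (first_seen, states, ip), or None for a new sessionID
    now = time.time()
    if session_state is None:
        session_state = (now, frozenset(), None)
    first_seen, states, ip = session_state

    for state, event_ip in new_values:
        states = states | frozenset([state])
        ip = event_ip

    if 1 in states:
        # would still have to publish to channel_1 somewhere, then drop the key
        return None
    if now - first_seen > TEN_MINUTES:
        # same for channel_2, then drop the key
        return None
    return (first_seen, states, ip)

# updateStateByKey requires checkpointing, e.g. ssc.checkpoint("/tmp/checkpoints")
sessions = stream.map(map_data).updateStateByKey(update_session)
```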
Any thoughts on this implementation, or on other approaches that could be possible?