
I have a tricky use case in my application.

In the first part, I process data logs coming from Kafka to build connection sessions (Spark Streaming). Those sessions are checkpointed and kept in memory (mapWithState()) for further real-time processing, i.e. the second part of the application.

The problem comes with a batch branch I want to add to the application. Its purpose is the same as that of the first part, i.e. building sessions, but this time from log files in a directory.

AIM: I want to suspend the real-time processing automatically as soon as a file arrives in the batch directory, process that file, feed the newly calculated sessions to the second part of the application, and then resume the real-time processing where it left off.

Do you have any idea how to do this?


EDIT: Basic application scheme.

The red circle points out where the problem lies. As the image shows, I have two inputs: one from Kafka for real-time processing and another from a specific directory for batch processing. Part 2, i.e. Process Sessions, is shared between the two processing methods, because the data of each session has to be kept in memory.

The batch branch acts as a corrective loop-back, so it must be processed before any real-time session. This is the reason for the mutual exclusion that rakesh asks about.
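
The intended ordering can be sketched in plain Python. This is not Spark code and does not show how to achieve the behaviour with mapWithState(); it only models it: at each processing step, any pending batch file is applied to the shared session state first, while real-time records are buffered and processed afterwards. All names (SessionPipeline, on_file, on_realtime, tick) are hypothetical, and records are assumed to be (session_id, event) pairs.

```python
from collections import deque


class SessionPipeline:
    """Toy model of the desired control flow (hypothetical, not Spark)."""

    def __init__(self):
        self.sessions = {}               # shared in-memory session state (part 2)
        self.pending_files = deque()     # batch files waiting in the directory
        self.realtime_buffer = deque()   # real-time records not yet processed

    def on_file(self, records):
        # A new file appeared in the batch directory.
        self.pending_files.append(list(records))

    def on_realtime(self, record):
        # A new record arrived from the real-time source; buffer it.
        self.realtime_buffer.append(record)

    def _update(self, record, source):
        # Assumed record shape: (session_id, event).
        session_id, event = record
        self.sessions.setdefault(session_id, []).append((source, event))

    def tick(self):
        """Run one step: drain batch files first, then buffered real-time records."""
        while self.pending_files:
            for record in self.pending_files.popleft():
                self._update(record, "batch")
        while self.realtime_buffer:
            self._update(self.realtime_buffer.popleft(), "realtime")


pipeline = SessionPipeline()
pipeline.on_realtime(("s1", "login"))
pipeline.on_file([("s1", "corrected-login")])   # arrives after the first record
pipeline.on_realtime(("s1", "click"))
pipeline.tick()
print(pipeline.sessions["s1"])
# [('batch', 'corrected-login'), ('realtime', 'login'), ('realtime', 'click')]
```

Even though the file arrives after the first real-time record, its sessions reach the shared state first, and the buffered real-time records are applied afterwards in arrival order.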

A.Duval
  • Your question is not clear! Do you mean that 1) you have two sources of input, one from Kafka and another from a directory? 2) Why do you have to make the two mutually exclusive? – rakesh Jul 28 '16 at 16:33
  • There is no "pause button" in streaming, if this is what you are asking. You can kill the streaming job and then recover from the checkpoint, but this seems like a nuclear option. If the "batch" data comes periodically, why not treat it as just another stream? http://stackoverflow.com/q/37447393/1560062 – zero323 Aug 02 '16 at 13:44
