For example, I'm using Kafka as a source with Structured Streaming.
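For reference, here is a minimal sketch of the kind of setup I mean (the broker list and topic name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  .getOrCreate()

// Read from Kafka as a streaming source; "mytopic" and the brokers are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "mytopic")
  .load()
```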
Update: As per the documentation, one cannot set the auto.offset.reset parameter (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations). Is there a specific reason why Structured Streaming does not honor an auto.offset.reset value of none? I would hope that could prevent the data loss by default.
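To clarify what I mean by none: with a plain Kafka consumer, auto.offset.reset = none makes the consumer fail fast when no committed offset exists for the group, instead of silently jumping to latest. A rough sketch (the group id and brokers are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "host1:9092")
props.put("group.id", "my-group") // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
// "none" means: throw an exception if there is no committed offset for this group,
// rather than defaulting to earliest or latest.
props.put("auto.offset.reset", "none")

val consumer = new KafkaConsumer[String, String](props)
```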
Where, then, does Structured Streaming store the consumed offsets?
Since the default value of startingOffsets is latest, what happens if I have to deploy new code or restart the application after a failure? In that case, if the value is latest, there will be data loss.
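For example, I could set startingOffsets explicitly on the source, roughly like this (continuing from the sketch above; earliest is just one possible value):

```scala
// startingOffsets only applies when the query starts without an existing checkpoint;
// the default is "latest".
val dfFromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")
  .load()
```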
Do I still have to set up something like checkpointing and write-ahead logs to handle failures?
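In other words, is a checkpointLocation on the sink, roughly like the sketch below, enough (the paths are placeholders)?

```scala
// Continuing from the readStream sketch above.
val query = df.writeStream
  .format("parquet")
  .option("path", "/data/output") // placeholder output path
  .option("checkpointLocation", "/data/checkpoints/my-query") // placeholder checkpoint dir
  .start()
```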
As per this link, even with checkpointing and write-ahead logs there can still be problems with recovery semantics: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query
Can we deploy new changes with the same checkpoint location? When we tried this with the old Direct approach, it did not allow us to deploy new changes with the same checkpoint location.