I am currently building a Spark application and I'd like to log some stats from my intermediary RDDs. All I need is the size of the RDDs at different steps of the transformations.
The transformations are linear, so I wouldn't need to cache anything if it weren't for the calls to .count() (which need to be neither accurate nor synchronous).
I could use countAsync to make the counts asynchronous:
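// An implicit ExecutionContext is required by the Future callbacks below
import scala.concurrent.ExecutionContext.Implicits.global

val rawData = sc.textFile(path)
val (values, malformedRows) = parse(rawData)
// foreach runs only on success and receives the Long count (onComplete would receive a Try[Long])
values.countAsync.foreach(size => logger.info(s"${malformedRows.count} malformed rows while parsing ${path} (${size} successfully parsed)"))
val result = values
  .map(operator)
  .filter(predicate)
result.countAsync.foreach(size => logger.info(s"${size} final rows"))
result.saveAsTextFile(output)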
However, my understanding is that countAsync will trigger a new action (as count does). Since my RDDs will eventually be computed anyway (by the final saveAsTextFile action), I'm wondering whether it is possible to capitalize on that last action to log the stats without creating new actions.
Is there any way to get an approximate, asynchronous count without triggering a new action? In other words, is there a lazy count()?
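To make the idea concrete, here is a minimal sketch of what I mean by a "lazy count", assuming Spark 2.x or later and its built-in LongAccumulator. The object name, the accumulator names, the sample data, and the output path taken from args(0) are all placeholders for illustration, not part of my real job. The counters are incremented as a side effect of the transformations, so only the final saveAsTextFile triggers a job. As far as I understand, accumulator updates made inside transformations can be applied more than once if a task is retried or a stage is recomputed, so the values would be approximate, which is fine for my use case.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyCountSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run on its own.
    val conf = new SparkConf().setAppName("lazy-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // One accumulator per step whose size should be logged.
    val parsedRows = sc.longAccumulator("parsed rows")
    val finalRows = sc.longAccumulator("final rows")

    // Stand-in for the parsed values of the real job.
    val values = sc.parallelize(1 to 1000)

    val result = values
      .map { x => parsedRows.add(1L); x }   // count rows entering the pipeline
      .map(_ * 2)                           // stand-in for operator
      .filter(_ % 3 == 0)                   // stand-in for predicate
      .map { x => finalRows.add(1L); x }    // count rows that survive the filter

    // Nothing has been computed yet: both accumulators are still zero here.
    // The single action below fills them in as a side effect.
    result.saveAsTextFile(args(0))

    // Read the values on the driver once the action has finished.
    println(s"${parsedRows.value} rows parsed, ${finalRows.value} final rows")

    sc.stop()
  }
}
```

The counts only become available after the final action completes, so this would not give progress information while the job is running.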