I'm on Spark 2.2.0, running on EMR.
I have a big dataframe df (40G or so in compressed snappy files) which is partitioned by keys k1 and k2.
When I query by `k1 === v1` or `(k1 === v1 && k2 === v2)`, I can see that it only reads the files in the matching partition (about 2% of the files).
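For reference, a rough sketch of what I'm doing (assuming the data is snappy-compressed Parquet at a made-up S3 path, run in spark-shell where `spark` and `$` are in scope):

```scala
// Hypothetical path; the table was written out partitioned by k1 and k2,
// so the directory layout is .../k1=.../k2=.../part-*.snappy.parquet
val df = spark.read.parquet("s3://my-bucket/my-table")

// explain() shows PartitionFilters on k1/k2 in the FileScan node,
// and only the matching directories are listed and read.
df.filter($"k1" === "v1" && $"k2" === "v2").explain()
```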
However, if I cache or persist df, those same queries suddenly hit all the partitions, which either blows up memory or is far less performant.
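The caching step that changes the behaviour, roughly:

```scala
// After caching, the physical plan switches to InMemoryTableScan and,
// as far as I can tell, partition pruning no longer applies: the same
// filter now touches every cached partition.
df.cache()
df.filter($"k1" === "v1" && $"k2" === "v2").explain()
```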
This is a big surprise - is there any way to do caching which preserves the partitioning information?