I'm new to Spark, and I found the documentation says Spark will load data into memory to make iterative algorithms faster.
But what if I have a 10 GB log file and only 2 GB of memory? Will Spark still load the whole log file into memory?
I think this question is answered well in the FAQ section of the Spark website (https://spark.apache.org/faq.html).
The key here is that RDDs are split into partitions (see how at the end of this answer), and each partition is a set of elements (text lines or integers, for instance). Partitions are used to parallelize computation across different computational units.
So the key question is not whether a file is too big, but whether a partition is. As the FAQ states: "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data." The issue of large partitions generating OOM errors is addressed here.
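As a minimal sketch (the file path is an assumption), an aggregation over the whole 10 GB file can run with far less memory, because the shuffle behind reduceByKey spills to disk when its buffers fill up:

```scala
// Hypothetical 10 GB log file; sc is the SparkContext from spark-shell.
val counts = sc.textFile("hdfs:///logs/app.log")
  .map(line => (line.split(" ")(0), 1)) // key each line by its first field
  .reduceByKey(_ + _)                   // the shuffle spills to disk if needed
counts.take(10).foreach(println)
```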
Now, even if a partition fits in memory, memory may already be full. In that case, Spark evicts another partition from memory to make room for the new one. Eviction can mean either writing the partition to disk or simply discarding it, depending on its persistence level:
Memory management is well explained here: "Spark stores partitions in LRU cache in memory. When cache hits its limit in size, it evicts the entry (i.e. partition) from it. When the partition has “disk” attribute (i.e. your persistence level allows storing partition on disk), it would be written to HDD and the memory consumed by it would be freed, unless you would request it. When you request it, it would be read into the memory, and if there won’t be enough memory some other, older entries from the cache would be evicted. If your partition does not have “disk” attribute, eviction would simply mean destroying the cache entry without writing it to HDD".
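To make that concrete, here is a small sketch (the path is an assumption) of choosing a persistence level that gives partitions the "disk" attribute described above:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/app.log") // hypothetical path

// MEMORY_AND_DISK: evicted partitions are written to disk and read back
// later, instead of being recomputed from lineage.
logs.persist(StorageLevel.MEMORY_AND_DISK)

// The default cache() is MEMORY_ONLY: eviction destroys the cache entry,
// and the partition is recomputed from lineage if it is needed again.
// logs.cache()
```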
How the initial file/data is partitioned depends on the format and type of the data, as well as on the function used to create the RDD (see this). For instance, sc.textFile on an HDFS file creates roughly one partition per HDFS block by default, while sc.parallelize splits a collection according to spark.default.parallelism.
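As a concrete sketch (the path and a 128 MB block size are assumptions):

```scala
// A 10 GB file in 128 MB HDFS blocks yields roughly 80 partitions,
// each of which fits comfortably in 2 GB of executor memory.
val logs = sc.textFile("hdfs:///logs/app.log")
println(logs.getNumPartitions)

// minPartitions is a lower bound; use it to request a finer split.
val finer = sc.textFile("hdfs:///logs/app.log", minPartitions = 200)
println(finer.getNumPartitions)
```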
Finally, I suggest reading this for more information and for help deciding how to choose the number of partitions (too many or too few?).
It will not load the full 10 GB, since you don't have enough memory available. From my experience, what happens depends on how you use your data:
If you are trying to cache the 10 GB: with the default MEMORY_ONLY storage level, partitions that don't fit are simply not cached and are recomputed from their lineage when needed again; with MEMORY_AND_DISK, the overflow partitions are written to local disk instead.
If you are just processing the data: Spark works through it partition by partition, so the whole file never has to be resident in memory at once, and shuffle data that exceeds memory is spilled to disk.
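For instance, a plain pass over the file with no caching (a sketch, path assumed) never needs the whole 10 GB in memory at once:

```scala
// Partitions stream through memory one at a time; nothing is cached.
val errorCount = sc.textFile("hdfs:///logs/app.log")
  .filter(_.contains("ERROR"))
  .count()
println(errorCount)
```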
Of course, this depends heavily on your code and the transformations you are applying.