Spark job error: GC overhead limit exceeded

Question

I am running a Spark job with the following configuration set in spark-defaults.sh. I made these changes on the name node. I have one data node, and I am working with about 2 GB of data.

spark.master                     spark://master:7077
spark.executor.memory            5g
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

But I am getting an error saying "GC overhead limit exceeded".

Here is the code I am working on.

import sys
from operator import add

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)


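# Read the two delimiters from fixed positions in the record.
def findDelimiter(text):
    sD = text[1]  # delimiter used to split the line into segments
    eD = text[2]  # delimiter used to split a segment into fields
    return (eD, sD)

# Emit (senderID, 1) for one input line.
def tokenize(text):
    eD, sD = findDelimiter(text)
    # Only the first segment is needed, so split at most once.
    firstSegment = text.split(sD, 1)[0]
    seg = firstSegment.split(eD)
    senderID = seg[6].strip()
    yield (senderID, 1)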


conf = SparkConf()
sc = SparkContext(conf=conf)

textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")

rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(add)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")

I also tried groupByKey instead of reduceByKey, but I get the same error. When I remove the reduceByKey or groupByKey step, I do get output. Can someone help me with this error?

Should I also increase the memory available for GC in Hadoop? As I said earlier, I have set spark.driver.memory to 5g, but only on the name node. Should I do that on the data node as well?
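
One way to rule out the parsing logic is to run it on a sample line without Spark. Note that with the delimiter positions as posted (text[1] and text[2]), the first segment can be at most one character long, so seg[6] would raise an IndexError; presumably the real positions differ. The sketch below is a hypothetical variant of tokenize, not the asker's code: the function name sender_id, the seg_pos and elem_pos parameters, and the sample line are all assumptions made for illustration.

```python
def sender_id(line, seg_pos, elem_pos, field=6):
    """Return field number `field` of the first segment, or None if absent."""
    if len(line) <= max(seg_pos, elem_pos):
        return None  # line too short to contain both delimiter positions
    seg_delim = line[seg_pos]
    elem_delim = line[elem_pos]
    first_segment = line.split(seg_delim, 1)[0]
    fields = first_segment.split(elem_delim)
    if len(fields) <= field:
        return None  # malformed or truncated record
    return fields[field].strip()


# Hypothetical record layout: '|' separates fields, ';' separates segments.
sample = "HDR|f1|f2|f3|f4|f5|SENDER01|f7;BODY|x"

print(sender_id(sample, seg_pos=sample.index(";"), elem_pos=3))
print(sender_id(sample, seg_pos=1, elem_pos=2))  # positions as posted
```

Returning None instead of raising lets a flatMap wrapper skip malformed lines, which keeps a single bad record from failing the whole task.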

What is the size of the data and the number of nodes in the cluster? – Sachin Janani, Jun 22 '16 at 05:26
I assume you have more than 10 GB of RAM on your node, since you are assigning 5 GB to the driver and 5 GB to the executor. Can you try setting spark.driver.memory to something like 2 GB? – Sachin Janani, Jun 22 '16 at 12:31
And if I do that, should I do it on the datanode as well? All of the above configuration was done on the namenode alone. – Baradwaj Aryasomayajula, Jun 22 '16 at 12:34
Also, as per your new edit, by namenode you mean the Spark master and by datanode the executor, right? – Sachin Janani, Jun 22 '16 at 12:34
If you have, say, only 6 GB of RAM, then 5 GB will be allocated to the driver and the executor will have only 1 GB left, which will cause this exception. – Sachin Janani, Jun 22 '16 at 12:36
@SachinJanani Okay, got it. And yes, the master is the name node and the executor is the datanode. But I had already tried the datanode with 1 GB and it did not work. That is the reason I increased it to 5g. – Baradwaj Aryasomayajula, Jun 22 '16 at 12:37
You should give more memory to the executor than to the driver, since the processing is done by the executor. So the executor should have, say, 4 GB while the driver can have, say, 2 GB. – Sachin Janani, Jun 22 '16 at 12:38
@SachinJanani By the way, I am not facing the error when I don't run the reduceByKey or groupByKey. – Baradwaj Aryasomayajula, Jun 22 '16 at 12:39
Yes, this is because groupByKey and reduceByKey involve shuffling. – Sachin Janani, Jun 22 '16 at 12:41
I think you have very little free memory on your machine, not even 2 GB; that's why it's failing. – Sachin Janani, Jun 22 '16 at 12:42
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115305/discussion-between-sachin-janani-and-baradwaj-aryasomayajula). – Sachin Janani, Jun 22 '16 at 12:42
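
To illustrate the shuffling point from the comments: both operations move data between partitions, but reduceByKey first combines values per key inside each partition, while groupByKey sends every pair as-is. The sketch below simulates that difference in plain Python; it is not Spark code, and the two sample partitions and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample: two partitions of (senderID, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("c", 1), ("c", 1)],
]


def records_shuffled_group_by_key(parts):
    # groupByKey-style: every pair crosses the shuffle unchanged.
    return sum(len(part) for part in parts)


def combine_locally(part):
    # reduceByKey-style map-side combine: sum values per key in one partition.
    local = defaultdict(int)
    for key, value in part:
        local[key] += value
    return dict(local)


def records_shuffled_reduce_by_key(parts):
    # After the local combine, at most one record per key per partition remains.
    return sum(len(combine_locally(part)) for part in parts)


def final_counts(parts):
    # Merge the per-partition partial sums, as the reduce side would.
    totals = defaultdict(int)
    for part in parts:
        for key, value in combine_locally(part).items():
            totals[key] += value
    return dict(totals)


print(records_shuffled_group_by_key(partitions))
print(records_shuffled_reduce_by_key(partitions))
print(final_counts(partitions))
```

Either way a shuffle still takes place, which is consistent with the error appearing for both operations and disappearing when the step is removed.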

score 2 · Answer 1 · answered Jun 22 '16 at 05:49

Try adding the settings below to your spark-defaults.sh:

spark.driver.extraJavaOptions -XX:+UseG1GC

spark.executor.extraJavaOptions -XX:+UseG1GC

Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!

ErhWen Kuo

Tried this but no luck... – Baradwaj Aryasomayajula Jun 22 '16 at 12:23
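
One detail worth checking when trying this: the question already sets spark.executor.extraJavaOptions, and a configuration key holds a single value, so the G1 flag would need to be merged into the existing line rather than added as a second line for the same key. A minimal sketch of that merge, assuming the options are simply space-separated; the helper name spark_defaults_line is hypothetical:

```python
def spark_defaults_line(key, *option_groups):
    """Render one 'key value' line whose value joins all JVM option groups."""
    value = " ".join(group.strip() for group in option_groups if group.strip())
    return "{:<32} {}".format(key, value)


# Options already present in the question's configuration.
existing = '-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"'

line = spark_defaults_line(
    "spark.executor.extraJavaOptions", "-XX:+UseG1GC", existing
)
print(line)
```

Keeping -XX:+PrintGCDetails in the merged value also means the executor logs continue to show GC activity, which helps judge whether the collector change made any difference.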

score 0 · Answer 2 · answered Jun 22 '16 at 15:21

The code you have should have worked with your configuration. As suggested earlier, try using G1GC. Also try reducing the storage memory fraction. By default it is 60%; try reducing it to 40% or less. You can set it by adding spark.storage.memoryFraction 0.4
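
As a rough illustration of what that change means for the 5g executor in the question, the arithmetic below compares the two fractions. It is a simplified sketch that ignores Spark's internal safety margins and JVM overhead, and the function name is hypothetical. Also note that spark.storage.memoryFraction belongs to the legacy memory manager: from Spark 1.6 on it is only honored when spark.memory.useLegacyMode is enabled, so whether it has any effect depends on the Spark version, which the question does not state.

```python
def approx_storage_budget_gb(executor_heap_gb, storage_fraction):
    """Approximate heap share reserved for cached data (legacy memory model)."""
    return executor_heap_gb * storage_fraction


executor_heap_gb = 5  # spark.executor.memory from the question

for fraction in (0.6, 0.4):
    storage = approx_storage_budget_gb(executor_heap_gb, fraction)
    remaining = executor_heap_gb - storage
    print("fraction={:.1f}: storage ~{:.1f} GB, everything else ~{:.1f} GB".format(
        fraction, storage, remaining))
```

Since the job in the question never caches an RDD, lowering the storage share mainly frees heap for shuffle and task execution.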

score 0 · Answer 3 · answered Jun 27 '16 at 12:31

I was able to solve the problem. I was running Hadoop as the root user on the master node, but I had configured Hadoop under a different user on the data nodes. Once I configured it under the root user on the data node as well and increased the executor and driver memory, it worked fine.
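
A quick way to spot this kind of mismatch is to print the host name and the effective user on each machine. The sketch below uses only the Python standard library; the function name host_and_user is hypothetical, and the Spark usage in the trailing comment assumes a live SparkContext named sc, as in the question.

```python
import getpass
import os
import socket


def host_and_user(_=None):
    """Return (hostname, user name) for the process that calls it."""
    try:
        user = getpass.getuser()
    except Exception:
        # Fall back to the numeric uid if the name cannot be resolved.
        getuid = getattr(os, "getuid", None)
        user = str(getuid()) if getuid else "unknown"
    return (socket.gethostname(), user)


print(host_and_user())

# Assumption: with a running SparkContext `sc`, something like
#   sc.parallelize(range(100), 10).map(host_and_user).distinct().collect()
# would list the (host, user) pairs seen by the executor processes.
```

Running the function on the driver and comparing it with the pairs reported by the executors shows whether both sides run under the same account.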
