I am running a spark job and I am setting the following configurations in the spark-defaults.sh. I have the following changes in the name node. I have 1 data node. And I am working on data of 2GB.
spark.master                     spark://master:7077
spark.executor.memory            5g
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
But I am getting an error saying GC limit exceeded.
Here is the code I am working on.
import os
import sys
import unicodedata
from operator import add 
try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)
# delimeter function
def findDelimiter(text):
    sD = text[1] 
    eD = text[2] 
    return (eD, sD) 
def tokenize(text):
    sD = findDelimiter(text)[1]
    eD = findDelimiter(text)[0]
    arrText = text.split(sD)
    text = ""
    seg = arrText[0].split(eD)
    arrText=""
    senderID = seg[6].strip()
    yield (senderID, 1)
conf = SparkConf()
sc = SparkContext(conf=conf)
textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")
rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a,b: a+b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")
I even tried groupByKey instead of also. But I am getting the same error. But when I tried removing the reduceByKey or groupByKey I am getting outputs. Can some one help me with this error.
Should I also increase the size of GC in hadoop. And as I said earlier I have set driver.memory to 5gb, I did it in the name node. Should I do that in data node as well?
 
     
    