Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

Question

I have seen and tried many existing StackOverflow posts regarding this issue but none work. I guess my JAVA heap space is not as large as expected for my large dataset, My dataset contains 6.5M rows. My Linux instance contains 64GB Ram with 4 cores. As per this suggestion I need to fix my code but I think making a dictionary from pyspark dataframe should not be very costly. Please advise me if any other way to compute that.

I just want to make a python dictionary from my pyspark dataframe, this is the content of my pyspark dataframe,

property_sql_df.show() shows,

+--------------+------------+--------------------+--------------------+
|            id|country_code|       name|          hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+

What I want is to make a dictionary with hash_of_cc_pn_li as key and id as a list value.

Expected Output

{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"]
  "d5c301f00e9966483": ["BOND-1742850","BOND-7630290"]
}

What I have tried so far,

%%time
duplicate_property_list = {}
for ind in property_sql_df.collect(): 
     hashed_value = ind.hash_of_cc_pn_li
     property_id = ind.id
     if hashed_value in duplicate_property_list:
         duplicate_property_list[hashed_value].append(property_id) 
     else:
         duplicate_property_list[hashed_value] = [property_id]

What I get now on the console:

java.lang.OutOfMemoryError: Java heap space

and showing this error on Jupyter notebook output

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)

Not sure if it ultimately help, but as a first step you might try to distribute list collection as `df.groupby("hash_of_cc").agg(collect_list("id")).show()`. And then take another hard look and decide whether you really need to collect your full dataframe to the driver. — mazaneicha, Jul 26 '20 at 18:00

score 2 · Answer 1 · answered Jul 26 '20 at 17:50

2

making a dictionary from pyspark dataframe should not be very costly

This is true in terms of runtime, but this will easily take up a lot of space. Especially if you're doing property_sql_df.collect(), at which point you're loading your entire dataframe into driver memory. At 6.5M rows, you'll already hit 65GB if each row has 10KB, or 10K characters, and we haven't even gotten to the dictionary yet.

First, you can collect just the columns you need (e.g. not name). Second, you can do the aggregation upstream in Spark, which will save some space depending on how many ids there are per hash_of_cc_pn_li:

rows = property_sql_df.groupBy("hash_of_cc_pn_li") \
  .agg(collect_set("id").alias("ids")) \
  .collect()

duplicate_property_list = { row.hash_of_cc_pn_li: row.ids for row in rows }

answered Jul 26 '20 at 17:50

Eric Doi

56
3

Thanks for the elaborated answer sir. – A l w a y s S u n n y Jul 26 '20 at 18:13
Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? – A l w a y s S u n n y Jul 26 '20 at 18:17
1

`df.toPandas()` is even worse than `df.collect()` in terms of how much space it's going to take up. What to do instead depends on what you're trying to do after this. If, say, the point is just to have a csv file with these mappings, then you can just go directly from the dataframe (after using `collect_list`/`collect_set`) and save it to a file via `df.coalesce(1).write.option('sep',';').csv(path)`. – Eric Doi Jul 26 '20 at 18:27
No luck same issue occuring – A l w a y s S u n n y Jul 27 '20 at 05:39
followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror – A l w a y s S u n n y Jul 27 '20 at 06:29

score 1 · Answer 2 · answered Jul 26 '20 at 17:54

1

Here's how to make a sample DataFrame with your data:

data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])

Let's aggregate the data in a Spark DataFrame to limit the number of rows that are collected on the driver node. We'll use the two_columns_to_dictionary function defined in quinn to create the dictionary.

agg_df = df.groupBy("hash_of_cc_pn_li").agg(F.max("hash_of_cc_pn_li").alias("hash"), F.collect_list("id").alias("id"))
res = quinn.two_columns_to_dictionary(agg_df, "hash", "id")
print(res) # => {'811f': ['BOND-3211356'], 'd5c3': ['BOND-1742850', 'BOND-7630290'], '90cb': ['BOND-9129450', 'BOND-7175508']}

This might work on a relatively small, 6.5 million row dataset, but it won't work on a huge dataset. "I think making a dictionary from pyspark dataframe should not be very costly" is only true for DataFrames that are really tiny. Making a dictionary from a PySpark DataFrame is actually very expensive.

PySpark is a cluster computing framework that benefits from having data spread out across nodes in a cluster. When you call collect all the data is moved to the driver node and the worker nodes don't help. You'll get an OutOfMemory exception whenever you try to move too much data to the driver node.

It's probably best to avoid the dictionary entirely and figure out a different way to solve the problem. Great question.

answered Jul 26 '20 at 17:54

Powers

18,150
10
103
108

Thanks for the elaborated answer sir. – A l w a y s S u n n y Jul 26 '20 at 18:13
Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? – A l w a y s S u n n y Jul 26 '20 at 18:17
1

@AlwaysSunny - Yep, toPandas also collects all the data on the driver node and should be avoided unless the dataset is tiny. – Powers Jul 26 '20 at 18:41
No luck same issue occuring – A l w a y s S u n n y Jul 27 '20 at 05:39
@AlwaysSunny - Yep, not surprising, that's a common error whenever you try to collect a large dataset. Feel free to ask another question that explains what you're trying to accomplish with the dictionary and I can see if there is another way to solve your problem with another approach. – Powers Jul 27 '20 at 05:43
Ok sir got your point, But with simple 5000 records it is casuing the same error. – A l w a y s S u n n y Jul 27 '20 at 06:07
followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror – A l w a y s S u n n y Jul 27 '20 at 06:29

notNull · Answer 3 · 2020-07-26T18:22:30.773

From Spark-2.4 we can use groupBy,collect_list,map_from_arrays,to_json built in functions for this case.

Example:

df.show()
#+------------+-----------------+
#|          id| hash_of_cc_pn_li|
#+------------+-----------------+
#|BOND-9129450|90cb0946cf4139e12|
#|BOND-7175508|90cb0946cf4139e12|
#|BOND-1742850|d5c301f00e9966483|
#|BOND-7630290|d5c301f00e9966483|
#+------------+-----------------+
df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
selectExpr("to_json(map_from_arrays(array(hash_of_cc_pn_li),array(id))) as output").\
show(10,False)
#+-----------------------------------------------------+
#|output                                               |
#+-----------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"]}|
#|{"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+-----------------------------------------------------+

To get one dict use another agg with collect_list.

df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
agg(to_json(map_from_arrays(collect_list(col("hash_of_cc_pn_li")),collect_list(col("id")))).alias("output")).\
show(10,False)
#+---------------------------------------------------------------------------------------------------------+
#|output                                                                                                   |
#+---------------------------------------------------------------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"],"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+---------------------------------------------------------------------------------------------------------+

Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? — A l w a y s S u n n y, Jul 26 '20 at 18:17
@AlwaysSunny, Try to avoid converting to pandas dataframe and if possible use spark in built functions to get the output expected. pandas collects the data and causes driver failure. https://stackoverflow.com/questions/39811054/pyspark-simple-re-partition-and-topandas-fails-to-finish-on-just-600-000-rows — notNull, Jul 26 '20 at 18:27
followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror — A l w a y s S u n n y, Jul 27 '20 at 06:29

score 1 · Accepted Answer · answered Jul 27 '20 at 09:59

1

Adding accepted answer from linked post for posterity. The answer solves the problem by leveraging write.json method and preventing the collection of too-large dataset to the Driver here:

https://stackoverflow.com/a/63111765/12378881

answered Jul 27 '20 at 09:59

Napoleon Borntoparty

1,870
1
8
28

Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space

4 Answers4

Linked