How do I replace a string value with a NULL in PySpark?

Question

I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but it does not accept None as an argument. How can I do this?

zero323 · Answer 1 · 2016-04-27T23:48:36.503

76

You can combine when clause with NULL literal and types casting as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

It doesn't introduce BatchPythonEvaluation and because of that should be significantly more efficient than using an UDF.

edited Apr 27 '16 at 23:48

answered Apr 27 '16 at 18:37

zero323

322,348
103
959
935

7

Excellent, much more efficient for large datasets. – cftarnas Sep 23 '16 at 20:44
@cftarnas It is, Python UDFs have relatively high serde overhead. – zero323 Sep 23 '16 at 21:24
10

@zero323 you don't actually need the `otherwise()` part - `when()` will return `null` by default. – pault Jul 11 '18 at 14:36
Could you please show how to do that with nested fields? – Ébe Isaac Sep 25 '18 at 09:00
@zero323 in otherwise, I tried returning None instead of lit(None) and the result is same. Can you please explain why you used lit? – ravi malhotra Oct 05 '19 at 09:20
@zero323 when can directly mention .otherwise(lit(null )) right ? – BdEngineer Sep 15 '20 at 09:42

score 9 · Accepted Answer · answered Apr 27 '16 at 20:10

9

This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

By using the old name as the first parameter in withColumn, it actually replaces the old name column with the new one generated by the UDF output.

answered Apr 27 '16 at 20:10

Daniel Zolnai

16,487
7
59
71

I didn't think of trying UDFs, that seems to be the way to go – talloaktrees Apr 27 '16 at 21:02
2

Your code will run faster if you use native Spark functions rather than UDFs - see the other answers. (That's why they have more upvotes than the accepted answer) – RobinL Jan 15 '20 at 11:26

score 8 · Answer 3 · answered Jul 08 '20 at 14:46

8

You could also simply use a dict for the first argument of replace. I tried it and this seems to accept None as an argument.

df = df.replace({'empty-value':None}, subset=['NAME'])

Note that your 'empty-value' needs to be hashable.

answered Jul 08 '20 at 14:46

Willem

976
9
24

score 5 · Answer 4 · answered Nov 22 '18 at 19:58

The best alternative is the use of a when combined with a NULL. Example:

from pyspark.sql.functions import when, lit, col

df= df.withColumn('foo', when(col('foo') != 'empty-value',col('foo)))

If you want to replace several values to null you can either use | inside the when condition or the powerfull create_map function.

Important to note is that the worst way to solve it with the use of a UDF. This is so because udfs provide great versatility to your code but come with a huge penalty on performance.

Pradyut Joshi · Answer 5 · 2023-07-05T11:58:45.043

0

This will also work:

df5 = df.replace(str('empty-value'), None)

edited Jul 05 '23 at 11:58

answered Jul 05 '23 at 11:57

Pradyut Joshi

1
1

1

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 09 '23 at 13:20

How do I replace a string value with a NULL in PySpark?

5 Answers5

Linked