I want to do something like this:
df.replace('empty-value', None, 'NAME')
Basically, I want to replace some value with NULL, but it does not accept None as an argument. How can I do this?
I want to do something like this:
df.replace('empty-value', None, 'NAME')
Basically, I want to replace some value with NULL, but it does not accept None as an argument. How can I do this?
You can combine when clause with NULL literal and types casting as follows:
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])
def replace(column, value):
return when(column != value, column).otherwise(lit(None))
df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## | x| y|
## +---+----+
## | 1| foo|
## | 2|null|
## +---+----+
It doesn't introduce BatchPythonEvaluation and because of that should be significantly more efficient than using an UDF.
This will replace empty-value with None in your name column:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()
Output:
[Row(key=1, name=None), Row(key=2, name=u'something else')]
By using the old name as the first parameter in withColumn, it actually replaces the old name column with the new one generated by the UDF output.
You could also simply use a dict for the first argument of replace. I tried it and this seems to accept None as an argument.
df = df.replace({'empty-value':None}, subset=['NAME'])
Note that your 'empty-value' needs to be hashable.
The best alternative is the use of a when combined with a NULL. Example:
from pyspark.sql.functions import when, lit, col
df= df.withColumn('foo', when(col('foo') != 'empty-value',col('foo)))
If you want to replace several values to null you can either use | inside the when condition or the powerfull create_map function.
Important to note is that the worst way to solve it with the use of a UDF. This is so because udfs provide great versatility to your code but come with a huge penalty on performance.
This will also work:
df5 = df.replace(str('empty-value'), None)