Pyspark substring of one column based on the length of another column

Question

Using Pyspark 2.2

I have a spark DataFrame with multiple columns. I need to input 2 columns to a UDF and return a 3rd column

Input:

+-----+------+
|col_A| col_B|
+-----+------+
|  abc|abcdef|
|  abc|     a|
+-----+------+

Both col_A and col_B are StringType()

Desired output:

+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
|  abc|abcdef|    abc|
|  abc|     a|      a|
+-----+------+-------+

I want new_col to be a substring of col_A with the length of col_B.

I tried

udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()

But it gives the TypeError: Column is not iterable.

Any idea how to do such manipulation?

score 2 · Answer 1 · answered Feb 25 '19 at 03:49

There are two major things wrong here.

First, you defined your udf to take in one input parameter when it should take 2.
Secondly, you can't use the API functions within the udf. (Calling the udf serializes to python so you need to use python syntax and functions.)

Here's a proper udf implementation for this problem:

import pyspark.sql.functions as F

def my_substring(a, b):
    # You should add in your own error checking
    return a[:len(b)]

udf_substring = F.udf(lambda x, y: my_substring(a, b), StringType())

And then call it by passing in the two columns as arguments:

df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))

However, in this case you can do this without a udf using the method described in this post.

df.withColumn(
    'new_col', 
    F.expr("substring(col_A,0,length(col_B))")
)

thanks for this! regarding the second method do we need to add an if else condition for cases when length(col_B) is less than length(col_A) or is it implicitly handled? — Wynn, Feb 25 '19 at 06:00
@Wynn the second method will return the full string in `col_A` if the length of `col_B` is less than the length of `col_A`. No `if/else` required. — pault, Feb 25 '19 at 14:56

Pyspark substring of one column based on the length of another column

1 Answers1