Apply wordninja.split() using pandas_udf

Question

I have a DataFrame df with a string column sld whose values are words run together with no space or delimiter. One library that can split such strings is wordninja:

E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']

Using pandas_udf, I have:

@pandas_udf(ArrayType(StringType()))
def split_word(x):
   splitted = wordninja.split(x)
   return splitted

However, it throws an error when I apply it on the column sld:

df1=df.withColumn('test', split_word(col('sld')))

TypeError: expected string or bytes-like object

What I tried:

I noticed that there is a similar problem with the well-known split() function, where the workaround is to use string.str, as mentioned here. That workaround doesn't work with wordninja.split.

Is there a workaround for this issue?

Edit: I think, in a nutshell, the issue is that the pandas_udf input is a pd.Series, while wordninja.split expects a string.
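The mismatch described in the edit can be reproduced without Spark. This is only a minimal sketch: re.findall is used as a stand-in for wordninja.split (that wordninja relies on the re module internally is an assumption based on the error message), and batch is a hypothetical example Series.

```python
import re
import pandas as pd

# Hypothetical batch, shaped like what a pandas_udf receives
batch = pd.Series(["hellofriend", "restinpeace"])

# A scalar string function handed the whole Series fails ...
try:
    re.findall(r"[a-z]+", batch)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)

# ... while the same call on a single element works.
print(re.findall(r"[a-z]+", batch.iloc[0]))
```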

My df looks like this:

+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+

I want something like this:

+-------------+---------------------+
|    sld      |         test        |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+

score 1 · Accepted Answer · answered Aug 05 '22 at 14:20


Just use .apply to run the computation on each element of the pandas Series, something like this:

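import pandas as pd
import wordninja
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
   return x.apply(wordninja.split)  # split each string in the batch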
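The element-wise logic can be checked on a plain pandas Series before wiring it into Spark. The following is a rough sketch, not the library's behaviour: toy_split and VOCAB are hypothetical stand-ins for wordninja.split so that it runs without wordninja installed, and split_series is an illustrative helper name. It also passes nulls through as None, on the assumption that a null in the column would otherwise reach the splitter as a non-string value and raise.

```python
import pandas as pd

# Hypothetical stand-in for wordninja.split: greedy longest-match
# against a tiny vocabulary, only so the sketch is self-contained.
VOCAB = {"hello", "friend", "rest", "in", "peace", "this", "that"}

def toy_split(text: str) -> list:
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def split_series(s: pd.Series, splitter=toy_split) -> pd.Series:
    # Apply the scalar splitter to each element; non-string values
    # (such as nulls) become None instead of reaching the splitter.
    return s.apply(lambda v: splitter(v) if isinstance(v, str) else None)

batch = pd.Series(["hellofriend", "restinpeace", "this", None])
result = split_series(batch)
print(result.tolist())
```

Inside the pandas_udf, the body would then be the same .apply call with wordninja.split passed as the splitter.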


Alex Ott


ZygD · Answer 2 · 2022-09-05T12:51:17.337

0

One way is to use a regular udf.

import wordninja
from pyspark.sql import functions as F
df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])

# Declare the return type; without it, udf defaults to StringType
@F.udf(returnType='array<string>')
def split_word(x):
   return wordninja.split(x)

df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# |        sld|             col2|
# +-----------+-----------------+
# |hellofriend|  [hello, friend]|
# |restinpeace|[rest, in, peace]|
# |       this|           [this]|
# |       that|           [that]|
# +-----------+-----------------+

edited Sep 05 '22 at 12:51

answered Aug 05 '22 at 13:38

ZygD


Thanks for your comment. I have tried udf, but it is really slow! It goes row by row :( – Elm662 Aug 05 '22 at 13:44
