I have a PySpark DataFrame df with a string column sld whose values are words run together with no space or delimiter. One library that can split such strings is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it on the column sld:
df1 = df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed a similar problem with the well-known split() function, where the workaround is to go through the Series .str accessor, as mentioned here. That workaround doesn't apply to wordninja.split.
Is there a workaround for this issue?
Edit: I think, in a nutshell, the issue is that the pandas_udf input is a pd.Series, while wordninja.split expects a single string.
My df looks like this:
+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+
I want something like this:
+-------------+---------------------+
|sld          |test                 |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+