Trying to create a new column with a PySpark UDF, but the values are null!
Create the DF
data_list = [['a', [1, 2, 3]], ['b', [4, 5, 6]],['c', [2, 4, 6, 8]],['d', [4, 1]],['e', [1,2]]]
all_cols = ['COL1','COL2']
df = sqlContext.createDataFrame(data_list, all_cols)
df.show()
+----+------------+
|COL1|        COL2|
+----+------------+
|   a|   [1, 2, 3]|
|   b|   [4, 5, 6]|
|   c|[2, 4, 6, 8]|
|   d|      [4, 1]|
|   e|      [1, 2]|
+----+------------+
df.printSchema()
root
 |-- COL1: string (nullable = true)
 |-- COL2: array (nullable = true)
 |    |-- element: long (containsNull = true)
Create a function
def cr_pair(idx_src, idx_dest):
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest
lst1 = [1,2,3]
lst2 = [1,2,3]
cr_pair(lst1, lst2)
([1, 2, 3], [2, 3, 1])
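One thing worth checking, shown here as a plain-Python sketch with no Spark involved: the call above passes two separate lists, whereas the UDF defined in the next step calls cr_pair(x, x) with the same list twice. Because cr_pair rotates its second argument in place, both returned elements would then refer to the same rotated list. The names separate, shared and aliased are only illustrative.

```python
def cr_pair(idx_src, idx_dest):
    # Same function as above: rotates idx_dest in place.
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest

# Two distinct lists: only the second one is rotated.
separate = cr_pair([1, 2, 3], [1, 2, 3])

# One list passed twice, as a call like cr_pair(x, x) does:
# both parameters name the same object, so both results are rotated.
shared = [1, 2, 3]
aliased = cr_pair(shared, shared)

print(separate)  # ([1, 2, 3], [2, 3, 1])
print(aliased)   # ([2, 3, 1], [2, 3, 1])
```

If the original order has to survive in the first element, the rotation would need to work on a copy rather than on the input list.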
Create and register a UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType
get_idx_pairs = udf(lambda x: cr_pair(x, x), ArrayType(IntegerType()))
Add a new column to the DF
df = df.select('COL1', 'COL2', get_idx_pairs('COL2').alias('COL3'))
df.printSchema()
root
 |-- COL1: string (nullable = true)
 |-- COL2: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- COL3: array (nullable = true)
 |    |-- element: integer (containsNull = true)
df.show()
+----+------------+------------+
|COL1|        COL2|        COL3|
+----+------------+------------+
|   a|   [1, 2, 3]|[null, null]|
|   b|   [4, 5, 6]|[null, null]|
|   c|[2, 4, 6, 8]|[null, null]|
|   d|      [4, 1]|[null, null]|
|   e|      [1, 2]|[null, null]|
+----+------------+------------+
Here is where the problem is: every value in the COL3 column is null. The intended outcome is:
+----+------------+----------------------------+
|COL1|        COL2|                        COL3|
+----+------------+----------------------------+
|   a|   [1, 2, 3]|[[1, 2, 3], [2, 3, 1]]      |
|   b|   [4, 5, 6]|[[4, 5, 6], [5, 6, 4]]      |
|   c|[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
|   d|      [4, 1]|[[4, 1], [1, 4]]            |
|   e|      [1, 2]|[[1, 2], [2, 1]]            |
+----+------------+----------------------------+
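A possible explanation, offered as a sketch rather than a confirmed diagnosis: the UDF is declared as ArrayType(IntegerType()), but cr_pair returns a pair of lists, so each of the two elements is a list where Spark expects an integer. As far as I understand, a Python UDF yields null for values that do not match the declared type, which would account for [null, null]. The snippet below is plain Python so it can be run without Spark; rotate_pair is a hypothetical replacement for cr_pair that builds the pair without mutating its input, and the commented-out lines show how it might be registered with a nested array type (not executed here).

```python
def rotate_pair(values):
    """Return [original, rotated-left-by-one] without touching the input.

    Hypothetical helper, not part of the original code.
    """
    if values is None:
        # COL2 is nullable, so pass nulls through unchanged.
        return None
    original = list(values)                 # independent copy
    rotated = original[1:] + original[:1]   # new list, no in-place pop/append
    return [original, rotated]

rows = [[1, 2, 3], [4, 5, 6], [2, 4, 6, 8], [4, 1], [1, 2]]
pairs = [rotate_pair(r) for r in rows]
for p in pairs:
    print(p)

# Assumed Spark registration (untested in this sketch): the return type is
# an array of arrays of longs, matching COL2's element type.
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, LongType
# get_idx_pairs = udf(rotate_pair, ArrayType(ArrayType(LongType())))
# df = df.select('COL1', 'COL2', get_idx_pairs('COL2').alias('COL3'))
```

The two changes are independent: the nested return type addresses the nulls, while copying the list keeps the first element of each pair in its original order.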
 
     
    