I know that a DataFrame in PySpark is split into partitions, and when I apply a function (udf) to a column, the different partitions apply the same function in parallel.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
# Collect the 'label' column into a driver-side list.
data = df.rdd.map(lambda x: x['label']).collect()

def ad(x):
    # Pops the next label off the driver-side list, ignoring x.
    return data.pop(0).lower()

AD = F.udf(ad, StringType())

df.withColumn('station', AD('label')).select('station').rdd.flatMap(lambda x: x).collect()
Here is the output:
['a', 'a', 'a', 'a']
which should be:
['a', 'b', 'a', 'b']
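For reference, the result I expect depends only on each row's own label, so a stateless per-row function reproduces it. A plain-Python sketch of that behaviour (in Spark itself, the built-in column function F.lower('label') would do the same without a udf):

```python
labels = ['A', 'B', 'A', 'B']  # the 'label' column values

def station(label):
    # Depends only on its argument, so it is safe to evaluate
    # on any partition, in any order.
    return label.lower()

print([station(lbl) for lbl in labels])  # ['a', 'b', 'a', 'b']
```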
And the strangest thing is that data itself didn't change at all after the udf called data.pop(0).
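A plain-Python sketch of what I think happens, with no Spark involved: the udf's closure (including data) is serialized on the driver and deserialized separately for each partition, so every partition pops from its own fresh copy and the driver's list is never touched. run_partition is a hypothetical stand-in for Spark's task execution; Spark's real machinery uses cloudpickle rather than pickle.

```python
import pickle

data = ['A', 'B', 'A', 'B']

def run_partition(rows):
    # Each task deserializes its own private copy of the closure state.
    local_data = pickle.loads(pickle.dumps(data))
    return [local_data.pop(0).lower() for _ in rows]

# Four partitions of one row each: every copy pops its own first element.
results = [r for part in [[1], [2], [3], [4]] for r in run_partition(part)]
print(results)  # ['a', 'a', 'a', 'a']
print(data)     # ['A', 'B', 'A', 'B'] -- the driver copy is untouched
```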
