I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is the way to do this in Spark/Python?
With the row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window

# Number rows 1, 2, ... within each some_column group, ordered by some_other_column
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
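For a self-contained illustration, here is a minimal runnable sketch; the sample rows are made up, and the column names simply mirror the snippet above. row_number starts at 1 within each group:

from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

# Made-up sample data using the column names from the snippet above
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 30)],
    ["some_column", "some_other_column"],
)

w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w)).show()
# rn restarts at 1 for each some_column group, e.g.:
# +-----------+-----------------+---+
# |some_column|some_other_column| rn|
# +-----------+-----------------+---+
# |          a|               10|  1|
# |          a|               20|  2|
# |          b|               30|  3|
# +-----------+-----------------+---+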
You can achieve this at the RDD level by doing:
rdd = sc.parallelize(['a', 'b', 'c'])
# zipWithIndex pairs each element with its 0-based index
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
The result:
+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+
If you only need a unique ID, not a truly consecutive index, you can also use
zipWithUniqueId(), which is more efficient since it is computed locally on each partition.
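For reference, a short sketch of zipWithUniqueId() on the same made-up data. Unlike zipWithIndex(), it does not trigger a Spark job, but the IDs are unique rather than consecutive, because each partition generates its own arithmetic sequence:

# With n partitions, the element at position i of partition p gets ID i*n + p,
# so IDs are unique but may have gaps (here with 2 partitions: a -> 0, b -> 1, c -> 3)
rdd = sc.parallelize(['a', 'b', 'c'], 2)
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()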
