How to assign unique ids to entries in a column using PySpark?

Question

For example, if I have a dataframe with a name column, where each name can occur multiple times:

+------+
| name |
+------+
|Alice |
|Bob   |
|Alice |
|Chloe |
|Chloe |
+------+

I want to have a column where each name gets a unique id starting from 0:

+------+----+
| name | id |
+------+----+
|Alice | 0  |
|Bob   | 1  |
|Alice | 0  |
|Chloe | 2  |
|Chloe | 2  |
+------+----+

How do I achieve this using PySpark? One possible way is to create a data frame with a column of distinct names and assign index such as using row_number. But that involves joining back to the original table. So I wonder if there's a direct way to achieve this.

This should help! https://stackoverflow.com/questions/52318016/pyspark-add-sequential-and-deterministic-index-to-dataframe — greenie, Sep 13 '22 at 21:22
@greenie that post is about adding id for each row, what I want is unique id for each name — Algorithman, Sep 13 '22 at 21:33
ok :) @algorithman. Check out StringIndexer. I have added a reply. — greenie, Sep 13 '22 at 21:41

score 1 · Answer 1 · answered Sep 13 '22 at 21:40

One easy method is to use StringIndexer - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html

from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col
df = spark.createDataFrame([("Alice", 1), ("Bravo", 2), ("Charlie", 3), ("Alice", 4), ("Bravo", 5)], ["name", "id"])

string_indexer = StringIndexer(inputCol='name', outputCol='index').setHandleInvalid("keep")
model = string_indexer.fit(df)
df_index = model.transform(df).select('id', 'name', col('index').cast('int'))
df_index.show()   
+---+-------+-----+
| id|   name|index|
+---+-------+-----+
|  1|  Alice|    0|
|  2|  Bravo|    1|
|  3|Charlie|    2|
|  4|  Alice|    0|
|  5|  Bravo|    1|

+---+-------+-----+

score 1 · Accepted Answer · answered Sep 13 '22 at 22:01

1

Use window functions. Logic and code below

   new =(df.withColumn('order', row_number().over(Window.partitionBy(lit('1')).orderBy(lit('1'))))#Create increasing id
          .withColumn('id', dense_rank().over(Window.partitionBy().orderBy('name'))-1).orderBy('order')#Use window function; dense_ranl to generate new id
          .drop('order')
         ).show()

+-------+---+
|   name| id|
+-------+---+
|  Alice|  0|
|  Bravo|  1|
|Charlie|  2|
|  Alice|  0|
|  Bravo|  1|
+-------+---+

answered Sep 13 '22 at 22:01

wwnde

26,119
6
18
32

thanks this works. Just curious, will it be possible to assign id without ordering? based on the occurrence of name, e.g. if Bravo shows before Alice then bravo will get id 0 and Alice 1 – Algorithman Sep 14 '22 at 01:56

How to assign unique ids to entries in a column using PySpark?

2 Answers2