14

I want to add a new column to a DataFrame: a UUID generator.

The UUID value will look something like 21534cf7-cff9-482a-a3a8-9e7244240da7.

My Research:

I've tried the withColumn method in Spark:

val DF2 = DF1.withColumn("newcolname", DF1("existingcolname") + 1)

So DF2 will have an additional column, newcolname, holding the existing column's value plus 1 in every row.

But my requirement is a new column that generates a UUID for each row.

lospejos
Sri

3 Answers

22

You can use the built-in Spark SQL uuid function (available since Spark 2.3):

.withColumn("uuid", expr("uuid()"))

A full example in Scala:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CreateDf extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("spark_local")
    .getOrCreate()
  import spark.implicits._

  Seq(1, 2, 3).toDF("col1")
    .withColumn("uuid", expr("uuid()"))
    .show(false)

}

Output:

+----+------------------------------------+
|col1|uuid                                |
+----+------------------------------------+
|1   |24181c68-51b7-42ea-a9fd-f88dcfa10062|
|2   |7cd21b25-017e-4567-bdd3-f33b001ee497|
|3   |1df7cfa8-af8a-4421-834f-5359dc3ae417|
+----+------------------------------------+
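
Note that uuid() is non-deterministic: if the DataFrame is recomputed by a later action, the generated values can change (see the comments on the answer below). A minimal sketch of one way to pin the values down, assuming the IDs must stay stable downstream, is to persist and materialize the result once (checkpointing is a stronger option if executors may be lost):

  // Sketch: persist so later actions reuse the generated UUIDs
  // instead of recomputing them.
  val withIds = Seq(1, 2, 3).toDF("col1")
    .withColumn("uuid", expr("uuid()"))
    .persist()

  withIds.count() // materialize once; subsequent actions see the same UUIDs
  withIds.show(false)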
MaxNevermind
17

You should try something like this:

import java.util.UUID

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._

// UDF that generates a fresh random UUID string for each row
val generateUUID = udf(() => UUID.randomUUID().toString)
val df1 = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val df2 = df1.withColumn("UUID", generateUUID())

df1.show()
df2.show()

Output will be:

+---+-----+
| id|value|
+---+-----+
|id1|    1|
|id2|    4|
|id3|    5|
+---+-----+

+---+-----+--------------------+
| id|value|                UUID|
+---+-----+--------------------+
|id1|    1|f0cfd0e2-fbbe-40f...|
|id2|    4|ec8db8b9-70db-46f...|
|id3|    5|e0e91292-1d90-45a...|
+---+-----+--------------------+
Paweł Jurczenko
I see two problems with this solution. The first is that UUID.randomUUID isn't guaranteed to be unique across a cluster: it uses pseudo-random numbers, which is fine on a single machine, but in a cluster you could get collisions. Second, the docs for user-defined functions say that UDFs must be deterministic, and by definition UUIDs are not. https://github.com/apache/spark/blob/51b1c1551d3a7147403b9e821fcc7c8f57b4824c/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala#L38-L40 – irbull Jun 13 '18 at 23:40
    For more info about how non-determinism can cause problems, see https://stackoverflow.com/questions/42960920/spark-dataframe-random-uuid-changes-after-every-transformation-action – Mark Rajcok Jul 28 '18 at 20:51
UDFs are bad for your Spark code on various levels: for one, Catalyst can't do anything with them because they're not transparent to the optimization engine. – Marco Jul 07 '21 at 09:18
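
The determinism concern raised in the comments can be addressed on Spark 2.3+ by explicitly marking the UDF as non-deterministic, so Catalyst won't assume repeated evaluations return the same value. A minimal sketch building on the answer above:

  // Mark the UDF non-deterministic so the optimizer will not collapse,
  // reorder, or re-evaluate it as if it were a pure function.
  val generateUUID = udf(() => UUID.randomUUID().toString).asNondeterministic()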
-3

This is how we did it in Java: we had a date column and wanted to add another column with the month.

import static org.apache.spark.sql.functions.*;

Dataset<Row> newData = data.withColumn("month", month(unix_timestamp(col("date"), "MM/dd/yyyy").cast("timestamp")));

You can use a similar technique to add any column:

// Note: lit(UUID.randomUUID().toString()) would be evaluated once on the driver,
// so every row would get the same UUID; the built-in uuid() function generates one per row.
Dataset<Row> newData1 = newData.withColumn("uuid", expr("uuid()"));

Cheers!

Sachin Thapa