
I want to write a MapReduce Java program where I need to create a UUID for a set of data in a CSV/TXT file. The data is customer data with a set of rows and columns. The input CSV is located in an HDFS directory.

I just need to generate UUIDs using MapReduce. I have an input file with columns a, b, and c and 5 rows. I need a column d containing a UUID for each of the 5 rows, i.e. 5 different UUIDs.

How can I go about it?

Here is the code for Mapper class:

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes KeyValueTextInputFormat, so both the input key and value are Text.
public class MapRed_Mapper extends Mapper<Text, Text, Text, Text> {

    @Override
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // Generate a fresh UUID for every input record.
        Text uuid = new Text(UUID.randomUUID().toString());
        context.write(key, uuid);
    }
}

  • Do you need a UUID or a key for the data? You can always use the Java UUID class. But if you need a key that uniquely identifies the data across MapReduce nodes, you need to do it differently, because the UUID generated on different MapReduce nodes will differ for the same data. – tsolakp Jul 07 '17 at 19:54
  • I just need to generate UUIDs using MapReduce. Suppose I have an input file with columns a, b, and c and 5 rows. I need a column d containing a UUID for each of the 5 rows, i.e. 5 different UUIDs. – Rishab Oberoi Jul 07 '17 at 20:18

2 Answers

  • MapReduce Java approach:

1) Read your rows from the text file in the mapper class's map method.

2) Add the UUID as an extra column in the reduce method, as shown below (use a single reducer so the whole CSV gets its extra column in one output file).

3) Emit it through context.write.
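The per-row transformation in step 2 can be sketched in plain Java (the class and method names here are illustrative, not from the question); inside the reducer you would apply the same logic to each incoming row before writing it to the context:

```java
import java.util.UUID;

public class UuidColumn {
    // Append a freshly generated UUID as column d to one CSV row.
    // This mirrors what the reducer does for each value it receives.
    static String appendUuid(String csvRow) {
        return csvRow + "," + UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String out = appendUuid("a1,b1,c1");
        System.out.println(out);                   // e.g. a1,b1,c1,3f2504e0-4f89-...
        System.out.println(out.split(",").length); // 4 columns
    }
}
```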

Use java.util.UUID, available since JDK 5, to create a random UUID (universally unique identifier).

To obtain the string value of the generated UUID, call its toString() method:

    UUID uuid = UUID.randomUUID();
    String randomUUIDString = uuid.toString();

    System.out.println("Random UUID String = " + randomUUIDString);
   // System.out.println("UUID version       = " + uuid.version());
   // System.out.println("UUID variant       = " + uuid.variant());

For CSV generation:
Use TextOutputFormat. The default key/value separator is a tab character. Change the separator by setting the property mapreduce.output.textoutputformat.separator in your driver:

    conf.set("mapreduce.output.textoutputformat.separator", ",");
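Putting the pieces together, a minimal driver sketch might look like the following (class names are illustrative; KeyValueTextInputFormat is assumed to match the Text/Text mapper signature, and a single reducer keeps the output in one CSV part file):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class UuidJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Emit comma-separated key/value pairs instead of tab-separated.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "append-uuid-column");
        job.setJarByClass(UuidJobDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1); // single reducer -> one CSV output file

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```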
  • Spark approach (since you added the spark tag, here is a pointer):

There is already an existing answer on SO; please see:

add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator

Then you can convert the result to CSV format:

df.write.format("com.databricks.spark.csv").save(filepath)
Ram Ghadiyaram
  • I have added the code to the question. The part file generated by this code gave blank output. The input was a CSV with 3 columns and 5 rows. Can you see where I am going wrong? – Rishab Oberoi Jul 08 '17 at 19:07
  • Where are you reading the file input? I can't see any of the fields being read; the program is not correct. Append the UUID column in the reducer and use a single reducer (set the number of reducers to 1) to reduce to a CSV file; otherwise the UUIDs from different mappers may collide. – Ram Ghadiyaram Jul 09 '17 at 13:43
  • How do I read it? I am passing the input file using the args. Do I have to read it in the mapper stage, then collect it and add the UUID in the reducer? I have not worked with a MapReduce program prior to this. – Rishab Oberoi Jul 09 '17 at 13:48
  • Yes, inside the mapper you can use something like `String[] tokens = value.toString().split(",");` to get all the CSV column values, then pass them as-is to the reducer; in the reducer, add the new column and write it to the context. – Ram Ghadiyaram Jul 09 '17 at 14:06
  • Follow https://dzone.com/articles/writing-hadoop-mapreduce-task as a sample program if you don't know how to write MapReduce code, but the approach must be as discussed above; I'm giving this only as an example of the syntax. – Ram Ghadiyaram Jul 09 '17 at 14:37
  • Thanks, I will try this and get back. – Rishab Oberoi Jul 09 '17 at 14:41
  • I have updated the Mapper class, as I am trying to use only the mapper to append the UUID. I am getting the output with a UUID, but it is the same UUID for all the records. – Rishab Oberoi Jul 10 '17 at 09:18

Maybe I'm not getting the question, but you can just generate a UUID for every call to map by doing:

@Override
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
    context.write(key, new Text(UUID.randomUUID().toString()));
}
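The key point is that randomUUID() must be called inside map() for every record, not cached once in a field, or every row ends up with the same id. A quick plain-Java check of that difference:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class PerRowUuid {
    public static void main(String[] args) {
        // Generating the UUID once and reusing it gives every row the same id...
        String shared = UUID.randomUUID().toString();
        Set<String> reused = new HashSet<>();
        for (int i = 0; i < 5; i++) reused.add(shared);
        System.out.println(reused.size()); // 1

        // ...whereas calling randomUUID() per row, as in the map method above,
        // gives each row a distinct id.
        Set<String> perRow = new HashSet<>();
        for (int i = 0; i < 5; i++) perRow.add(UUID.randomUUID().toString());
        System.out.println(perRow.size()); // 5
    }
}
```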
Thomas Jungblut