I have a set of TSV files on HDFS structured like this:
g1 a
g1 b
g1 c
g2 a
g2 x
g2 y
g3 b
g3 d
...
I'd like to convert these files into files called hdfs:///tmp/g1.tsv, hdfs:///tmp/g2.tsv, and hdfs:///tmp/g3.tsv such that...
g1.tsv looks like:
a
b
c
g2.tsv looks like:
a
x
y
g3.tsv looks like:
b
d
etc.
These files are large, and I'd like this splitting to run as parallel as possible (it's a split by key, not just a rename). Is there a simple MapReduce job, Spark job, or HDFS file operation that does this?
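For clarity, here is a local, single-process sketch of the split I'm after (the helper name and temp directory are made up for illustration). On the cluster, the same grouping is what a Spark `df.write.partitionBy("key")` or a Hadoop `MultipleOutputs` job would perform in parallel, writing one output per key:

```python
import os
import tempfile
from collections import defaultdict

def split_tsv_by_key(lines, out_dir):
    """Group TSV lines by their first column and write one <key>.tsv per key.

    Single-machine analogue of the distributed split: a Spark or MapReduce
    job would do this grouping shuffle-side so each key's file is produced
    in parallel rather than sequentially.
    """
    groups = defaultdict(list)
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        groups[key].append(value)
    for key, values in groups.items():
        with open(os.path.join(out_dir, f"{key}.tsv"), "w") as f:
            f.write("\n".join(values) + "\n")
    return sorted(groups)

# The sample data from the question, with tab-separated columns:
rows = ["g1\ta", "g1\tb", "g1\tc", "g2\ta", "g2\tx", "g2\ty", "g3\tb", "g3\td"]
out = tempfile.mkdtemp()
keys = split_tsv_by_key(rows, out)
```

Note that Spark's `partitionBy` writes directories named `key=g1/`, `key=g2/`, etc., containing part files, rather than flat `g1.tsv` files, so a follow-up rename or merge step may still be needed if the exact filenames matter.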