remove lines with duplicate words

Question

I have a sorted file with lines like this:

word1  abca
word1  abcb
word1  abcc
word2  abca
word2  abcb
word3  abbb
...........

and I want to keep only the first line for each word, like this:

word1  abca
word2  abca
word3  abbb
...........

score 4 · Accepted Answer · answered Jun 07 '14 at 11:56


This magic incantation is a famous awk idiom:

awk '!seen[$1]++' file

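The first time a line with a given $1 is seen, seen[$1] is still 0, so !seen[$1]++ is true and the line is printed. The post-increment then bumps the counter, so every later line with the same $1 evaluates to false and is skipped. Because the lookup is by key, this works whether or not the file is sorted.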
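To make the idiom less magical, here is a longer spelling of the same logic. It is only a sketch: the data is the sample from the question, and the shell variable names `input` and `result` are illustrative.

```shell
# Sample data copied from the question.
input=$(printf '%s\n' 'word1  abca' 'word1  abcb' 'word1  abcc' \
                      'word2  abca' 'word2  abcb' 'word3  abbb')

# Long form of: awk '!seen[$1]++'
result=$(printf '%s\n' "$input" | awk '{
    if (seen[$1] == 0) print   # first time this key appears: print the line
    seen[$1]++                 # count the key so later lines are skipped
}')

printf '%s\n' "$result"
```

The one-liner folds the test and the increment into a single pattern and relies on awk's default action (print) when the pattern is true.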


glenn jackman


suspectus · Answer 2 · 2014-06-07T09:44:27.720

1

An awk solution that uses a variable to detect a new word: whenever $1 differs from the stored word, print the line and store the current word in the variable.

Because the data file is sorted, identical words are adjacent, so only the first occurrence of each word is printed.

   awk 'BEGIN{w=""} w!=$1 {print;w=$1}' your-file
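The dependence on sorted input can be seen with a quick experiment. This is a sketch with made-up, ungrouped data (not from the question), and the variable names `input`, `adjacent` and `hashed` are illustrative. It compares the adjacent-word test with the array-based `!seen[$1]++` idiom:

```shell
# Ungrouped input: word1 shows up again after word2 (illustrative data).
input=$(printf '%s\n' 'word1  abca' 'word2  abca' 'word1  abcb')

# Adjacent comparison: remembers only the previous word.
adjacent=$(printf '%s\n' "$input" | awk 'w != $1 { print; w = $1 }')

# Array lookup: remembers every word seen so far.
hashed=$(printf '%s\n' "$input" | awk '!seen[$1]++')

printf 'adjacent:\n%s\n\nhashed:\n%s\n' "$adjacent" "$hashed"
```

On this input the adjacent comparison prints all three lines, since the second `word1` follows `word2`, while the array version prints only the first two. On sorted input the two give the same result, and the variable version needs memory for only one word.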

edited Jun 07 '14 at 09:44

answered Jun 07 '14 at 09:27


score 0 · Answer 3 · answered Jun 07 '14 at 13:21

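You could also use the -w flag of GNU uniq, which tells it to compare only the first N characters of each line. Whether this works for you depends on your actual data: it is reliable only if every word has the same length (5 characters in this example), because otherwise the comparison window cuts a long word short or reaches into the second field.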

$ sort file.txt | uniq -w 5
word1  abca
word2  abca
word3  abbb

Alternatively, reverse the order of the fields and use uniq -f 1 to skip comparing the 1st field:

$ awk '{print $2,$1}' file.txt | uniq -f 1 | awk '{print $2,$1}'
word1 abca
word2 abca
word3 abbb

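Or get the 1st fields and then grep for them, limiting the search to the first match. The pattern is quoted and anchored to the start of the line so that word1 does not match word10 or a word that appears in the second field:

$ for i in $(awk '{print $1}' file.txt | sort -u); do grep -m 1 "^$i[[:space:]]" file.txt; done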
word1  abca
word2  abca
word3  abbb

And, for completeness' sake, a Perl one:

$ perl -ane 'print if $k{$F[0]}++<1' file.txt
word1  abca
word2  abca
word3  abbb
