6

How can I remove non-duplicate lines from a text file using any Linux program such as sed, awk, or anything else?

Example:

abc
bbc
abc
bbc
ccc
bbc

Result:

abc
bbc
abc
bbc
bbc

The second list has ccc removed because it has no duplicate.

Is it also possible to remove both the non-duplicate lines and the lines that have only 2 duplicates, leaving only those that have more than 2 duplicates?

karel
qlwik

3 Answers

11

The solutions posted by others do not work on my Debian Jessie: they keep a single copy of any duplicate line, while it is my understanding of the OP that all copies of the duplicate lines are to be kept. If I have understood the OP right, then ...

  1. The following command

    awk '!seen[$0]++' file
    

    prints only the first occurrence of each line, i.e. it removes all duplicate copies.

  2. The following command

    awk 'seen[$0]++' file 
    

    outputs all the duplicates, but not the original copy: i.e., if a line appears n times, it outputs the line n-1 times.

  3. Then the command

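    awk 'seen[$0]++' file > temp && awk '!seen[$0]++' temp > temp1 && cat temp1 >> temp
    

    solves your problem: temp first receives the n-1 extra copies of each repeated line, and one more copy of each of those lines is then appended, so lines that occur only once never appear. The lines are not in the original order.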

  4. If you want lines which have two or more duplicates, you can now iterate the above:

    awk 'seen[$0]++' file | awk 'seen[$0]++' > temp
    

    keeps n-2 copies of each line that appears n > 2 times, and drops all other lines. Now

    awk '!seen[$0]++' temp > temp1 
    

    writes one copy of each line in the temp file to temp1, and you can now obtain what you wish (i.e. all n copies of the lines that appear more than twice) by appending temp1 twice:

    cat temp1 >> temp; cat temp1 >> temp
    
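  5. If you need to do this for lines which appear more than N times, the following command (shown with N=2)

    awk -v N=2 'seen[$0]++ && seen[$0] > N' file
    

    is simpler than chaining N times the command awk 'seen[$0]++' file. Like the chain, it outputs n-N copies of each line that appears n > N times.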
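
The temp-file steps above can also be folded into a single awk call that reads the file twice: once to count, once to print. What follows is only a sketch, not part of the commands above; the function name keep_repeated and its argument order are made up for illustration. It keeps the original line order and takes the minimum number of occurrences as a parameter.

```shell
# keep_repeated MIN FILE (hypothetical helper): print, in original order,
# every line of FILE that occurs at least MIN times.
keep_repeated() {
    awk -v min="$1" '
        NR == FNR { count[$0]++; next }   # first pass: count each line
        count[$0] >= min                  # second pass: print frequent lines
    ' "$2" "$2"
}

sample=$(mktemp)
printf '%s\n' abc bbc abc bbc ccc bbc > "$sample"
keep_repeated 2 "$sample"   # all copies of abc and bbc; ccc is dropped
keep_repeated 3 "$sample"   # only the three bbc lines
rm -f "$sample"
```

Because the file is named twice on the command line, this only works on a regular file, not on piped input.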

MariusMatutiae
7

You can use the sort and uniq commands for this.

If your data is in the file abc.txt, run:

sort abc.txt | uniq -d

The output, one copy of each duplicated line, will be:

abc 
bbc
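
Since uniq -d prints each repeated line only once, a variant that prints every copy may be closer to what the question asks for. The sketch below assumes GNU uniq, whose -D option prints all repeated lines; the function name all_repeated_sorted is made up for illustration. The output is in sorted order, not in the original order.

```shell
# all_repeated_sorted FILE (hypothetical helper): print every copy of each
# line that occurs more than once, in sorted order.
# Assumption: uniq supports -D (a GNU coreutils option).
all_repeated_sorted() {
    sort "$1" | uniq -D
}

sample=$(mktemp)
printf '%s\n' abc bbc abc bbc ccc bbc > "$sample"
all_repeated_sorted "$sample"   # abc twice, then bbc three times
rm -f "$sample"
```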
UUU
0

The answer by @UUU doesn't preserve the original line order. To preserve it, use the following instead:

 printf '%s\n' abc bbc abc bbc ccc bbc | \
     nl -nrz     | \
     sort -k2    | \
     uniq -f1 -D | \
     sort        | \
     cut -f2
  1. The printf command simply reproduces the input.
  2. The nl command prepends line numbers with leading zeros (-nrz), so that a plain sort can later restore the order without the -V flag.
  3. The first sort sorts by the 2nd field, i.e. the line text. By default, fields are separated by blanks.
  4. The uniq command compares adjacent lines (which is why you have to sort first); -f1 skips the first field, which is the line number, and -D prints every copy of each repeated line.
  5. The second sort restores the original order using the line numbers.
  6. cut removes the leading line numbers.
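
The same pipeline can be wrapped in a function that reads a file. This is a sketch with a few changes that are assumptions, not part of the steps above: nl -ba also numbers blank lines, sort -n restores the order numerically, and cut -f2- keeps the whole line even when the text contains tabs. The function name keep_repeated_pipeline is made up for illustration, and -D still requires GNU uniq.

```shell
# keep_repeated_pipeline FILE (hypothetical helper): print every copy of
# each repeated line of FILE, in the original order.
keep_repeated_pipeline() {
    nl -ba -nrz "$1" |   # number every line, blank ones included
        sort -k2 |       # bring identical text together
        uniq -f1 -D |    # keep groups of two or more lines (GNU uniq)
        sort -n |        # restore the original order by line number
        cut -f2-         # drop the number, keep the rest of the line
}

sample=$(mktemp)
printf '%s\n' abc bbc abc bbc ccc bbc > "$sample"
keep_repeated_pipeline "$sample"   # abc, bbc, abc, bbc, bbc
rm -f "$sample"
```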
RJ-