
I have a very large text file (> 50 GB), but most lines are duplicates, so I want to remove them. Is there any way to remove duplicate lines from a file that also works for files larger than 2 GB? Every method I have found so far only works on small files.

2 Answers

Assuming all lines are shorter than 7 kB, and that you have bash, dd, wc, head, tail, sed and sort installed from Cygwin or Unix:

{
  # total size in bytes, so the loop knows when the last chunk has been read
  size=$(LANG= wc -c < large_text_file)
  i=0
  while [ $((i * 1024000)) -lt "$size" ]
  do
    LANG= dd 2>/dev/null bs=1024 skip=${i}000 if=large_text_file count=1021 \
    | LANG= sed -e '1d' -e '$d' | LANG= sort -u
    i=$((1 + i))
  done
  # put back the first and last lines of the file, which sed dropped
  LANG= head -n 1 large_text_file
  LANG= tail -n 1 large_text_file
} | LANG= sort -u > your_result

This divides the file into chunks of 1,024,000 bytes, plus an extra 3*7*1024 = 21,504 bytes (the "21" in 1021) taken from the start of the next chunk. Because a chunk boundary may cut a line in two, the first (1d) and last ($d) lines of each chunk are discarded by sed; the 21 kB overlap (at least three maximum-length lines) guarantees that every discarded line, apart from the very first and very last lines of the file, still appears intact in a neighbouring chunk.
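
To make the arithmetic concrete, here is a tiny standalone sketch (not part of the script above) that prints the byte ranges of the first few chunks and their overlap:

bs=1024
for i in 0 1 2; do
  start=$((i * 1000 * bs))            # where chunk i begins
  end=$((start + 1021 * bs))          # where chunk i ends
  next_start=$(((i + 1) * 1000 * bs)) # where chunk i+1 begins
  printf 'chunk %d: bytes %d-%d, overlap with next chunk: %d bytes\n' \
    "$i" "$start" "$end" "$((end - next_start))"
done

Each overlap comes out as 21,504 bytes, i.e. room for three 7 kB lines.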

So to compensate for those two exceptions, the first and last lines of the whole file are extracted separately (head -n 1 and tail -n 1) and added back before the final sort.

The loop stops once the chunk offset has moved past the end of the file, i.e. after the last chunk has been extracted.

sort -u may be viewed as a compressor here: it sorts its input and then skips the duplicates. The first sort deduplicates each chunk on its own; the second sort deduplicates the concatenation of all those partial results.
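
As a quick sanity check of that two-level idea, here is a toy example (tiny made-up file names) showing that deduplicating overlapping pieces and then deduplicating their concatenation gives the same result as a single global sort -u:

# Per-piece sort -u followed by a final sort -u over the concatenation
# equals one sort -u over the whole file.
printf 'b\na\nb\nc\na\n' > toy.txt
{ head -n 3 toy.txt | sort -u ; tail -n 3 toy.txt | sort -u ; } | sort -u > two_level.txt
sort -u toy.txt > one_pass.txt
diff two_level.txt one_pass.txt && echo identical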

You said text file, but I treat the data as binary anyway, hence the LANG= prefixes, which select the plain C locale so that sed and sort compare raw bytes (which also makes everything faster).
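
If you want to double-check the output afterwards (an optional extra step, not part of the recipe above), GNU sort can verify that the result is sorted and free of duplicates without re-sorting it:

# -c checks the ordering, -u additionally rejects repeated lines;
# on a violation it prints the offending line and exits non-zero.
LANG= sort -c -u your_result && echo "your_result is sorted and duplicate-free"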

Fire up a Linux instance in AWS/GCE and use uniq; OS X has it as well. Note that uniq only removes adjacent duplicate lines, so the file has to be sorted first (or just use sort -u).

Docs here: http://www.thegeekstuff.com/2013/05/uniq-command-examples/
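
For a file that does not fit in RAM, GNU sort does the external merge to disk for you; a minimal sketch (the paths and sizes are placeholders, assuming GNU coreutils):

# sort spills to temporary files once the data exceeds its memory buffer:
# -T picks a temp directory with enough free space, -S caps the buffer,
# -u drops duplicate lines, -o writes the result.
LANG= sort -u -T /mnt/scratch -S 4G -o deduped.txt large_text_file

# Equivalent two-step form: uniq only removes *adjacent* duplicates,
# so the input must be sorted first.
LANG= sort -T /mnt/scratch large_text_file | uniq > deduped.txt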