don't sort if you don't need sorted output; a single pass of awk does the job (a plainer spelling follows the sample output) :
nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
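
All three of the above are the same golfed idiom: the array entry for a line is empty the first time that line is seen, so the pattern is false on the first occurrence and true from the second onward, meaning every repeated occurrence gets printed in original input order, with no sorting and no temp files. A plainer spelling of the same filter (the array name seen is arbitrary):

awk 'seen[$0]++'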
for 1 GB+ files, you can speed things up a bit by setting FS='\n' so awk doesn't waste time splitting each record into fields it never uses :
mawk2 '__[$_]++' FS='\n'
for 100 GB inputs, one idea would be to use GNU parallel to spin up, say, 10 instances of awk, pipe the full 100 GB to each of them, and assign each instance its own key range to handle (e.g. instance 4 takes lines beginning with F through Q, and so on).
Instead of having them output everything and then trying to sort that monstrosity, each instance would simply tally its lines and print only a frequency report: how many copies ("Nx") of each unique line ("Lx") it has recorded.
From there one could sort the much smaller report on the column holding the Lx's, then pipe it to one more awk that prints Nx copies of each line Lx (sketched below).
That's probably a lot faster than trying to sort 100 GB.
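
A minimal sketch of that tally / sort / expand pipeline, assuming tab-separated "Nx<TAB>Lx" report lines and made-up file names (bigfile.txt, report.*.txt); the parallel plumbing that fans the input out to the 10 instances is left out:

# one instance (say, the one assigned F-Q): tally only its range, emit "Nx<TAB>Lx"
mawk2 '/^[F-Q]/ { n[$0]++ } END { for (L in n) print n[L] "\t" L }' FS='\n' bigfile.txt > report.4.txt

# merge the per-instance reports and sort the (much smaller) result on the Lx column
sort -t $'\t' -k2 report.*.txt > report.sorted.txt

# expand: print Nx copies of each line Lx
mawk2 -F'\t' '{ c = $1 + 0; L = substr($0, length($1) + 2); while (c-- > 0) print L }' report.sorted.txt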
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats :
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
(that's 8.12 million unique rows spanning roughly 154 MB)
...resulting in a 10.6 GB test file :
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
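
For reference, one way a test file like that could be built (not the exact commands used here; raw.txt is a placeholder name, testfile.txt matches the name in the run below):

# concatenate 71 independently shuffled copies of the raw file
for i in $(seq 71); do shuf raw.txt; done > testfile.txt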
even when using just a single instance of awk, it finished filtering the 10.6 GB in about 13.2 minutes, which is reasonable given that it's tracking 8.1 million unique hash keys :
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin
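
As a sanity check on the filtered output (a hypothetical addition, not shown in the run above): since the test file holds 71 copies of 8,125,950 unique rows, the filter should emit exactly 576,942,450 - 8,125,950 = 568,816,500 lines, i.e. 70 copies of each unique row:

mawk2 '__[$_]++' FS='\n' testfile.txt | wc -l    # expect 568816500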