I tried to look this up in the man pages of the sort command, but could not find anything.
Consider the following text file t.txt (note that the first line begins with a space):
 11
1 0
(Hex dump of t.txt:
$ xxd -p t.txt
2031310a3120300a
)
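To reproduce the file exactly, including the leading space, something like the following should work. This is only a sketch: it uses od instead of xxd because od is specified by POSIX, and it writes t.txt into the current directory.

```shell
# Recreate t.txt byte for byte; the first line starts with a space (0x20).
printf ' 11\n1 0\n' > t.txt

# Dump the bytes in hex; this should show: 20 31 31 0a 31 20 30 0a
od -An -tx1 t.txt
```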
Running sort on this file with LC_COLLATE="en_US.UTF-8" gives:
$ LC_COLLATE="en_US.UTF-8" sort t.txt
1 0
 11
If we examine the first character position (or column) in the file, we observe that the first
row has a space, and the second row has a 1.
Since a space has the byte value 0x20, which is less than the byte value of the character 1 (0x31),
I would expect sort to give:
 11
1 0
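The byte values behind this expectation can be checked from the shell. A minimal sketch, assuming a POSIX-conforming printf, which converts an argument that starts with a quote character into the numeric code of the character that follows it:

```shell
# "' " is a quote followed by a space, "'1" is a quote followed by the digit 1.
# %x prints each character code in hexadecimal.
codes=$(printf '%x %x' "' " "'1")
echo "$codes"    # prints: 20 31
```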
It turns out that the expected sort order can be obtained with LC_COLLATE=C (the locale name is case-sensitive, so the C is uppercase):
$ LC_COLLATE=C sort t.txt
 11
1 0
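For anyone reproducing this, here is a hedged sketch that runs both cases side by side. It sets LC_ALL rather than LC_COLLATE because LC_ALL, when it is set in the environment, overrides LC_COLLATE. It also assumes the locale is listed by locale -a as en_US.utf8 or en_US.UTF-8; if it is not installed, the second sort is skipped.

```shell
# Recreate the sample file (leading space on the first line).
printf ' 11\n1 0\n' > t.txt

# Byte-value order: the line starting with a space (0x20) sorts first.
c_order=$(LC_ALL=C sort t.txt)
printf '%s\n' "$c_order"

# Locale-aware order, only attempted if the locale appears to be installed.
if locale -a 2>/dev/null | grep -Eqi '^en_US\.utf-?8$'; then
    LC_ALL=en_US.UTF-8 sort t.txt
else
    echo "en_US.UTF-8 not available; skipping locale-aware sort" >&2
fi
```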
What is the reason for the difference between LC_COLLATE="en_US.UTF-8" and LC_COLLATE=C in this case?
See also:
- What does “LC_ALL=C” do?
- Why does ls sorting ignore non-alphanumeric characters?
- How do locales work in Linux / POSIX and what transformations are applied?
- Internationalization: Collate (Sort) Order, Character Set, Accents, GLOB patterns
Edit:
Some more information about this issue was found here: