So I created two files, each with 127,955 lines and 24 random characters per line. The text of the two files was completely identical, but one file had Unix line breaks and the other had Windows line breaks. The file with Unix line breaks was 3,124 KB, while the file with Windows line breaks was 3,249 KB. There were no other differences between the files, so I have to assume that for some reason Windows line breaks take up more space. Any idea why?
3 Answers
If you opened the text files in hex editor, the difference you would see at the end of a line would be the following:
Windows Line Endings: 0x0D 0x0A
Unix Line Endings: 0x0A
The 0x0D is the hex value for the carriage return (represented textually simply as \r).
The 0x0A is the hex value for the new line character (represented textually simply as \n).
When line endings are in the Windows EOL format, the lines will end with 2 characters: \r\n; while the Unix EOL format ends with 1 character: \n.
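If you don't have a hex editor handy, you can see the same thing by encoding one line both ways and looking at the trailing bytes. This is just a minimal Python sketch; the sample text and variable names are made up for illustration, and it assumes a single-byte encoding such as ASCII:

```python
# The same 24-character line, terminated two different ways.
text = "ABCDEFGHIJKLMNOPQRSTUVWX"  # arbitrary 24-character sample

unix_line = (text + "\n").encode("ascii")       # Unix EOL: LF
windows_line = (text + "\r\n").encode("ascii")  # Windows EOL: CR LF

print(unix_line[-1:].hex())               # 0a
print(windows_line[-2:].hex())            # 0d0a
print(len(unix_line), len(windows_line))  # 25 26
```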
So, 127,955 * (24 + 1) == 3,198,875 bytes (3,123.9 KB) for Unix EOL and 127,955 * (24 + 2) == 3,326,830 bytes (3,248.86 KB) for Windows EOL.
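The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: it assumes one byte per character, that every line (including the last) ends with a line break, and that 1 KB means 1,024 bytes. The function name is hypothetical.

```python
LINES = 127_955
CHARS_PER_LINE = 24

def file_size_bytes(eol_len: int) -> int:
    """Total size when every line carries an EOL of eol_len bytes."""
    return LINES * (CHARS_PER_LINE + eol_len)

unix_size = file_size_bytes(1)     # "\n"
windows_size = file_size_bytes(2)  # "\r\n"

print(unix_size, round(unix_size / 1024, 2))        # 3198875 3123.9
print(windows_size, round(windows_size / 1024, 2))  # 3326830 3248.86
print(windows_size - unix_size)                     # 127955 (one extra byte per line)
```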
Hope that helps.
As for the actual "why" bit: historically, a teletypewriter used Carriage Return (hex 0D) to move the print head to the left margin, followed by a Line Feed (hex 0A) to advance the paper.
Commodore and Apple (before the Unix-based Mac OS X) kept the Carriage Return as their line-ending symbol; Unix kept the Line Feed; and CP/M / DOS kept both.
Many Internet protocols (e.g., HTTP) are still defined in terms of both (aka "CRLF"), but in actual text files, the only program on Windows that I've encountered that doesn't deal correctly with "just" a Line Feed is Notepad.
Technically, the term "Newline" exists just to mask this historical difference. E.g., in C a "\n" written to a text-mode stream, or in Lisp a #\Newline, maps to whichever notation the local system happens to prefer, whereas "\r" or #\Return is used when that one particular character is wanted specifically.
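Python's text-mode files do the same kind of translation, which makes the idea easy to try out. The sketch below is Python rather than C or Lisp, and the helper name is made up; it relies on the documented `newline` parameter of the built-in `open()`, where `None` means "use the platform default" (`os.linesep`) when writing.

```python
import os
import tempfile

def written_bytes(newline):
    """Write two lines in text mode and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", newline=newline) as f:
            f.write("one\ntwo\n")  # the program only ever says "\n"
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

print(written_bytes("\n"))    # b'one\ntwo\n'      (no translation)
print(written_bytes("\r\n"))  # b'one\r\ntwo\r\n'  ("\n" becomes CR LF)
print(written_bytes(None))    # depends on the platform's os.linesep
```

The program text is identical in all three calls; only the translation applied on the way to disk changes.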
Windows uses a carriage return followed by a line feed. Unix uses just a line feed. So that's one extra byte per line break.