
I want to print the first 1000 characters of a UTF-8 encoded file. I know that the head tool can print the first n bytes of a file, but it may cut a multibyte character in the middle, so I get garbled output at the end.

I can write an awk program to do this, but is there a simpler way?

PS. I find it unreasonable that head and tail do not honor the character encoding (the LANG environment variable), while other tools such as cut, wc, sed and awk all do.

2 Answers


Just to spell out the awk way the OP mentioned, for other searchers: let's test with 5 acute-accented vowels:

printf 'áéíóú' | LC_CTYPE=en_US.UTF-8 awk '{print substr($0,1,3);exit}'

picks the first three characters and outputs the desired:

áéí

Each of those UTF-8 characters is 2 bytes long, which we can check with:

printf 'áéíóú' | hd

which gives:

00000000  c3 a1 c3 a9 c3 ad c3 b3  c3 ba                    |..........|
0000000a

so we could equivalently test it as:

printf '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba' | LC_CTYPE=en_US.UTF-8 awk '{print substr($0,1,3);exit}'

If we use the wrong locale, e.g. C which treats each byte separately:

printf 'áéíóú' | LC_CTYPE=C awk '{print substr($0,1,3);exit}' | hd

gives the first three bytes:

c3 a1 c3

which shows on the terminal as just:

á

since the trailing c3 is an incomplete multibyte sequence on its own.

Not sure how this compares to iconv in performance on huge inputs, but for small files it is simple and good enough.

Tested on Ubuntu 21.04.


Not sure it is simpler, but this is my way:

iconv -t UTF-32 file | head -c $((1000 * 4 + 4)) | iconv -f UTF-32

This converts to a fixed-width form of Unicode, so 1000 characters always occupy exactly 1000 × 4 bytes; the extra 4 bytes account for the byte-order mark that iconv prepends to UTF-32 output.