On my Ubuntu server I have a directory that contains these two files:
testDir# ls -als
insgesamt 12
4 drwxr-xr-x 2 root root 4096 Mai 29 15:12 .
4 drwxr-xr-x 6 root root 4096 Mai 28 18:38 ..
0 -rw-r--r-- 1 root root 0 Mai 28 19:17 Ö.txt
4 -rw-r--r-- 1 root root 9 Mai 28 19:16 Ö.txt
The file names look the same, but they are not. The file with size 0 has one character before the dot (Unicode code point 214 = Ö); the other file (size 9) has two characters (code point 79 = O, followed by 776 = ¨, a combining character that modifies the character before it). To display the Unicode code points, I wrote this little script:
#!/usr/bin/env python3
import os
def printFileList(fileList):
    for file in fileList:
        string = ""
        for char in file:
            string += str(ord(char)) + " "
        string += "<br>"
        print(string)
print("Content-Type: text/html\n")
printFileList(os.listdir("testDir"))
printFileList(["Ö.txt", "Ö.txt"])
As you can see, I first read the file names from the operating system and display the code points of their characters. Then I do the same with strings that are hard-coded in the program.
When I run this program from the shell, I get this result:
testDir# ./test.py
Content-Type: text/html
79 776 46 116 120 116 <br>
214 46 116 120 116 <br>
79 776 46 116 120 116 <br>
214 46 116 120 116 <br>
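The two spellings in this output are the decomposed and the precomposed form of the same letter. A minimal sketch that reproduces them without touching the file system, assuming only the standard `unicodedata` module (the helper name `code_points` is made up for this illustration):

```python
import unicodedata

def code_points(text):
    """Return the decimal Unicode code points of every character in text."""
    return [ord(char) for char in text]

precomposed = "\u00d6.txt"  # one character: U+00D6 (214)
decomposed = "O\u0308.txt"  # two characters: O (79) + combining diaeresis U+0308 (776)

print(code_points(decomposed))   # [79, 776, 46, 116, 120, 116]
print(code_points(precomposed))  # [214, 46, 116, 120, 116]

# The strings differ as written, but match once both use the same normalization form.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```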
However, test.py (to be more precise: a more advanced version of it) is meant to be run as a CGI script from a web server. My web server is Apache 2, and when I call the script from a browser, I get this result:
79 56524 56456 46 116 120 116
56515 56470 46 116 120 116
79 776 46 116 120 116
214 46 116 120 116
The string Content-Type: text/html is part of the HTTP protocol and is not displayed, and <br> is rendered as a line break, so these parts are not visible in a browser for good reasons. But look at the numbers!
What should be 776 is 56524 56456 in the first line, and in the second line 214 became 56515 56470. This happened only for the file names read from the operating system. The hard-coded strings are correct.
My questions:
1) What causes this strange behavior?
2) What has to be changed so that the correct code points (776 and 214) are shown?
Addendum
I added these lines to my program:
import sys
print(sys.getfilesystemencoding())
The output of this line is:
when run from the shell: utf-8, which is correct.
when run from Apache as a CGI script: ascii, which is wrong.
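An ASCII file system encoding would fit the odd numbers from the browser. Subtracting 56320 (0xDC00) from them gives 204 136 and 195 150, which are the UTF-8 bytes of code points 776 and 214. That is the pattern Python's `surrogateescape` error handler produces for bytes it cannot decode. The sketch below reproduces the values in isolation; it is only an illustration under that assumption, not a confirmed trace of what happens under Apache, and the function name `decode_like_ascii_fs` is made up:

```python
# UTF-8 bytes of the two file names, as they would be stored on disk.
decomposed_bytes = "O\u0308.txt".encode("utf-8")   # b'O\xcc\x88.txt'
precomposed_bytes = "\u00d6.txt".encode("utf-8")   # b'\xc3\x96.txt'

def decode_like_ascii_fs(raw):
    """Decode bytes as ASCII; each undecodable byte b becomes U+DC00 + b."""
    return raw.decode("ascii", errors="surrogateescape")

def code_point_line(name):
    """Format the code points of name the same way the CGI script does."""
    return " ".join(str(ord(char)) for char in name)

print(code_point_line(decode_like_ascii_fs(decomposed_bytes)))
# 79 56524 56456 46 116 120 116
print(code_point_line(decode_like_ascii_fs(precomposed_bytes)))
# 56515 56470 46 116 120 116
```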
So, my new question is:
How can I tell my script to always use UTF-8 as the file system encoding?
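One possible workaround, sketched here under the assumption that the names on disk really are UTF-8 and that the script runs on a POSIX system: instead of changing the file system encoding, turn each name returned by `os.listdir` back into its original bytes with `os.fsencode` and decode those bytes explicitly. The helper name `to_utf8_name` is hypothetical.

```python
import os

def to_utf8_name(name):
    """Recover the raw bytes of a name from os.listdir and decode them as UTF-8.

    os.fsencode reverses the file system decoding, including escaped bytes
    that were mapped to U+DC80..U+DCFF. Assumption: the name on disk is valid
    UTF-8; otherwise the decode step raises UnicodeDecodeError.
    """
    return os.fsencode(name).decode("utf-8")

def fixed_listing(directory):
    """Return the names in directory, each decoded as UTF-8."""
    return [to_utf8_name(name) for name in os.listdir(directory)]
```

With this in place, the script could call `printFileList(fixed_listing("testDir"))` instead of passing the result of `os.listdir` directly. When the file system encoding is already UTF-8, the conversion leaves the names unchanged.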