I need to work with text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8, and the source code is UTF-8 too. The problem appears when I split the text into words that contain characters like š, č, ť, á, ... I tried encoding and decoding and searched the web, but I don't know what to do about it. I checked the filesystem encoding, which is mbcs, and the default encoding, which is utf-8. Can somebody help me? The code below is my first version.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    f = open("text.txt", "r+")
    text = f.read()
    # Split the text into sentences, then the first sentence into words.
    sentences = re.split(r"[.!?]\s", text)
    words = re.split(r"\s", sentences[0])
    print sentences[0]
    print words
The result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
    ['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']
When I open the file with codecs instead:

    import codecs
    f = codecs.open("text.txt", "r+", encoding="utf-8")
the result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny
    [u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']
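A quick check that may help narrow this down, offered as a sketch rather than a confirmed diagnosis: the two lists above appear to hold the same text, once as UTF-8 bytes and once as unicode, and the escapes may only be how a list displays its elements. The values below are copied from the two outputs; the variable names are made up for illustration.

```python
# Values copied from the two list outputs above.
byte_word = b'najlep\xc5\xa1ie'   # from the plain open() run
text_word = u'najlep\u0161ie'     # from the codecs.open() run

# Decoding the bytes as UTF-8 gives the same text as the unicode string.
same_word = byte_word.decode('utf-8') == text_word

# The leading '\xef\xbb\xbf' is the UTF-8 byte order mark (BOM), u'\ufeff'.
bom_matches = b'\xef\xbb\xbf'.decode('utf-8') == u'\ufeff'

print(same_word)
print(bom_matches)
```

If both checks print True, the file is being read correctly in both runs, and what remains is the BOM on the first word and the way the list is displayed.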
What I need is output like:

    ['Nexus', '5', 'patrí', 'v', 'sučasnosti',....]
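One way this output might be produced, as a minimal sketch under a few assumptions: the file starts with a BOM (as the outputs above suggest), the `utf-8-sig` codec is used to drop it while decoding, and the words are joined by hand because printing a list shows the `repr` of each element. The sample text and the temporary file are hypothetical stand-ins for `text.txt`, and the snippet is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import re
import tempfile

# Hypothetical sample standing in for text.txt: UTF-8 text with a leading BOM.
sample = u'\ufeffNexus 5 patr\xed v su\u010dasnosti. Next sentence.'
path = os.path.join(tempfile.mkdtemp(), 'text.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# 'utf-8-sig' decodes UTF-8 and skips a leading BOM if one is present.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    text = f.read()

sentences = re.split(r'[.!?]\s', text)
words = re.split(r'\s+', sentences[0], flags=re.UNICODE)

# Build the list-like line by hand so each word is shown as text, not as its repr.
line = u'[' + u', '.join(u"'%s'" % w for w in words) + u']'
print(line)
```

Whether the accented characters then display correctly still depends on the console encoding, so on a Windows console the final `print` may need extra care.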
 
     
    