I am trying to process a large CSV file that contains Greek words. A sample of the data:
FileID  Word Num    Normalized  Normalized  POS Lemma   MorphoFeats w/n/naw Element
susi0011    2   Θεόδορος    Θεόδορος    PROPN   Θεύδωρος    Case=Nom|Gender=Masc|Number=Sing    naw <orig xml:id="susi0011-2" xml:lang="grc">Θεόδορος</orig>
susi0012    2   Σιμονίου    Σιμονίου    PROPN   Σιμονύος    Case=Gen|Gender=Masc|Number=Sing    naw <orig xml:id="susi0012-2" xml:lang="grc">Σιμονίου</orig>
susi0012    3   πρεσβίτερος πρεσβίτερος ADJ πρέσβυς Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">πρεσβίτερος</orig>
I read it in with a simple call:
import pandas as pd
df = pd.read_csv('myfile.csv', encoding='utf-8')
I also tried:
with open('myfile.csv', encoding='utf-8') as f:
    df = pd.read_csv(f)
Yet when I call df.head() (in both cases), my Greek words come out as ???. Thinking that this might be more of a display issue, I also tried writing the dataframe back out as a CSV (both with and without an encoding parameter), but the Greek was lost there too. The output looks something like this:
FileID  Word Num    Normalized  Normalized.1    POS Lemma   MorphoFeats w/n/naw Element
0   susi0011    2   ????????    ????????    PROPN   ????????    Case=Nom|Gender=Masc|Number=Sing    naw <orig xml:id="susi0011-2" xml:lang="grc">?????...
1   susi0012    2   ????????    ????????    PROPN   ????????    Case=Gen|Gender=Masc|Number=Sing    naw <orig xml:id="susi0012-2" xml:lang="grc">?????...
2   susi0012    3   ??????????? ??????????? ADJ ??????? Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">?????...
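One way to tell a display problem from a data problem might be to inspect the code points of a value instead of printing the Greek itself. The following is only a minimal sketch under assumptions: the in-memory `sample` is a hypothetical comma-separated stand-in for myfile.csv, and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for myfile.csv: UTF-8 bytes, comma-separated (an assumption).
sample = "FileID,Word Num,Normalized\nsusi0011,2,Θεόδορος\n".encode("utf-8")

# Same call as in the question, but reading from the in-memory bytes.
df = pd.read_csv(io.BytesIO(sample), encoding="utf-8")

# Take the first Greek value from the dataframe.
word = df.loc[0, "Normalized"]

# The hex code points are plain ASCII, so they print even on a console
# that cannot render Greek characters.
code_points = [hex(ord(ch)) for ch in word]
print(code_points)

# If the data itself had been replaced, every character would be a literal "?".
is_really_question_marks = all(ch == "?" for ch in word)
print(is_really_question_marks)
```

If the code points fall in the Greek ranges rather than all being 0x3f, the dataframe presumably holds the right text and the question marks come from the display or from a later write step.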
Any suggestions?
 
    