I am trying to load a table in Hive whose source CSV contains accented letters. I was initially using OpenCSVSerde to parse the CSV file and load it into the table. However, when it comes to the accented letters, Hive prints a � in their place. I have tried several methods, but none of them work.
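For context, the � is U+FFFD, the Unicode replacement character, which typically appears when bytes from a single-byte encoding such as Windows-1252 are decoded as UTF-8. A minimal Python sketch of the mismatch (the sample bytes are an assumption, matching how "café" is encoded in Windows-1252):

```python
# "café" as Windows-1252 / Latin-1 bytes: the accented é is the single byte 0xE9
data = b'caf\xe9'

# Decoding those bytes as UTF-8 fails on 0xE9 and substitutes U+FFFD,
# which is exactly the � seen in the Hive query output
print(data.decode('utf-8', errors='replace'))   # caf�

# Decoding with the file's real encoding recovers the accent
print(data.decode('windows-1252'))              # café
```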
- Using OpenCSVSerde, I tried declaring `serialization.encoding = 'windows-1252'` inside the `TBLPROPERTIES` section; it did not resolve the issue. I checked again by adding `serialization.encoding = 'windows-1252'` inside `WITH SERDEPROPERTIES`; that did not work either.
- Using OpenCSVSerde, I tried explicitly declaring `serialization.encoding = 'utf-8'` inside `WITH SERDEPROPERTIES`; it did not work.
- Using OpenCSVSerde, I tried declaring `serialization.encoding = 'ISO-8859-1'` inside `WITH SERDEPROPERTIES`; it did not work.
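For reference, the attempts above all had roughly this DDL shape (the table name, columns, and location are placeholders, not my actual schema):

```sql
-- Sketch of one OpenCSVSerde attempt; only the encoding value varied
CREATE EXTERNAL TABLE my_table (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'serialization.encoding' = 'windows-1252'
)
STORED AS TEXTFILE
LOCATION '/path/to/csv';
```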
I then switched over to LazySimpleSerDe, as I had read somewhere that it is compatible with accented characters. I tried setting `serialization.encoding = 'windows-1252'` inside `WITH SERDEPROPERTIES`, which worked, but it brought a new error: some of the text columns contained quotes, which split the data and loaded it incorrectly into the table.
- So I tried using `'quote.delim' = '"'` inside `WITH SERDEPROPERTIES`, which did not fix the incorrect data split.
- I tried using `'quoteChar' = '"'` inside `WITH SERDEPROPERTIES`, which also did not fix the incorrect data split.
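The LazySimpleSerDe attempts looked roughly like this (again, the names and path are placeholders; the quote property is shown as I declared it, even though it had no effect):

```sql
-- Sketch of the LazySimpleSerDe attempt
CREATE EXTERNAL TABLE my_table (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'quoteChar'   = '"',                       -- did not prevent the bad split
  'serialization.encoding' = 'windows-1252'  -- accents displayed correctly
)
STORED AS TEXTFILE
LOCATION '/path/to/csv';
```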
I reverted to OpenCSVSerde and tried using `serialization.encoding = 'ISO-8859-1'` inside `WITH SERDEPROPERTIES`, as well as `store.charset = 'ISO-8859-1'` and `retrieve.charset = 'ISO-8859-1'` inside `TBLPROPERTIES`. This solved the incorrect data split but brought me back to not being able to print the accented characters. I also tried `serialization.encoding = 'utf-16'` inside `WITH SERDEPROPERTIES`, which, unsurprisingly, did not resolve the issue either.
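This last attempt combined both sets of properties, roughly like so (names and path are placeholders):

```sql
-- Sketch of the reverted OpenCSVSerde attempt with the charset TBLPROPERTIES
CREATE EXTERNAL TABLE my_table (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'serialization.encoding' = 'ISO-8859-1'
)
STORED AS TEXTFILE
LOCATION '/path/to/csv'
TBLPROPERTIES (
  'store.charset'    = 'ISO-8859-1',
  'retrieve.charset' = 'ISO-8859-1'
);
```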
Can anyone tell me how I can get OpenCSVSerde to print the accented letters correctly?