Back Story
I retrieve strings from a database, alter some of the text of those strings, and then upload them back to the database, replacing the originals. After looking at the front end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.
The Issue
These strings contain characters from other languages, and they no longer display correctly. I looked at the code points, and it appears that an original character, which was one code point, is now two different code points.
"Je?ro^me" //code-points 8. Code-points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme" //code-points 6. Code-points: 74, 233, 114, 243, 109, 101
The question
How do I get "Je?ro^me" back to "Jéróme"?
Things that I have tried
- Used Notepad++ to convert the encoding to or from UTF8, ANSI, and WINDOWS-1252.
- Created a Map that looks for things like e? and converts them to é.
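A minimal sketch of what the encoding conversions amount to, assuming the stored string really contains the literal code points 63 (?) and 94 (^) listed above. The class name ReencodeCheck and the helper reencode are hypothetical names used only for illustration. Every character in "Je?ro^me" is plain ASCII, and ASCII bytes are identical in UTF-8, ISO-8859-1, and windows-1252, so encoding with one charset and decoding with another returns the same string:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeCheck {
    // Encodes the text with one charset and decodes the resulting bytes with another.
    // This only approximates what an editor's "convert encoding" action does.
    static String reencode(String text, Charset from, Charset to) {
        return new String(text.getBytes(from), to);
    }

    public static void main(String[] args) {
        String damaged = "Je?ro^me";
        Charset cp1252 = Charset.forName("windows-1252");

        // All three lines print Je?ro^me, because every character is ASCII.
        System.out.println(reencode(damaged, StandardCharsets.UTF_8, cp1252));
        System.out.println(reencode(damaged, cp1252, StandardCharsets.UTF_8));
        System.out.println(reencode(damaged, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

If that assumption holds, the accent information is no longer present in the bytes, and no choice of charset can bring it back.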
Issues with the two attempts to solve the problem
a. The issue still existed after trying different conversions.
b. Two issues here:
- I don't know all of the potential e?, o^, etc. to look for. There are over 20,000 files that may cover many languages.
- What if I have a sentence that ends in e?
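The second concern can be reproduced with a small sketch. The class name MapReplaceCheck and the helper applyMap are hypothetical, and the two map entries are the same ones used in the MCVE. The accented characters are written as Unicode escapes so the result does not depend on the source file's encoding:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapReplaceCheck {
    // Applies every replacement in the map to the text, in insertion order.
    static String applyMap(String text, Map<String, String> replacements) {
        String result = text;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("e?", "\u00e9"); // é
        map.put("o^", "\u00f3"); // ó

        // Intended case: the damaged name is repaired (6 code points).
        System.out.println(applyMap("Je?ro^me", map));

        // False positive: a genuine question mark after an "e" is also replaced,
        // so the sentence loses its punctuation and ends in "the" + é.
        System.out.println(applyMap("How old are thee?", map));
    }
}
```

A plain substring replacement cannot tell a damaged accent from legitimate punctuation, which is why the map alone does not seem safe to run across 20,000 files.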
Things I researched to gain a better understanding of the issue
- What is a "surrogate pair" in Java?
- https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
- https://www.w3.org/International/questions/qa-what-is-encoding
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
MCVE
import java.util.HashMap;
import java.util.Map;

/**
 * https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 * https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 * https://www.w3.org/International/questions/qa-what-is-encoding
 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {
    // Static so that the lambda below can reassign it.
    static String outputString;

    public static void main(String[] args) {
        // My approach to fix the issue:
        // Use a map to replace each damaged sequence with the correct character.
        // The output looks good, but I would need to include all special characters for many languages.
        // What if I have a sentence like: How old are thee?
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");

        final String string = "Je?ro^me";
        final String accentString = "Jéróme";
        outputString = string;
        map.forEach((t, u) -> {
            if (outputString.contains(t)) {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);
        System.out.println("");
        // End of my attempt at a solution.

        // Print the code points of the damaged string.
        System.out.println("code points: " + string.codePoints().count());
        for (int i = 0; i < string.length(); i++) {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");

        // Print the code points of the expected string.
        System.out.println("code points: " + accentString.codePoints().count());
        for (int i = 0; i < accentString.length(); i++) {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");

        // Print the code points of the string produced by the map replacement.
        System.out.println("code points: " + outputString.codePoints().count());
        for (int i = 0; i < outputString.length(); i++) {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }
        System.out.println("");
    }
}