Character Issues

Question

Back Story

I retrieve strings from a database, alter some of the text in those strings, and then upload them back to the database, replacing the originals. After looking at the front-end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.

The Issue

These strings contain characters from other languages, and those characters no longer display correctly. I looked at the code-points, and it appears that each original character, which was one code-point, is now two different code-points.

"Je?ro^me" //code-points 8. Code-points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme" //code-points 6.   Code-points: 74,   233,   114,    243,  109, 101

The question

How do I get "Je?ro^me" back to "Jéróme"?

Things that I have tried

a. Used Notepad++ to convert the encoding to or from UTF8, ANSI, and WINDOWS-1252.
b. Created a Map that looks for sequences like e? and converts them to é.

Issues with the two attempts to solve the problem

a. The issue still existed after trying different conversions.

b. Two issues here:

I don't know all of the potential e?, o^, etc to look for. There are over 20,000 files that may cover many languages.
What if I have a sentence that ends in e?

Things I researched to gain a better understanding of the issue

MCVE

import java.util.HashMap;
import java.util.Map;

/**
 *https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 *https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 *https://www.w3.org/International/questions/qa-what-is-encoding
 *https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {
    
    static String outputString; 
    
    public static void main(String[] args) {
        
        //My approach to fix the issue
        //Use a map to replace string issue with the correct character
        //The output looks good, but I would need to include all special characters for many languages.
        //What if I have a sentence like: How old are thee? 
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");
        
        final String string = "Je?ro^me";
        final String accentString = "Jéróme";
        outputString = string;
        map.forEach((t, u) -> {
            if(outputString.contains(t))
            {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);        
        System.out.println("");                    
        //End of my attempt at a solution.
        
        System.out.println("code points: " + string.codePoints().count());                
        for(int i = 0; i < string.length(); i++)
        {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");    
        
        System.out.println("code points: " + accentString.codePoints().count());                
        for(int i = 0; i < accentString.length(); i++)
        {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");    
          
        System.out.println("code points: " + outputString.codePoints().count());  
        for(int i = 0; i < outputString.length(); i++)
        {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }        
        System.out.println("");  
    }
}

kshetline · Accepted Answer · 2020-09-02T18:45:36.847


The fact that one of your code points is 63 (a question mark) means that you won't be able to reliably revert that data to the original format. The ? can represent many different characters that weren't properly decoded, which means you've lost vital information for restoring the original characters.

What you need to do is establish the correct encoding to use when you read from your database in the first place. Since you haven't posted the code where you read these strings, I can't tell you exactly how or where to do that.
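As a rough illustration of what "establishing the correct encoding" means, here is a minimal sketch that decodes the same bytes with two different charsets. The class name `ExplicitCharsetDemo` and the helper `decode` are hypothetical, and the byte array merely stands in for whatever your database driver hands back; it is an assumption that the stored bytes are UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Decode raw bytes with a charset chosen by the caller
    // instead of relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from the database (assumed to be UTF-8 here).
        byte[] raw = "J\u00e9r\u00f3me".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the accented characters survive.
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        // Mismatched charset: each accented character turns into two wrong ones.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

The point is only that the charset used for decoding has to match the one used for encoding. Where that choice is made in a real program depends on how the strings are read, which the question does not show.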

Hopefully the data in the DB itself hasn't already been corrupted by bad character encoding, or else you've already lost the information you need.

You might be able to partially repair such damage by doing things like replacing "o^" with "ó", but if, say, both "è" and "é" turn into "e?", you can never be sure which was which.
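To make the information loss concrete, here is a small sketch of one way such damage can arise. It assumes, purely for illustration, that the accented letters were stored in decomposed form (base letter plus combining accent) and then pushed through ISO-8859-1, which has no combining accents. Whether that is what happened to your data is not known, and the class name `LossyEncodingDemo` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {

    // Encode to ISO-8859-1 and decode again. String.getBytes(Charset)
    // replaces characters the charset cannot represent with '?'.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String acute = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT
        String grave = "e\u0300"; // 'e' + COMBINING GRAVE ACCENT

        System.out.println(roundTrip(acute)); // e?
        System.out.println(roundTrip(grave)); // e?

        // Both inputs collapse to the same output, so the accent cannot be recovered.
        System.out.println(roundTrip(acute).equals(roundTrip(grave))); // true
    }
}
```

Once two different inputs produce the same "e?", no amount of re-encoding can tell them apart again, which is why any repair has to rely on guesses or on an undamaged copy of the data.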

edited Sep 02 '20 at 18:45

answered Sep 02 '20 at 17:58


Thanks for the answer. The DB data is corrupt. That's why I was hoping for a way to revert the strings. It looks like this is going to need to be done manually. I had the original strings from the DB, but I ran the program again and they were replaced with the bad strings. Any clues on how to identify strings that are bad? – SedJ601 Sep 02 '20 at 18:04

The only way I can think of fixing this is by looking for oddly-placed characters, like the embedded question marks and circumflexes and the like, then doing your best-guess replacements of them. If you're good at using regexes, that'll make your life a lot easier. – kshetline Sep 02 '20 at 18:42
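A minimal sketch of that regex idea, limited to flagging candidates for manual review rather than repairing them. The class name `SuspiciousStringFinder`, the method `looksDamaged`, and the choice of `?` and `^` as the only marker characters are assumptions for illustration; a real data set would probably need more markers.

```java
import java.util.regex.Pattern;

class SuspiciousStringFinder {

    // A '?' or '^' with a letter on both sides, e.g. "e?r" or "o^m".
    private static final Pattern EMBEDDED_MARK = Pattern.compile("\\p{L}[?^]\\p{L}");

    // True if the string contains an oddly placed '?' or '^' inside a word.
    static boolean looksDamaged(String s) {
        return EMBEDDED_MARK.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"Je?ro^me", "How old are thee?", "Jerome"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksDamaged(s));
        }
        // Je?ro^me -> true
        // How old are thee? -> false
        // Jerome -> false
    }
}
```

Requiring a letter after the marker keeps an ordinary sentence-ending question mark from being flagged. The trade-off is that a damaged character at the end of a word (for example "cafe?") is missed, and text such as "What?No" is flagged by mistake, so the result is a shortlist for a human to check, not a verdict.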
