After some digging I was able to find a solution based on this answer, using org.apache.lucene.analysis.ASCIIFoldingFilter.
All the examples I could find used the static method foldToASCII, as in this project:
private static String getFoldedString(String text) {
    char[] textChar = text.toCharArray();
    char[] output = new char[textChar.length * 4];
    int outputPos = ASCIIFoldingFilter.foldToASCII(textChar, 0, output, 0, textChar.length);
    text = new String(output, 0, outputPos);
    return text;
}
However, that static method carries a note saying:
This API is for internal purposes only and might change in incompatible ways in the next release.
So, after some trial and error, I came up with this version that avoids the static method:
public static String getFoldedString(String text) throws IOException {
    String output = "";
    // KeywordTokenizer emits the whole input as a single token,
    // which the ASCII folding filter then rewrites.
    try (Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer(KeywordTokenizerFactory.class)
            .addTokenFilter(ASCIIFoldingFilterFactory.class)
            .build()) {
        try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
            CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            // Only one token is expected, so a single incrementToken() call is enough.
            if (ts.incrementToken()) output = charTermAtt.toString();
            ts.end();
        }
    }
    return output;
}
This is similar to an answer I provided here.
It does exactly what I was looking for: it translates characters to their 7-bit ASCII equivalents.
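For readers who want to see what "folding" means without pulling in Lucene, here is a rough JDK-only sketch. It is not the Lucene approach described above: it uses java.text.Normalizer to decompose accented letters and then drops the combining marks. The class and method names (AccentStripper, stripAccents) are made up for illustration, and the coverage is narrower than ASCIIFoldingFilter (for example, it leaves ß and Œ untouched because they have no canonical decomposition).

```java
import java.text.Normalizer;

final class AccentStripper {
    /**
     * Decomposes each character (NFD) and removes the combining marks,
     * so "é" becomes "e". Characters without a decomposition are kept as-is.
     */
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00e9"));
    }
}
```

Calling stripAccents on "café" gives "cafe", while a string such as "ß" comes back unchanged, which is one reason the Lucene filter may still be the better fit for full ASCII folding.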
However, further research showed that, because I am mostly dealing with Windows-1252 encoding and because of the way jt400 handles ASCII <-> EBCDIC (CCSID 37) translation, the only characters lost when an ASCII string is translated to EBCDIC and back to ASCII are 0x80 through 0x9f. So, inspired by the way Lucene's foldToASCII handles it, I put together the following method that handles only these cases:
public static String replaceInvalidChars(String text) {
    char[] input = text.toCharArray();
    int length = input.length;
    char[] output = new char[length * 6];
    int outputPos = 0;
    for (int pos = 0; pos < length; pos++) {
        final char c = input[pos];
        if (c < '\u0080') {
            output[outputPos++] = c;
        } else {
            switch (c) {
                case '\u20ac': //€ 0x80
                    output[outputPos++] = 'E';
                    output[outputPos++] = 'U';
                    output[outputPos++] = 'R';
                    break;
                case '\u201a': //‚ 0x82
                    output[outputPos++] = '\'';
                    break;
                case '\u0192': //ƒ 0x83
                    output[outputPos++] = 'f';
                    break;
                case '\u201e': //„ 0x84
                    output[outputPos++] = '"';
                    break;
                case '\u2026': //… 0x85
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    break;
                case '\u2020': //† 0x86
                    output[outputPos++] = '?';
                    break;
                case '\u2021': //‡ 0x87
                    output[outputPos++] = '?';
                    break;
                case '\u02c6': //ˆ 0x88
                    output[outputPos++] = '^';
                    break;
                case '\u2030': //‰ 0x89
                    output[outputPos++] = 'p';
                    output[outputPos++] = 'e';
                    output[outputPos++] = 'r';
                    output[outputPos++] = 'm';
                    output[outputPos++] = 'i';
                    output[outputPos++] = 'l';
                    break;
                case '\u0160': //Š 0x8a
                    output[outputPos++] = 'S';
                    break;
                case '\u2039': //‹ 0x8b
                    output[outputPos++] = '\'';
                    break;
                case '\u0152': //Œ 0x8c
                    output[outputPos++] = 'O';
                    output[outputPos++] = 'E';
                    break;
                case '\u017d': //Ž 0x8e
                    output[outputPos++] = 'Z';
                    break;
                case '\u2018': //‘ 0x91
                    output[outputPos++] = '\'';
                    break;
                case '\u2019': //’ 0x92
                    output[outputPos++] = '\'';
                    break;
                case '\u201c': //“ 0x93
                    output[outputPos++] = '"';
                    break;
                case '\u201d': //” 0x94
                    output[outputPos++] = '"';
                    break;
                case '\u2022': //• 0x95
                    output[outputPos++] = '-';
                    break;
                case '\u2013': //– 0x96
                    output[outputPos++] = '-';
                    break;
                case '\u2014': //— 0x97
                    output[outputPos++] = '-';
                    break;
                case '\u02dc': //˜ 0x98
                    output[outputPos++] = '~';
                    break;
                case '\u2122': //™ 0x99
                    output[outputPos++] = '(';
                    output[outputPos++] = 'T';
                    output[outputPos++] = 'M';
                    output[outputPos++] = ')';
                    break;
                case '\u0161': //š 0x9a
                    output[outputPos++] = 's';
                    break;
                case '\u203a': //› 0x9b
                    output[outputPos++] = '\'';
                    break;
                case '\u0153': //œ 0x9c
                    output[outputPos++] = 'o';
                    output[outputPos++] = 'e';
                    break;
                case '\u017e': //ž 0x9e
                    output[outputPos++] = 'z';
                    break;
                case '\u0178': //Ÿ 0x9f
                    output[outputPos++] = 'Y';
                    break;
                default:
                    output[outputPos++] = c;
                    break;
            }
        }
    }
    return new String(output, 0, outputPos);
}
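If the long switch is hard to maintain, the same idea can probably be expressed as a lookup table. The following is only a sketch of that variant, not the method above: the class name Cp1252Replacements and the method replace are hypothetical, and the table holds just a handful of the mappings to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

final class Cp1252Replacements {
    // Hypothetical lookup table; only a subset of the switch cases is listed.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u20ac', "EUR");   // euro sign, 0x80
        TABLE.put('\u2026', "...");   // horizontal ellipsis, 0x85
        TABLE.put('\u0152', "OE");    // 0x8c
        TABLE.put('\u2018', "'");     // left single quote, 0x91
        TABLE.put('\u2019', "'");     // right single quote, 0x92
        TABLE.put('\u201c', "\"");    // left double quote, 0x93
        TABLE.put('\u201d', "\"");    // right double quote, 0x94
        TABLE.put('\u2013', "-");     // en dash, 0x96
        TABLE.put('\u2014', "-");     // em dash, 0x97
        TABLE.put('\u2122', "(TM)");  // trade mark sign, 0x99
        TABLE.put('\u0153', "oe");    // 0x9c
    }

    /** Replaces characters found in TABLE; everything else is copied unchanged. */
    static String replace(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Plain ASCII never needs a lookup.
            String replacement = (c < '\u0080') ? null : TABLE.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A StringBuilder also removes the need to size the output array for the worst case, at the cost of a map lookup per non-ASCII character.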
Since my real problem turned out to be Windows-1252 to Latin-1 (ISO-8859-1) translation, here is some supporting material showing the Windows-1252 to Unicode mapping used in the method above to ultimately get Latin-1 encoding.
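To check the 0x80 through 0x9f claim on your own JDK, a small sketch like the one below may help. It decodes every byte value as windows-1252 and reports those whose character cannot be encoded as ISO-8859-1. The class and method names (Cp1252Gap, bytesNotInLatin1) are made up for this example, and it uses ISO-8859-1 as a stand-in for the jt400 CCSID 37 round trip, on the assumption that both cover the same character repertoire.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Cp1252Gap {
    /** Returns the Windows-1252 byte values whose decoded character has no ISO-8859-1 encoding. */
    static List<Integer> bytesNotInLatin1() {
        Charset cp1252 = Charset.forName("windows-1252");
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        List<Integer> lost = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            // Decode a single byte, then ask whether Latin-1 can represent the result.
            String decoded = new String(new byte[] {(byte) b}, cp1252);
            if (!latin1.canEncode(decoded)) {
                lost.add(b);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        for (int b : bytesNotInLatin1()) {
            System.out.printf("0x%02x%n", b);
        }
    }
}
```

Every value it prints should fall between 0x80 and 0x9f, which is the range the replaceInvalidChars method targets.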