What could be the regular expression to detect a multi byte string.
For example here is the expression to detect a string in english
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費
What could be the regular expression to detect a multi byte string.
For example here is the expression to detect a string in english
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費
You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That would turn the predefined \w into the Unicode version, means it would match all Unicode letters and digits (and string connecting characters like _)
So to match your string コメント_1050_固-減価償却費, you could use
Pattern p=Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
This would match any string consisting of letters, digits and _
and here on regular-expression.info an overview over the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an updated what has changed with Java 7 (or will be in Java 8)
If you want to detect whether you have a multi-byte strings you cna look at the length
if (text.length() != text.getBytes(encoding).length)
This will detect that a multi-byte character has been used for any encoding.
 
    
    Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
