Regular expression for Multi Bytes string

Question

What would be a regular expression to detect a multi-byte string?

For example, here is an expression that detects a string in English:

Pattern p=Pattern.compile("[a-zA-Z/]");

Similarly, I want a pattern that matches multi-byte text like

コメント_1050_固-減価償却費

AFAIK, Java uses UTF-16 internally, i.e. all strings are multibyte. You may input symbols with char `code > 127` just as Latin ones, either in their normal form: `ン` or as a Unicode escape: `\u30F3` – kirilloid, Mar 29 '12 at 07:25

score 3 · Accepted Answer · edited May 23 '17 at 11:44

You may want to have a look at Unicode Support in Java

I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".

So your regex could look like this

Pattern p=Pattern.compile("[\\p{L}/]");

I just replaced the character ranges a-zA-Z with \p{L}
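To see the difference between the two character classes, here is a minimal sketch. The class and method names (LetterClassDemo, hasAsciiLetter, hasAnyLetter) are made up for illustration, and the Japanese text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class LetterClassDemo {
    // Original class from the question: ASCII letters and '/'
    static final Pattern ASCII_LETTER = Pattern.compile("[a-zA-Z/]");
    // Same class with the ranges replaced by the Unicode "letter" property
    static final Pattern ANY_LETTER = Pattern.compile("[\\p{L}/]");

    static boolean hasAsciiLetter(String s) {
        return ASCII_LETTER.matcher(s).find();
    }

    static boolean hasAnyLetter(String s) {
        return ANY_LETTER.matcher(s).find();
    }

    public static void main(String[] args) {
        // First four characters of the sample string (katakana)
        String katakana = "\u30B3\u30E1\u30F3\u30C8";
        System.out.println(hasAsciiLetter(katakana)); // false
        System.out.println(hasAnyLetter(katakana));   // true
        System.out.println(hasAnyLetter("1050"));     // false, digits are not letters
    }
}
```

Both patterns use find(), so they report whether the input contains at least one matching character, not whether the whole input consists of such characters.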

Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

That would turn the predefined \w into its Unicode version, meaning it would match all Unicode letters and digits (and connector characters like _)

Your string コメント_1050_固-減価償却費 also contains a hyphen, which \w does not match, so add it to the class. To match the whole string, you could use

Pattern p=Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);

This would match any string consisting of letters, digits, _ and -
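As a quick check of what the flag changes, here is a small sketch under a few assumptions: the class and method names (UnicodeWordDemo, isWord, isWordOrHyphen, isAsciiWord) are hypothetical, and the sample string from the question is written with Unicode escapes. Note that the sample contains a hyphen, which \w does not cover, so the sketch also tries a class with - added.

```java
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // With the flag, the word class covers Unicode letters, digits and '_'
    static final Pattern WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
    // Same, but the hyphen is allowed as well
    static final Pattern WORD_OR_HYPHEN =
            Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);
    // Without the flag, the word class is ASCII-only
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    static boolean isWordOrHyphen(String s) {
        return WORD_OR_HYPHEN.matcher(s).matches();
    }

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample string from the question, up to the second number
        String noHyphen = "\u30B3\u30E1\u30F3\u30C8_1050";
        // Full sample string from the question, including the hyphen
        String sample = "\u30B3\u30E1\u30F3\u30C8_1050_\u56FA-\u6E1B\u4FA1\u511F\u5374\u8CBB";

        System.out.println(isAsciiWord(noHyphen));  // false
        System.out.println(isWord(noHyphen));       // true
        System.out.println(isWord(sample));         // false, because of the hyphen
        System.out.println(isWordOrHyphen(sample)); // true
    }
}
```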

See here for more details

and here on regular-expressions.info an overview of the Unicode scripts, properties and blocks.

See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8)

But I tried the following code with JRE 8 and it still gives the wrong result: String input = ""; Pattern p = Pattern.compile("[\\u2A775]", Pattern.UNICODE_CHARACTER_CLASS); System.out.println("In Range :" + p.matcher(input).find()); – Daniel Yang, Jul 07 '16 at 03:25
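One possible explanation for the result in this comment, offered as a sketch rather than a diagnosis: the \u escape in a Java regex takes exactly four hex digits, so the class in the comment is read as U+2A77 plus the digit 5. Since Java 7 a code point above U+FFFF can be written as \x{2A775} instead. The class name SupplementaryDemo and its methods are hypothetical, and the test input is built with Character.toChars because the input string in the comment is empty as quoted.

```java
import java.util.regex.Pattern;

public class SupplementaryDemo {
    // Parsed as a class of two members: U+2A77 and the digit '5'
    static final Pattern FOUR_DIGIT = Pattern.compile("[\\u2A775]");
    // Braced hex form, which accepts a full code point (Java 7 and later)
    static final Pattern CODE_POINT = Pattern.compile("[\\x{2A775}]");

    static boolean matchesFourDigit(String s) {
        return FOUR_DIGIT.matcher(s).find();
    }

    static boolean matchesCodePoint(String s) {
        return CODE_POINT.matcher(s).find();
    }

    public static void main(String[] args) {
        // U+2A775 as a Java string (a surrogate pair, so length() is 2)
        String supplementary = new String(Character.toChars(0x2A775));

        System.out.println(matchesFourDigit(supplementary)); // false
        System.out.println(matchesFourDigit("5"));           // true
        System.out.println(matchesCodePoint(supplementary)); // true
    }
}
```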

score 2 · Answer 2 · answered Mar 29 '12 at 07:34

If you want to detect whether you have a multi-byte string, you can compare the lengths

if (text.length() != text.getBytes(encoding).length)

This detects whether any character takes more than one byte in the given encoding, so the result depends on which encoding you pass.
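A minimal runnable version of this check might look as follows. The class and method names (MultiByteCheck, hasMultiByte) are assumptions for illustration, and a Charset is passed instead of an encoding name to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MultiByteCheck {
    // True when the encoded form has a different number of bytes than the string has chars
    static boolean hasMultiByte(String text, Charset encoding) {
        return text.length() != text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String katakana = "\u30B3\u30E1\u30F3\u30C8"; // four katakana characters

        System.out.println(hasMultiByte("abc", StandardCharsets.UTF_8));    // false
        System.out.println(hasMultiByte(katakana, StandardCharsets.UTF_8)); // true
        // In UTF-16 every char takes two bytes and a byte order mark is added,
        // so even plain ASCII text is reported as multi-byte
        System.out.println(hasMultiByte("abc", StandardCharsets.UTF_16));   // true
    }
}
```

The UTF-16 case shows why the choice of encoding matters: the comparison only says something useful for encodings in which ASCII characters take a single byte, such as UTF-8.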

Peter Lawrey


It's not as fast as some of the other suggestions, but it's more general and you can be sure what it does. – Peter Lawrey Mar 29 '12 at 11:43

score 1 · Answer 3 · answered Mar 29 '12 at 07:28

Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.

If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
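The loop described above can be sketched like this; Utf8Scan and hasNonAscii are hypothetical names chosen for the example.

```java
public class Utf8Scan {
    // True if any char is above 127, i.e. would need more than one byte in UTF-8
    static boolean hasNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNonAscii("abc_1050"));             // false
        System.out.println(hasNonAscii("abc_\u30B3\u30E1_1050")); // true
    }
}
```

Characters outside the Basic Multilingual Plane are stored as two surrogate chars, both of which are above 127, so the loop reports them as well.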

If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
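A short sketch of the regex variant, with hypothetical names (NonAsciiRegex, hasBmpNonAscii, hasNonAscii). Alongside the range suggested above it shows a negated ASCII class, which is an alternative swapped in here because it also covers code points above U+FFFF without relying on how the engine treats surrogates.

```java
import java.util.regex.Pattern;

public class NonAsciiRegex {
    // Range from the answer: everything from U+0080 to U+FFFF
    static final Pattern BMP_NON_ASCII = Pattern.compile("[\\u0080-\\uFFFF]");
    // Alternative: any code point that is not in the ASCII range
    static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    static boolean hasBmpNonAscii(String s) {
        return BMP_NON_ASCII.matcher(s).find();
    }

    static boolean hasNonAscii(String s) {
        return NON_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        String katakana = "\u30B3\u30E1\u30F3\u30C8";
        // A code point above U+FFFF, stored as a surrogate pair
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(hasBmpNonAscii("abc"));      // false
        System.out.println(hasBmpNonAscii(katakana));   // true
        System.out.println(hasNonAscii("abc"));         // false
        System.out.println(hasNonAscii(supplementary)); // true
    }
}
```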

score 0 · Answer 4 · answered Mar 29 '12 at 07:23

You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.

npinti


score 0 · Answer 5 · answered Mar 29 '12 at 07:23

There is a nice introduction to Unicode regex here.

David Brabant

