Using str_word_count for UTF8 texts

Question

I have this text:

$text  = "Başka, küskün otomobil kaçtı buraya küskün otomobil neden kaçtı
          kaçtı buraya, oraya KISMEN @here #there J.J.Johanson hep.
          Danny:Where is mom? I don't know! Café est weiß for 2 €uros.
          My 2nd nickname is mike18.";

Until recently I was using this:

$a1= array_count_values(str_word_count($text, 1, 'ÇçÖöŞşİIıĞğÜü@#éß€1234567890'));
arsort($a1);

You can check with this fiddle:
http://ideone.com/oVUGYa

But this doesn't solve all UTF8 problems: I can't pass the whole UTF8 character set to str_word_count as a parameter.

So I created this:

$wordsArray = explode(" ",$text);
foreach ($wordsArray as $k => $w) {
    $wordsArray[$k] = str_replace(array(",","."),"",$w);
}
$wordsArray2 = array_count_values($wordsArray);
arsort($wordsArray2);

Output should be like this:

Array (
 [kaçtı] => 3
 [küskün] => 2
 [buraya] => 2
 [@here] => 1
 [#there] => 1
 [Danny] => 1
 [mom] => 1
 [don't] => 1
 [know] => 1
 ...
 ...
)

This works well, but it doesn't cover every word-boundary case. For example, I had to strip commas and dots with str_replace.

It also doesn't handle input like this: Hello Mike,how are you ? Here Mike and how won't be treated as separate words.

This isn't covered by the str_word_count solution: KISMEN @here #there. The at sign and the hash sign won't be taken into consideration.

J.J.Johanson won't be handled either. Although it is a single word, it will be treated as JJJohanson.

Question marks and exclamation marks should be removed from words.

Is there a better way to get str_word_count behaviour with UTF8 support? The $text at the top of this question is my reference input.

(It would be better if you could provide a fiddle with your answer.)

I can think of some solutions... but they would mean you get `here` & `there` instead of `@here` & `#there`, would this be acceptable? — Wrikken, Feb 09 '14 at 01:22
Unfortunately I don't prefer to lose `@here` & and `#there`. Because mostly we analyze tweets. — trante, Feb 09 '14 at 12:51
read this also: http://stackoverflow.com/questions/8290537/is-php-str-word-count-multibyte-safe — , Feb 18 '14 at 02:06

Dennis C · Answer 1 · 2017-02-02T05:56:12.573

7

You will never have a perfect solution for word counting, because the concept of a word does not exist, or is too hard to pin down, in some languages. Whether the text is UTF8 or not does not matter.

Japanese and Chinese are not space-tokenized languages. They don't even have a static word list; you have to read the whole sentence before you can find the verbs and nouns.

If you want to support multiple languages, you will need a language-specific tokenizer engine. For more information, research full-text indexing, tokenizers, CJK tokenizers, and CJK analyzers.

If you only want to support a limited set of languages, just keep improving your regex patterns to cover more and more cases.
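
For the limited-language case, here is a minimal sketch of what such a regex might look like. It matches words instead of splitting on separators, and it uses PCRE Unicode property classes so that no alphabet has to be listed by hand. The helper name countWords and the choice of which characters count as part of a word (@ and # prefixes, currency symbols, inner dots and apostrophes) are assumptions based on the sample text in the question, not something this answer prescribes.

```php
<?php
// Hypothetical helper (name and rules are assumptions, not from the answer).
function countWords(string $text): array
{
    // Optional @ or # prefix, then letters, combining marks, digits or
    // currency symbols. A dot or apostrophe is kept only when it sits
    // between two word characters (J.J.Johanson, don't).
    $pattern = '/[@#]?[\p{L}\p{M}\p{N}\p{Sc}]+(?:[.\'’][\p{L}\p{M}\p{N}]+)*/u';
    preg_match_all($pattern, $text, $matches);

    // Count each distinct token and sort by frequency, highest first.
    $counts = array_count_values($matches[0]);
    arsort($counts);
    return $counts;
}

$sample = "Hello Mike,how are you ? J.J.Johanson said hi to @here and #there. I don't know!";
print_r(countWords($sample));
```

The trade-off is that a missing space after a full stop (end.Next) would be read as one token, because inner dots are kept for names like J.J.Johanson.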

edited Feb 02 '17 at 05:56

answered Feb 18 '14 at 06:13

Dennis C


Korean IS space tokenized, strictly. – Calvin Caulfield Dec 27 '15 at 03:44
Actually in Chinese, we seldom count "words" of something and just count characters of it only :) (e.g. 桌子 we will count as 2 words instead of 1) – He Yifei 何一非 Jan 01 '16 at 12:53
@Arefly there are two chars and one noun in your example. A noun is a word. I think it is easier to explain to non-chinese speakers by saying it is a word. – Dennis C Jan 02 '16 at 11:22

klugerama · Answer 2 · 2014-02-11T21:22:52.883

I think you're sort of on the right track with explode, but explode can't split on a regular expression.

Change your code to:

$namePattern = '/[\s,:?!]+/u';
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
$wordsArray2 = array_count_values($wordsArray);
arsort($wordsArray2);
print_r($wordsArray2);

Of course, you may need to tweak the regex ($namePattern) to meet your needs.

Fiddle: http://ideone.com/JoIJqv
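
One tweak that may be needed: the dot is not in the split class (so that J.J.Johanson stays intact), which means a word at the end of a sentence keeps its trailing dot (hep. rather than hep). A possible sketch, assuming dots should only be stripped at the edges of a token; the helper name splitWords is hypothetical:

```php
<?php
// Hypothetical helper: split on separators, then strip dots at token edges only.
function splitWords(string $text): array
{
    $tokens = preg_split('/[\s,:?!]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // trim() works on bytes, but "." is a single ASCII byte that never
    // occurs inside a multi-byte UTF-8 sequence, so this is safe here.
    $tokens = array_map(function ($t) {
        return trim($t, '.');
    }, $tokens);

    // Drop tokens that consisted only of dots.
    return array_values(array_filter($tokens, function ($t) {
        return $t !== '';
    }));
}

$words = array_count_values(splitWords("kaçtı buraya, oraya J.J.Johanson hep. Danny:Where is mom?"));
arsort($words);
print_r($words);
```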
