Unicode letter or character frequency in PHP or VB.NET

Question

I am writing a letter frequency analyzer program like the website http://www.characterfrequencyanalyzer.com/, but I don't know the right algorithm. A simple loop would work if the text contained only English letters, but the catch is that it should also work with Unicode letters such as Arabic, Chinese, etc.

How do I do this? I would be thankful for sample code in VB.NET or PHP, or an algol.

Thanks

score 1 · Accepted Answer · answered Feb 11 '11 at 08:58

Well ... maybe you should first ask yourself more precisely what it is that you want to measure; Chinese, for example, has no 'letters'.

Why not just use an associative array (character code -> counter)?
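A minimal sketch of the associative-array idea, written in Python only as a language-neutral illustration (the question asked for PHP or VB.NET). The function name `char_frequency` and the `letters_only` filter based on `str.isalpha()` are assumptions made for this example, not part of the answer above.

```python
from collections import Counter


def char_frequency(text, letters_only=True):
    """Count how often each Unicode code point occurs in text."""
    counts = Counter()  # the associative array: character -> counter
    for ch in text:  # iterating a str yields one code point at a time
        if letters_only and not ch.isalpha():
            continue  # skip whitespace, digits, punctuation, combining marks
        counts[ch] += 1
    return counts


if __name__ == "__main__":
    sample = "سلام salam"
    for ch, n in char_frequency(sample).most_common():
        print(f"U+{ord(ch):04X} {ch} {n}")
```

This counts code points, not user-perceived characters, so a letter followed by a combining accent is seen as two separate items. The same structure would presumably be an array keyed by character in PHP or a `Dictionary(Of String, Integer)` in VB.NET, provided the string is first split into code points rather than bytes.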

Raffael

Thanks, forget Chinese. Can you explain your method in more detail, with code in PHP or VB.NET? – Smith Feb 11 '11 at 09:23
+1 for "ask yourself more precisely". Many, many languages have accents, like é. In Unicode é can be represented as one 'character' or *code point* `U+00E9` or as **two** code points: standard lower case e `U+0065` and then a combining character for the accent `U+0301`. Smith, as I've said in comments on your other questions, if you want to go beyond English, you absolutely **must** learn about text encodings - there are great articles [out](http://www.joelonsoftware.com/articles/Unicode.html) [there](http://msmvps.com/blogs/jon_skeet/archive/2009/11/02/omg-ponies-aka-humanity-epic-fail.aspx). – MarkJ Feb 11 '11 at 13:31
The thing is, I am not a Unicode expert, so I can't give a complete solution here. But every Unicode character is identified by a code point, which you can use as your array key. Getting that to work might be tricky, though, because every step (something like site -> server -> PHP -> script) relies on appropriate and consistent handling of the character set. – Raffael Feb 11 '11 at 13:33
... or there are more links about Unicode [here](http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode) – MarkJ Feb 11 '11 at 13:33
@Raffael1984 @Smith No offence meant to anyone, but PHP has very poor support for Unicode. The question asks for solutions in PHP or VB.Net or algol. It'll be much, much easier in VB.Net than PHP (I don't know algol). For instance see my answer to Smith's [duplicate question here](http://stackoverflow.com/questions/4956255/extract-arabic-letter-from-sentence-or-word). BTW Raffael as it happens I am using an associative array like you suggest! The hard part at the moment is determining the text encoding of the original data file >: – MarkJ Feb 11 '11 at 13:38
Yeah, that's where the problem starts ... also, there are several different Unicode encodings (UTF-8, UTF-16 and so on). I'm not sure how that affects the identification of a character, but one has to make sure the encoding is handled consistently. It's definitely doable in PHP; Unicode is just not trivial. – Raffael Feb 11 '11 at 13:41
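The two concerns raised in these comments (a consistent encoding, and é existing as either `U+00E9` or `U+0065` + `U+0301`) can be sketched as below. This is an illustration in Python under stated assumptions: the input is assumed to be UTF-8 unless told otherwise, and the helper name `count_normalized` is hypothetical. Normalizing to NFC before counting is one reasonable choice, not the only one.

```python
import unicodedata
from collections import Counter


def count_normalized(raw, encoding="utf-8"):
    """Decode raw bytes, normalize to NFC, and count code points."""
    # Raises UnicodeDecodeError on bytes that are invalid in that encoding.
    text = raw.decode(encoding)
    # NFC merges a base letter plus combining mark into one code point
    # wherever a precomposed form exists.
    text = unicodedata.normalize("NFC", text)
    return Counter(text)


if __name__ == "__main__":
    composed = "\u00e9".encode("utf-8")     # é as a single code point
    decomposed = "e\u0301".encode("utf-8")  # e followed by a combining acute
    print(count_normalized(composed) == count_normalized(decomposed))
```

Without the normalization step the two spellings of é would land under different keys, which would split the count for what a reader sees as the same letter.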
