I am writing a server program that reads a UTF-8 encoded byte stream from a network socket and continuously decodes the incoming characters.
For characters that take more than one byte to represent, sometimes only the first byte of the character has arrived on the socket, and the program decodes that lone byte as an invalid character.
For example, the client runs the following code:
  String s = "Cañ";                // 'ñ' takes two bytes in UTF-8, so b has 4 bytes
  byte[] b = s.getBytes("UTF-8");
  // sending first three bytes - this splits 'ñ' in half
  send(b, 0, 3);   // send(byte[], offset, length)
  // sending last byte
  send(b, 3, 1);
When the server receives the first three bytes, it decodes them to Ca? (the split 'ñ' becomes the replacement character).
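To make the failure concrete, here is a minimal reproduction of the decode I am seeing. I am assuming the server-side decoding is roughly equivalent to new String(bytes, offset, length, "UTF-8"); the class name Repro is just for illustration:

  import java.nio.charset.StandardCharsets;

  public class Repro {
      public static void main(String[] args) {
          byte[] b = "Cañ".getBytes(StandardCharsets.UTF_8); // 4 bytes: 43 61 C3 B1
          // decoding only the first 3 bytes cuts 'ñ' in half
          String s = new String(b, 0, 3, StandardCharsets.UTF_8);
          System.out.println(s); // prints "Ca\uFFFD", which displays as "Ca?"
      }
  }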
How can I detect character boundaries on the server?
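Is java.nio.charset.CharsetDecoder the right tool here? Below is a rough, untested sketch of what I have in mind: a decoder kept per connection and fed through a ByteBuffer, so an incomplete trailing sequence is carried over to the next read instead of being decoded too early. The class name, buffer sizes, and the simulated packet split are my assumptions for illustration.

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.CharsetDecoder;
  import java.nio.charset.StandardCharsets;

  public class BoundarySafeDecode {
      public static void main(String[] args) {
          byte[] b = "Cañ".getBytes(StandardCharsets.UTF_8);
          // one decoder per connection: it remembers partial sequences between reads
          CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
          ByteBuffer in = ByteBuffer.allocate(16);
          CharBuffer out = CharBuffer.allocate(16);

          // first "packet": 3 bytes, which splits 'ñ'
          in.put(b, 0, 3).flip();
          dec.decode(in, out, false);   // false = more input may follow
          in.compact();                 // keep the dangling lead byte of 'ñ'
          out.flip();
          System.out.println(out);      // prints "Ca" - no bogus character
          out.clear();

          // second "packet": the last byte completes 'ñ'
          in.put(b, 3, 1).flip();
          dec.decode(in, out, false);
          in.compact();
          out.flip();
          System.out.println(out);      // prints "ñ"
      }
  }

My understanding is that passing false as the endOfInput argument tells the decoder that more input may follow, so it holds back the incomplete bytes (left in the buffer by compact()) rather than emitting a replacement character. Is this the intended usage?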
The client code above is contrived just to reproduce the issue; I believe the real cause is that TCP sometimes splits a multi-byte character across segments.