UTF-8 Encoding

In UTF-8, a single character can be 1-4 bytes long.

Each encoded character uses the first few bits of every byte as markers: the prefix on the first byte tells you how many bytes the sequence has, and every byte after that starts with 10 so it can't be mistaken for the start of a new character. There's some clever design in how the ranges line up, but the bit patterns below are all you really need in practice.

See http://en.wikipedia.org/wiki/UTF-8
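
To make those prefixes concrete, here's a minimal Java sketch (the class and method names are my own, just for illustration) that looks at a lead byte and reports how many bytes the sequence should be:

    public class Utf8LeadByte {
        // The prefix of the lead byte tells you how long the sequence is.
        static int byteLengthOf(int lead) {
            if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII, single byte
            if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: 2-byte sequence
            if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: 3-byte sequence
            if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: 4-byte sequence
            throw new IllegalArgumentException("continuation (10xxxxxx) or invalid byte");
        }

        public static void main(String[] args) {
            System.out.println(byteLengthOf(0x41)); // 'A' -> 1
            System.out.println(byteLengthOf(0xC3)); // lead byte of U+00E9 -> 2
            System.out.println(byteLengthOf(0xE2)); // lead byte of U+20AC -> 3
            System.out.println(byteLengthOf(0xF0)); // lead byte of U+1F600 -> 4
        }
    }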

U+0000 - U+007F is the simplest case: the ASCII values 0-127. All of these are encoded as a single byte, and the way you can tell is that the first bit is 0.

VALUE: 0xxxxxxx
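
For example, 'A' is U+0041 = 01000001: the high bit is 0, so UTF-8 stores it as the single byte 0x41, exactly like plain ASCII. A quick check from jshell:

    // 'A' is U+0041 = 01000001: high bit 0, so a single byte
    byte[] b = "A".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    System.out.printf("%02X%n", b[0]);   // prints 41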

U+0080 - U+07FF are the 2-byte characters. The first byte begins with 110, and the second byte begins with 10.

VALUE: 110yyyxx 10xxxxxx
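
For example, é is U+00E9, so y = 000 and x = 11101001, which packs into the two bytes 0xC3 0xA9. From jshell:

    // U+00E9: 110 000 11 | 10 101001  =  C3 A9
    byte[] b = "\u00E9".getBytes(java.nio.charset.StandardCharsets.UTF_8);  // é
    for (byte v : b) System.out.printf("%02X ", v);   // prints C3 A9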

U+0800 - U+FFFF are the 3-byte characters. The first byte begins with 1110, the second with 10, and the third again with 10.

VALUE: 1110yyyy 10yyyyxx 10xxxxxx
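
The euro sign € is U+20AC, which falls in this range: y = 00100000 and x = 10101100, giving the three bytes 0xE2 0x82 0xAC. From jshell:

    // U+20AC: 1110 0010 | 10 0000 10 | 10 101100  =  E2 82 AC
    byte[] b = "\u20AC".getBytes(java.nio.charset.StandardCharsets.UTF_8);  // €
    for (byte v : b) System.out.printf("%02X ", v);   // prints E2 82 AC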

U+10000 - U+10FFFF are the 4-byte characters. The first byte begins with 11110, and each byte thereafter begins with the bits 10.

VALUE: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx

Where x: the lowest 8 bits of the code point
Where y: the middle 8 bits
Where z: the highest 5 bits
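
Putting the four layouts together, here is a rough by-hand encoder in Java (the class and method names are mine, and it skips validation like rejecting surrogates, so treat it as a sketch; real code should just use String.getBytes or a CharsetEncoder):

    public class Utf8Encode {
        // Encode one Unicode code point by hand, following the bit layouts above.
        // Illustration only: no validation of surrogates or out-of-range values.
        static byte[] encode(int cp) {
            if (cp < 0x80) {                               // 0xxxxxxx
                return new byte[] { (byte) cp };
            } else if (cp < 0x800) {                       // 110yyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xC0 | (cp >> 6)),
                    (byte) (0x80 | (cp & 0x3F)) };
            } else if (cp < 0x10000) {                     // 1110yyyy 10yyyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xE0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F)) };
            } else {                                       // 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xF0 | (cp >> 18)),
                    (byte) (0x80 | ((cp >> 12) & 0x3F)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F)) };
            }
        }

        public static void main(String[] args) {
            for (byte b : encode(0x1F600)) {               // U+1F600 -> F0 9F 98 80
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }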

The maximum value those 21 bits could represent is 2^(5+8+8) - 1 = 0x1FFFFF. However, the Unicode standard caps code points at U+10FFFF (the largest value UTF-16 can reach with surrogate pairs), so the highest group only ever uses 0x00 - 0x10, meaning the maximum value for any UTF-8 character (right now) is 0x10FFFF.
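
Java actually exposes that cap as a constant, if you want to sanity-check it from jshell:

    System.out.printf("%X%n", Character.MAX_CODE_POINT);   // prints 10FFFF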

See the next post for how to handle these values in Java.
