UTF-8 Encoding

In UTF-8, a single character can be 1-4 bytes long.

Each encoded character uses the first few bits of every byte as markers: the prefix on the first byte tells you how many bytes the sequence has, and every byte after that starts with 10 so it can't be mistaken for the start of a new character. There's some clever design in how the ranges line up, but the bit patterns below are all you really need in practice.

See http://en.wikipedia.org/wiki/UTF-8
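
To make those prefixes concrete, here's a minimal Java sketch (the class and method names are my own, just for illustration) that looks at a lead byte and reports how many bytes the sequence should be:

    public class Utf8LeadByte {
        // The prefix of the lead byte tells you how long the sequence is.
        static int byteLengthOf(int lead) {
            if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII, single byte
            if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: 2-byte sequence
            if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: 3-byte sequence
            if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: 4-byte sequence
            throw new IllegalArgumentException("continuation (10xxxxxx) or invalid byte");
        }

        public static void main(String[] args) {
            System.out.println(byteLengthOf(0x41)); // 'A' -> 1
            System.out.println(byteLengthOf(0xC3)); // lead byte of U+00E9 -> 2
            System.out.println(byteLengthOf(0xE2)); // lead byte of U+20AC -> 3
            System.out.println(byteLengthOf(0xF0)); // lead byte of U+1F600 -> 4
        }
    }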

U+0000 - U+007F is the simplest case: the ASCII values 0-127. All of these are encoded as a single byte, and the way you can tell is that the first bit is 0.

VALUE: 0xxxxxxx
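
For example, 'A' is U+0041 = 01000001: the high bit is 0, so UTF-8 stores it as the single byte 0x41, exactly like plain ASCII. A quick check from jshell:

    // 'A' is U+0041 = 01000001: high bit 0, so a single byte
    byte[] b = "A".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    System.out.printf("%02X%n", b[0]);   // prints 41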

U+0080 - U+07FF are the 2-byte characters. The first byte begins with 110, and the second byte begins with 10.

VALUE: 110yyyxx 10xxxxxx
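
For example, é is U+00E9, so y = 000 and x = 11101001, which packs into the two bytes 0xC3 0xA9. From jshell:

    // U+00E9: 110 000 11 | 10 101001  =  C3 A9
    byte[] b = "\u00E9".getBytes(java.nio.charset.StandardCharsets.UTF_8);  // é
    for (byte v : b) System.out.printf("%02X ", v);   // prints C3 A9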

U+0800 - U+FFFF are the 3-byte characters. The first byte begins with 1110, the second with 10, and the third again with 10.

VALUE: 1110yyyy 10yyyyxx 10xxxxxx
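
The euro sign € is U+20AC, which falls in this range: y = 00100000 and x = 10101100, giving the three bytes 0xE2 0x82 0xAC. From jshell:

    // U+20AC: 1110 0010 | 10 0000 10 | 10 101100  =  E2 82 AC
    byte[] b = "\u20AC".getBytes(java.nio.charset.StandardCharsets.UTF_8);  // €
    for (byte v : b) System.out.printf("%02X ", v);   // prints E2 82 AC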

U+10000 - U+10FFFF are the 4-byte characters. The first byte begins with 11110, and each byte thereafter begins with the bits 10.

VALUE: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx

Where x: the lowest 8 bits of the code point
Where y: the middle 8 bits
Where z: the highest 5 bits
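
Putting the four layouts together, here is a rough by-hand encoder in Java (the class and method names are mine, and it skips validation like rejecting surrogates, so treat it as a sketch; real code should just use String.getBytes or a CharsetEncoder):

    public class Utf8Encode {
        // Encode one Unicode code point by hand, following the bit layouts above.
        // Illustration only: no validation of surrogates or out-of-range values.
        static byte[] encode(int cp) {
            if (cp < 0x80) {                               // 0xxxxxxx
                return new byte[] { (byte) cp };
            } else if (cp < 0x800) {                       // 110yyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xC0 | (cp >> 6)),
                    (byte) (0x80 | (cp & 0x3F)) };
            } else if (cp < 0x10000) {                     // 1110yyyy 10yyyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xE0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F)) };
            } else {                                       // 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
                return new byte[] {
                    (byte) (0xF0 | (cp >> 18)),
                    (byte) (0x80 | ((cp >> 12) & 0x3F)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F)) };
            }
        }

        public static void main(String[] args) {
            for (byte b : encode(0x1F600)) {               // U+1F600 -> F0 9F 98 80
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }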

The maximum value those 21 bits could represent is 2^(5+8+8) - 1 = 0x1FFFFF. However, the Unicode standard caps code points at U+10FFFF (the largest value UTF-16 can reach with surrogate pairs), so the highest group only ever uses 0x00 - 0x10, meaning the maximum value for any UTF-8 character (right now) is 0x10FFFF.
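
Java actually exposes that cap as a constant, if you want to sanity-check it from jshell:

    System.out.printf("%X%n", Character.MAX_CODE_POINT);   // prints 10FFFF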

See the next post for how to handle these values in Java.
