This content is from Stack Overflow. Question asked by zeus:
I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have E2 82 AC (UTF-8), which represents the character € (U+20AC). Is there any algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know if there is a simple mathematical relation between E2 82 AC and U+20AC. So, is the mapping from UTF-8 to UTF-16 a simple mathematical function, or is it a hardcoded map?
Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.
Validating a UTF-8 byte sequence is trivial: simply check that the first byte matches one of the patterns below, and that (byte and $C0) = $80 is true for each subsequent byte in the sequence.
The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:
(byte1 and $80) = $00: 1 byte
(byte1 and $E0) = $C0: 2 bytes
(byte1 and $F0) = $E0: 3 bytes
(byte1 and $F8) = $F0: 4 bytes
anything else: error
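To make that concrete, here is a minimal sketch of both checks in Delphi-style Pascal, matching the answer's notation. The names Utf8SeqLen and IsUtf8Trail are mine, not from the answer:

// Classifies a lead byte: returns the sequence length (1..4),
// or 0 if byte1 is not a valid UTF-8 lead byte.
function Utf8SeqLen(byte1: Byte): Integer;
begin
  if (byte1 and $80) = $00 then Result := 1
  else if (byte1 and $E0) = $C0 then Result := 2
  else if (byte1 and $F0) = $E0 then Result := 3
  else if (byte1 and $F8) = $F0 then Result := 4
  else Result := 0;
end;

// Checks that a subsequent byte is a valid continuation byte ($80..$BF).
function IsUtf8Trail(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;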
There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16, as they all represent Unicode codepoints below U+10000 and thus can be represented as-is in UTF-16 using just one 16-bit code unit. No surrogates are needed, just some bit twiddling, e.g.:
1 byte:  UTF16 = UInt16(byte1 and $7F)
2 bytes: UTF16 = (UInt16(byte1 and $1F) shl 6) or UInt16(byte2 and $3F)
3 bytes: UTF16 = (UInt16(byte1 and $0F) shl 12) or (UInt16(byte2 and $3F) shl 6) or UInt16(byte3 and $3F)
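Rolled into a single routine, the three formulas might look like this. This is a sketch, assuming the sequence has already been validated as above; Utf8ToUtf16Bmp is a hypothetical name (BMP = Basic Multilingual Plane, i.e. codepoints below U+10000):

function Utf8ToUtf16Bmp(const Bytes: array of Byte): UInt16;
begin
  case Length(Bytes) of
    1: Result := UInt16(Bytes[0] and $7F);
    2: Result := (UInt16(Bytes[0] and $1F) shl 6) or
                 UInt16(Bytes[1] and $3F);
    3: Result := (UInt16(Bytes[0] and $0F) shl 12) or
                 (UInt16(Bytes[1] and $3F) shl 6) or
                 UInt16(Bytes[2] and $3F);
  else
    Result := $FFFD; // U+FFFD replacement character for invalid lengths
  end;
end;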
Converting a UTF-8 4-byte sequence to UTF-16, on the other hand, is slightly more involved, since it represents a Unicode codepoint that is U+10000 or higher and thus needs to use UTF-16 surrogates, which require some additional math to calculate, e.g.:
CP = (UInt32(byte1 and $07) shl 18) or (UInt32(byte2 and $3F) shl 12) or (UInt32(byte3 and $3F) shl 6) or UInt32(byte4 and $3F)
CP = CP - $10000
highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
lowSurrogate = $DC00 + UInt16(CP and $3FF)
UTF16 = highSurrogate, lowSurrogate
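As a sketch of the same math in routine form (the procedure name and the U+1F600 example are mine, not from the answer):

// Converts a validated 4-byte UTF-8 sequence into a UTF-16 surrogate pair.
procedure Utf8ToUtf16Pair(b1, b2, b3, b4: Byte; out hiSurrogate, loSurrogate: UInt16);
var
  CP: UInt32;
begin
  CP := (UInt32(b1 and $07) shl 18) or (UInt32(b2 and $3F) shl 12) or
        (UInt32(b3 and $3F) shl 6) or UInt32(b4 and $3F);
  CP := CP - $10000;
  hiSurrogate := $D800 + UInt16((CP shr 10) and $3FF);
  loSurrogate := $DC00 + UInt16(CP and $3FF);
end;

For example, the 4-byte sequence F0 9F 98 80 encodes U+1F600 (😀): CP works out to $1F600, CP - $10000 is $F600, and the procedure produces the surrogate pair $D83D $DE00.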
Now, with that said, let’s look at your example:
E2 82 AC
The first byte satisfies ($E2 and $F0) = $E0, the second byte satisfies ($82 and $C0) = $80, and the third byte satisfies ($AC and $C0) = $80, so this is indeed a valid UTF-8 3-byte sequence.
Plugging those byte values into the 3-byte formula, you get:
UTF16 = (UInt16($E2 and $0F) shl 12) or (UInt16($82 and $3F) shl 6) or UInt16($AC and $3F)
      = (UInt16($02) shl 12) or (UInt16($02) shl 6) or UInt16($2C)
      = $2000 or $80 or $2C
      = $20AC
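If you want to verify that mechanically, a short Free Pascal/Delphi-style program (my sketch, not part of the answer) applies the formula and prints the result:

program Utf8Demo;

uses
  SysUtils;

var
  U: UInt16;
begin
  // Apply the 3-byte formula to E2 82 AC.
  U := (UInt16($E2 and $0F) shl 12) or
       (UInt16($82 and $3F) shl 6) or
       UInt16($AC and $3F);
  WriteLn('U+', IntToHex(U, 4)); // prints: U+20AC
end.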
And indeed, Unicode codepoint U+20AC is encoded in UTF-16 as the single code unit $20AC.
This question was asked on Stack Overflow by zeus and answered by Remy Lebeau. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.