Issue

This content is from Stack Overflow. Question asked by zeus.

I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have `E2 82 AC` (UTF-8), which represents the Unicode character `€ (U+20AC)`. Is there any algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know if there is a simple mathematical relation between E2 82 AC and U+20AC. So, is the mapping from UTF-8 to UTF-16 a simple mathematical function, or is it a hardcoded map?

Solution

Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.

Validating a UTF-8 byte sequence is trivial: simply check that the first byte matches one of the patterns below, and that `(byte and $C0) = $80` is true for each subsequent byte in the sequence.

The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:

```
(byte1 and $80) = $00: 1 byte
(byte1 and $E0) = $C0: 2 bytes
(byte1 and $F0) = $E0: 3 bytes
(byte1 and $F8) = $F0: 4 bytes
anything else: error
```
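To make that concrete, here is a minimal Delphi-style sketch (the helper names are mine, not a standard API) that classifies a lead byte according to the table above and tests a continuation byte:

```
// Returns the sequence length implied by the lead byte, or 0 on error
function Utf8SequenceLength(Byte1: Byte): Integer;
begin
  if (Byte1 and $80) = $00 then Result := 1
  else if (Byte1 and $E0) = $C0 then Result := 2
  else if (Byte1 and $F0) = $E0 then Result := 3
  else if (Byte1 and $F8) = $F0 then Result := 4
  else Result := 0; // anything else: error
end;

// True if B is a valid continuation byte (10xxxxxx)
function IsUtf8Continuation(B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;
```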

There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16, as they all represent Unicode code points below `U+10000` and thus can be represented as-is in UTF-16 using just one 16-bit code unit. No surrogates are needed, just some bit twiddling, e.g.:

1 byte:

```
UTF16 = UInt16(byte1 and $7F)
```

2 bytes:

```
UTF16 = (UInt16(byte1 and $1F) shl 6)
     or UInt16(byte2 and $3F)
```

3 bytes:

```
UTF16 = (UInt16(byte1 and $0F) shl 12)
     or (UInt16(byte2 and $3F) shl 6)
     or UInt16(byte3 and $3F)
```
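As a sketch only (the function name is illustrative, and it assumes the sequence has already been validated and is 1-3 bytes long), those three formulas can be rolled into a single Delphi-style function:

```
// Decodes a validated 1-, 2-, or 3-byte UTF-8 sequence into one UTF-16 code unit
function DecodeUtf8Bmp(const B: array of Byte): UInt16;
begin
  case Length(B) of
    1: Result := UInt16(B[0] and $7F);
    2: Result := (UInt16(B[0] and $1F) shl 6) or UInt16(B[1] and $3F);
    3: Result := (UInt16(B[0] and $0F) shl 12) or
                 (UInt16(B[1] and $3F) shl 6) or
                 UInt16(B[2] and $3F);
  else
    Result := $FFFD; // wrong length: caller should use the 4-byte path or reject
  end;
end;
```

For example, `DecodeUtf8Bmp([$E2, $82, $AC])` would return `$20AC`, matching the worked example further below.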

Converting a UTF-8 4-byte sequence to UTF-16, on the other hand, is slightly more involved, since it represents a Unicode code point of `U+10000` or higher and thus needs UTF-16 surrogates, which require some additional math, e.g.:

4 bytes:

```
CP = (UInt32(byte1 and $07) shl 18)
  or (UInt32(byte2 and $3F) shl 12)
  or (UInt32(byte3 and $3F) shl 6)
  or UInt32(byte4 and $3F)
CP = CP - $10000
highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
lowSurrogate = $DC00 + UInt16(CP and $3FF)
UTF16 = highSurrogate, lowSurrogate
```
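The same surrogate math, as a hedged Delphi-style sketch (the procedure name and parameters are mine; it assumes a validated 4-byte sequence):

```
// Splits a validated 4-byte UTF-8 sequence into a UTF-16 surrogate pair
procedure DecodeUtf8Supplementary(Byte1, Byte2, Byte3, Byte4: Byte;
  out HighSurrogate, LowSurrogate: UInt16);
var
  CP: UInt32;
begin
  // Assemble the 21-bit code point from the payload bits
  CP := (UInt32(Byte1 and $07) shl 18) or
        (UInt32(Byte2 and $3F) shl 12) or
        (UInt32(Byte3 and $3F) shl 6) or
        UInt32(Byte4 and $3F);
  // Bias by $10000, then split into two 10-bit halves
  CP := CP - $10000;
  HighSurrogate := $D800 + UInt16((CP shr 10) and $3FF);
  LowSurrogate  := $DC00 + UInt16(CP and $3FF);
end;
```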

Now, with that said, let’s look at your example: `E2 82 AC`

For the first byte, `($E2 and $F0) = $E0`; for the second byte, `($82 and $C0) = $80`; and for the third byte, `($AC and $C0) = $80`. So this is indeed a valid UTF-8 3-byte sequence.

Plugging those byte values into the 3-byte formula, you get:

```
UTF16 = (UInt16($E2 and $0F) shl 12)
     or (UInt16($82 and $3F) shl 6)
     or UInt16($AC and $3F)

      = (UInt16($02) shl 12)
     or (UInt16($02) shl 6)
     or UInt16($2C)

      = $2000
     or $80
     or $2C

      = $20AC
```

And indeed, Unicode code point `U+20AC` is encoded in UTF-16 as `$20AC`.
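For completeness, here is a small self-contained console program (Delphi/Free Pascal style; the program name is mine) that applies the 3-byte formula to `E2 82 AC` and prints the result:

```
program EuroExample;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  U: UInt16;
begin
  // The 3-byte formula applied to E2 82 AC
  U := (UInt16($E2 and $0F) shl 12) or
       (UInt16($82 and $3F) shl 6) or
       UInt16($AC and $3F);
  WriteLn('U+', IntToHex(U, 4)); // prints U+20AC
end.
```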

This question was asked on Stack Overflow by zeus and answered by Remy Lebeau. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.