Understanding Unicode


Computers store and process information using numbers. In order to display this information as human-readable text, a character encoding is used to map the numbers to specific characters. Unicode is a standard for encoding, representing, and processing text in multiple languages.

How does Unicode work?

The Unicode standard assigns a code point, a number, to every character. For example, the code point for the letter A is U+0041 (hexadecimal), while the code point for the “grinning face with smiling eyes” emoji (😁) is U+1F601.

When a Unicode-compliant program needs to display a character, it looks up the character’s code point and displays the corresponding glyph. For example, the code point U+1F601 might be displayed as a grinning face emoji on one system, while on another it might be displayed as a series of squares or a question mark if that system does not have a glyph for the code point.
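To make the mapping concrete, here is a small Python sketch (Python is just one convenient way to inspect code points) that looks up the code points mentioned above using the built-in ord and chr functions:

```python
# Map characters to their Unicode code points and back again.
print(hex(ord("A")))     # 0x41    -> U+0041
print(hex(ord("😁")))    # 0x1f601 -> U+1F601, grinning face with smiling eyes
print(chr(0x1F601))      # 😁 — rendered with whatever glyph the system provides
```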

Terminology

When talking about Unicode, it’s important to understand the difference between characters, code points, code units, graphemes, and glyphs. Briefly: a character is an abstract unit of text (such as “A” or “é”); a code point is the number Unicode assigns to it; a code unit is the fixed-size chunk an encoding works in (8 bits for UTF-8, 16 bits for UTF-16, 32 bits for UTF-32); a grapheme (cluster) is what a reader perceives as a single character, which may span several code points; and a glyph is the visual shape a font draws for it.

Unicode Transformation Formats (UTF)

There are three Unicode Transformation Formats in common use: UTF-8, UTF-16, and UTF-32. (UCS-2 is an obsolete, fixed-width predecessor of UTF-16 that cannot represent code points above U+FFFF.)

Each format has its own advantages and disadvantages. We’ll go over each of them in more detail below.

UTF-8

UTF-8 is the most widely used character encoding, and as a general rule you should default to it. It uses a single byte for every ASCII character and two to four bytes for all other characters.

It has the nice property that it is backwards compatible with ASCII, as each ASCII character is encoded using a single byte. Effectively, this means that any ASCII string is also a valid UTF-8 string. UTF-8 is also very space-efficient for ASCII-heavy text such as English prose and markup, since those characters need only one byte each (for text dominated by East Asian scripts, UTF-16 can be more compact).

Its disadvantage is that it uses a variable number of bytes per character, so certain operations, such as finding the length of a string or indexing into it by character position, become more complicated.

| Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

Here are some examples:

| Character | Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| A | U+0041 | 01000001 | | | |
| € | U+20AC | 11100010 | 10000010 | 10101100 | |
| 🌟 | U+1F31F | 11110000 | 10011111 | 10001100 | 10011111 |

The number of bytes is determined by the code point of the character. A single-byte (ASCII) character always has its most significant bit set to 0. In a multi-byte sequence, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0, and every continuation byte starts with the bits 10. The remaining bits of each byte store the code point data.
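As an illustrative sketch (not a production encoder), the following Python function builds UTF-8 bytes from a code point by following the bit patterns in the table above, and checks its output against Python’s built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point to UTF-8 (illustrative sketch, no validation)."""
    if cp <= 0x7F:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

for ch in "A€🌟":
    # The sketch matches Python's built-in UTF-8 encoder for these characters.
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, ch.encode("utf-8").hex(" "))   # 41 / e2 82 ac / f0 9f 8c 9f
```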

UTF-16

UTF-16 is used by Java, JavaScript, and Windows. It is a good choice for applications that heavily use characters from the Basic Multilingual Plane (BMP), which contains most characters used in the world’s major languages.

It uses two bytes (one 16-bit code unit) for characters in the BMP, and four bytes (two code units, a surrogate pair) for everything else.

| Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 2 | 16 | U+0000 | U+FFFF | xxxxxxxx | xxxxxxxx | | |
| 4 | 20 | U+10000 | U+10FFFF | 110110xx | xxxxxxxx | 110111xx | xxxxxxxx |

Here are some examples:

| Character | Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| A | U+0041 | 00000000 | 01000001 | | |
| € | U+20AC | 00100000 | 10101100 | | |
| 🌟 | U+1F31F | 11011000 | 00111100 | 11011111 | 00011111 |

(Byte values are shown in big-endian order.)

UTF-16 is a variable-width encoding: each character is encoded as either one or two 16-bit code units. When two code units are needed, the first is called the lead (high) surrogate and the second the trail (low) surrogate.

If a code point is between U+0000 and U+FFFF, it is encoded as a single 16-bit code unit whose value is the code point itself. For example, the code point U+0041 (LATIN CAPITAL LETTER A) is encoded as 0x0041.

If a code point is between U+10000 and U+10FFFF, it is encoded as two 16-bit code units (four bytes), known as a surrogate pair. The first code unit is the high surrogate, in the range 0xD800-0xDBFF, and the second is the low surrogate, in the range 0xDC00-0xDFFF.
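As a sketch of the arithmetic (Python, illustrative only): subtract 0x10000 from the code point, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```python
def to_surrogate_pair(cp: int):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                  # 20 bits of code point data
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F31F)])   # ['0xd83c', '0xdf1f'] for 🌟
```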

To decode a UTF-16 code unit, we need to check whether it is a high surrogate or a low surrogate. If it is a high surrogate, we read the next code unit and check that it is a low surrogate. If it is, we can decode the two code units as a single code point.

To calculate the code point, we use the following formula:

codePoint = (highSurrogate - 0xD800) * 0x400 + (lowSurrogate - 0xDC00) + 0x10000

Here is an example of how to decode the UTF-16 code units 0xD83D 0xDE00: (0xD83D - 0xD800) * 0x400 + (0xDE00 - 0xDC00) + 0x10000 = 0xF400 + 0x200 + 0x10000 = 0x1F600, which is the code point of the grinning face emoji 😀.
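The same calculation as a small Python sketch, assuming the pair has already been validated:

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair back into a single code point."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

cp = from_surrogate_pair(0xD83D, 0xDE00)
print(hex(cp), chr(cp))    # 0x1f600 😀
```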

UTF-32

UTF-32 uses four bytes for every character. It is not as widely used as the other two formats. It’s a good choice if you need to support all Unicode characters with constant-time random access to individual code points. Its disadvantage is that it is not as space-efficient as the other encodings.

| Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 4 | 21 | U+0000 | U+10FFFF | 00000000 | 000xxxxx | xxxxxxxx | xxxxxxxx |

Here are some examples:

| Character | Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| A | U+0041 | 00000000 | 00000000 | 00000000 | 01000001 |
| € | U+20AC | 00000000 | 00000000 | 00100000 | 10101100 |
| 🌟 | U+1F31F | 00000000 | 00000001 | 11110011 | 00011111 |

Because every character uses exactly four bytes, no length calculation is needed: the 32-bit value is simply the code point itself, padded with leading zero bits.
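You can see this directly in Python by encoding with an explicit byte order (utf-32-be below, so the bytes print in the same order as the table above):

```python
# UTF-32 big-endian: each character is its code point, zero-padded to 4 bytes.
for ch in "A€🌟":
    print(ch, ch.encode("utf-32-be").hex(" "))
# A  00 00 00 41
# €  00 00 20 ac
# 🌟 00 01 f3 1f
```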

Unicode Normalization

One issue you may encounter when working with Unicode text is that there can be multiple ways to represent the same character. For example, “é” can be the single code point U+00E9 or the sequence U+0065 U+0301 (an “e” followed by a combining acute accent).

In order to compare or process text, we need to normalize it so that these different representations are treated as equivalent. Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence.

Unicode defines four normalization forms: NFC, NFD, NFKC, and NFKD.

Which normalization form you choose depends on your needs. For example, if you need to compare two pieces of text for equality, you would typically use NFC (or NFD); if you also want compatibility variants such as ligatures or full-width forms to compare equal, you would use NFKC or NFKD.
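Here is a short sketch using Python’s standard-library unicodedata module, showing the two representations of “é” from above being made equal by normalization, and NFKC folding a compatibility character:

```python
import unicodedata

composed = "\u00e9"          # é as a single code point
decomposed = "e\u0301"       # e + COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Compatibility normalization also folds "compatible" variants together,
# e.g. the ligature ﬁ (U+FB01) becomes the two letters "fi" under NFKC.
print(unicodedata.normalize("NFKC", "\ufb01"))               # fi
```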

String Length

Calculating the length of a string is one of the more confusing aspects of Unicode. As discussed above, the answer depends on what you count: bytes, code units, code points, or graphemes. Depending on the terminology and context, the length of the same string changes.

The length of a string in code units depends on the encoding used. For instance, the emoji “🤦🏼‍♂️” is 17 code units (bytes) long in UTF-8, 7 code units long in UTF-16, and 5 code units long in UTF-32, even though it is only 5 code points and a single grapheme.

The length of a string in graphemes also depends on the version of the Unicode standard used, as the standard continually refines what constitutes a grapheme cluster.
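These differences are easy to reproduce in Python, where len() on a str counts code points; the other counts are derived from the encoded byte lengths (counting graphemes needs a Unicode segmentation library, which is only noted in a comment):

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # 🤦🏼‍♂️ built from its 5 code points

print(len(s))                              # 5  code points
print(len(s.encode("utf-8")))              # 17 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)     # 7  UTF-16 code units
print(len(s.encode("utf-32-le")) // 4)     # 5  UTF-32 code units

# Counting graphemes requires a Unicode segmentation library; for example, the
# third-party "regex" module's \X pattern reports 1 grapheme cluster here.
```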
