Understanding Unicode

Computers store and process information using numbers. In order to display this information as human-readable text, a character encoding is used to map the numbers to specific characters. Unicode is a standard for encoding, representing, and processing text in multiple languages.

How does Unicode work?

The Unicode standard assigns a code point, or number, to every character. For example, the code point for the letter A is U+0041 (hexadecimal), while the code point for the "grinning face with smiling eyes" emoji (😁) is U+1F601.

When a unicode-compliant program needs to display a character, it looks up the character's code point and displays the corresponding glyph. For example, the code point U+1F601 might be displayed as a grinning face emoji on one system, while on another system it might be displayed as a series of squares or a question mark if the system does not have a glyph for that code point.
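
For instance, here is a minimal Python 3 sketch of moving between characters and their code points using the built-in ord() and chr() functions:

  # Convert between characters and their Unicode code points (Python 3).
  print(hex(ord("A")))    # 0x41 -> LATIN CAPITAL LETTER A is U+0041
  print(chr(0x1F601))     # 😁   -> GRINNING FACE WITH SMILING EYES
  print("\U0001F601")     # the same emoji, written as an escape sequence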

Terminology

When talking about Unicode, it's important to understand the difference between characters, code points, code units, graphemes, and glyphs.

  • A character is an abstract unit of text. It is an overloaded term that can mean different things in different contexts.
  • A code point is a number that uniquely identifies a character. For example, the code point for the letter 'A' is U+0041. Code points are written using the "U+" notation.
  • A code unit is the smallest logical piece of a character encoding. In UTF-8, code units are bytes. In UTF-16, code units are 16-bit words.
  • A grapheme, or grapheme cluster, is the smallest visually distinct unit of text: what a reader perceives as a single character. For example, "a", "á", and "🌟" are all graphemes. A grapheme can consist of multiple code points and can have multiple representations. For example, "á" can be encoded as two code points, U+0061 U+0301 (LATIN SMALL LETTER A + COMBINING ACUTE ACCENT), or as a single code point, U+00E1 (LATIN SMALL LETTER A WITH ACUTE); see the sketch after this list.
  • A glyph is a specific graphical representation of a grapheme. The same grapheme can be represented by different glyphs on different systems, which explains why emojis can look different on different devices.
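
To make the distinction between graphemes and code points concrete, here is a minimal Python 3 sketch showing that two strings can render as the same grapheme while containing a different number of code points:

  # "á" written as one precomposed code point vs. a base letter plus a combining accent.
  precomposed = "\u00E1"          # LATIN SMALL LETTER A WITH ACUTE
  decomposed  = "\u0061\u0301"    # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

  print(precomposed, decomposed)    # both render as the grapheme "á"
  print(len(precomposed))           # 1 code point
  print(len(decomposed))            # 2 code points
  print(precomposed == decomposed)  # False: same grapheme, different code points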

Unicode Transformation Formats (UTF)

The Unicode standard defines three transformation formats: UTF-8, UTF-16, and UTF-32. A fourth encoding, UCS-2, predates them and still turns up in legacy systems.

  • UTF-8 is the most common character encoding. It uses one byte for ASCII characters, and up to four bytes for other characters.
  • UTF-16 is used by Java, JavaScript, and Windows. It uses two bytes for most characters, and four bytes for others.
  • UTF-32 is not widely used. It uses four bytes for all characters.
  • UCS-2 is a legacy encoding that is no longer recommended. It uses two bytes for all characters and therefore cannot represent code points above U+FFFF.

Each format has its own advantages and disadvantages. We'll go over the first three in more detail below.

UTF-8

UTF-8 is the most common character encoding. As a general rule, you should default to UTF-8. It uses one byte for ASCII characters and up to four bytes for other characters.

It has the nice property of being backwards compatible with ASCII: each ASCII character is encoded as a single byte, so any valid ASCII string is also a valid UTF-8 string. UTF-8 is also space-efficient for most text, since it uses the minimum number of bytes each code point requires and encodes the very common ASCII range in a single byte.

Its disadvantage is that it uses a variable number of bytes per character, so certain operations, such as finding the length of a string or indexing into it, become more complicated.
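
In Python, for example, strings are indexed by code point while their encoded form is a sequence of bytes, and the two stop lining up as soon as non-ASCII characters appear. A minimal sketch, assuming Python 3:

  text = "€5"                  # EURO SIGN followed by the digit 5
  data = text.encode("utf-8")  # b'\xe2\x82\xac5'

  print(len(text))             # 2 code points
  print(len(data))             # 4 bytes: the euro sign alone takes three
  print(data[0])               # 226 -- indexing the bytes lands inside the euro sign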

Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
1         | 7               | U+0000          | U+007F          | 0xxxxxxx |          |          |
2         | 11              | U+0080          | U+07FF          | 110xxxxx | 10xxxxxx |          |
3         | 16              | U+0800          | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
4         | 21              | U+10000         | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

Here are some examples:

Character | Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
A         | U+0041     | 01000001 |          |          |
€         | U+20AC     | 11100010 | 10000010 | 10101100 |
🌟        | U+1F31F    | 11110000 | 10011111 | 10001100 | 10011111

The number of bytes is determined by the code point of the character. A single-byte character always has its most significant bit set to 0. For a multi-byte character, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0, and every continuation byte starts with the bits 10. The remaining bits of each byte store the code point data.
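
The bit layout above can be reproduced in a few lines of code. The following Python sketch is illustrative rather than production-ready (the function name utf8_encode is made up for this example, and it skips validation such as rejecting surrogate code points); its output is checked against Python's built-in UTF-8 codec:

  def utf8_encode(code_point: int) -> bytes:
      """Encode a single code point to UTF-8, following the table above."""
      if code_point <= 0x7F:         # 1 byte:  0xxxxxxx
          return bytes([code_point])
      elif code_point <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
          return bytes([0xC0 | (code_point >> 6),
                        0x80 | (code_point & 0x3F)])
      elif code_point <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
          return bytes([0xE0 | (code_point >> 12),
                        0x80 | ((code_point >> 6) & 0x3F),
                        0x80 | (code_point & 0x3F)])
      else:                          # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          return bytes([0xF0 | (code_point >> 18),
                        0x80 | ((code_point >> 12) & 0x3F),
                        0x80 | ((code_point >> 6) & 0x3F),
                        0x80 | (code_point & 0x3F)])

  for ch in "A€🌟":
      assert utf8_encode(ord(ch)) == ch.encode("utf-8")
      print(ch, hex(ord(ch)), utf8_encode(ord(ch)).hex())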

UTF-16

UTF-16 is used by Java, JavaScript, and Windows. It is a good choice for applications that heavily use characters from the Basic Multilingual Plane (BMP), which contains most characters used in the world's major languages.

It uses two bytes for most characters, and four bytes for others.

Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
2         | 16              | U+0000          | U+FFFF          | xxxxxxxx | xxxxxxxx |          |
4         | 20              | U+10000         | U+10FFFF        | 110110xx | xxxxxxxx | 110111xx | xxxxxxxx

(The four-byte form stores the value of the code point minus 0x10000 in its 20 free bits; code points in the surrogate range U+D800–U+DFFF are reserved and never encoded on their own.)

Here are some examples (byte values shown in big-endian order):

Character | Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
A         | U+0041     | 00000000 | 01000001 |          |
€         | U+20AC     | 00100000 | 10101100 |          |
🌟        | U+1F31F    | 11011000 | 00111100 | 11011111 | 00011111

UTF-16 is a variable-width encoding: a character is stored as either one or two 16-bit code units. When two code units are needed, the first is called the lead (or high) surrogate and the second is called the trail (or low) surrogate.

If a code point is between U+0000 and U+FFFF, it is encoded directly as a single 16-bit code unit whose value is the code point itself. For example, the code point U+0041 (LATIN CAPITAL LETTER A) is encoded as 0x0041.

If a code point is between U+10000 and U+10FFFF, it is encoded as two 16-bit code units (four bytes). The first code unit is the high surrogate and the second is the low surrogate. The high surrogate is in the range 0xD800-0xDBFF, while the low surrogate is in the range 0xDC00-0xDFFF.

To decode a UTF-16 code unit, we first check whether it is a surrogate. If it is a high surrogate, we read the next code unit and check that it is a low surrogate. If it is, the two code units are decoded together as a single code point.

To calculate the code point, we use the following formula:

codePoint = (highSurrogate - 0xD800) * 0x400 + (lowSurrogate - 0xDC00) + 0x10000

Here is an example of how to decode the UTF-16 code units 0xD83D 0xDE00:

  • The first code unit, 0xD83D, is in the range 0xD800-0xDBFF, so we know it is a high surrogate.
  • The second code unit, 0xDE00, is in the range 0xDC00-0xDFFF, so we know it is a low surrogate.
  • Because the high surrogate is immediately followed by a low surrogate, the two code units form a valid surrogate pair.
  • We can calculate the code point using the formula, which gets us 0x1F600.
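
Putting the range checks and the formula together, here is a small Python sketch of surrogate-pair decoding (the helper name decode_surrogate_pair is invented for this example); it reproduces the walkthrough above:

  def decode_surrogate_pair(high: int, low: int) -> int:
      """Combine a high and a low surrogate into a single code point."""
      assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
      assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
      return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

  code_point = decode_surrogate_pair(0xD83D, 0xDE00)
  print(hex(code_point))  # 0x1f600
  print(chr(code_point))  # 😀 (GRINNING FACE)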

UTF-32

UTF-32 uses four bytes for every character. It is not as widely used as the other two formats. It's a good choice if you need constant-time random access to individual code points. Its disadvantage is that it is not as space-efficient as the other encodings.

Num Bytes | Code Point Bits | Min. Code Point | Max. Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
4         | 21              | U+0000          | U+10FFFF        | 00000000 | 000xxxxx | xxxxxxxx | xxxxxxxx

Here are some examples (byte values shown in big-endian order):

Character | Code Point | Byte 1   | Byte 2   | Byte 3   | Byte 4
A         | U+0041     | 00000000 | 00000000 | 00000000 | 01000001
€         | U+20AC     | 00000000 | 00000000 | 00100000 | 10101100
🌟        | U+1F31F    | 00000000 | 00000001 | 11110011 | 00011111

Every character occupies exactly four bytes: the code point is simply stored as a 32-bit integer. This is what makes indexing by code point straightforward, at the cost of wasted space for the many characters whose code points would fit in one or two bytes.
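
Because each character is just a 32-bit integer, UTF-32 encoding amounts to writing that integer out byte by byte. A minimal Python sketch, assuming big-endian byte order (UTF-32BE):

  import struct

  for ch in "A€🌟":
      # UTF-32BE is the code point written as a 4-byte big-endian integer.
      packed = struct.pack(">I", ord(ch))
      assert packed == ch.encode("utf-32-be")
      print(ch, f"U+{ord(ch):04X}", packed.hex())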

Unicode Normalization

One issue you may encounter when working with Unicode text is that there can be multiple ways to represent the same character.

In order to compare or process text, we need to normalize it so that these different representations are treated as equivalent. Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence.

  • Compatibility equivalence is when two sequences represent the same character but are not necessarily visually identical. For example, "ﬀ" (LATIN SMALL LIGATURE FF) is compatibility-equivalent to "ff" (LATIN SMALL LETTER F + LATIN SMALL LETTER F) even though they do not look exactly the same.
  • Canonical equivalence is when two sequences represent the same character and look the same when displayed. For example, "á" can be represented as two code points, U+0061 U+0301 (LATIN SMALL LETTER A + COMBINING ACUTE ACCENT), or as a single code point, U+00E1 (LATIN SMALL LETTER A WITH ACUTE).

Unicode defines four normalization forms: NFC, NFD, NFKC, and NFKD.

  • NFC (Canonical Composition) combines code points into a single code point if possible. For example, 'á' would be represented as U+00E1 (LATIN SMALL LETTER A WITH ACUTE).
  • NFD (Canonical Decomposition) splits code points into multiple code points if necessary. For example, 'á' would be represented as U+0061 U+0301 (LATIN SMALL LETTER A + COMBINING ACUTE ACCENT).
  • NFKC (Compatibility Composition) applies compatibility decomposition followed by canonical composition. For example, "ﬀ" (LATIN SMALL LIGATURE FF) is normalized to "ff" (LATIN SMALL LETTER F + LATIN SMALL LETTER F).
  • NFKD (Compatibility Decomposition) is similar to NFD but also applies compatibility decomposition.

Which normalization form you choose depends on your needs. For example, to compare two pieces of text for canonical equality, you would normalize both to NFC (or NFD). To treat compatibility variants such as ligatures as equivalent, you would use NFKC or NFKD.
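
In Python, the standard library's unicodedata.normalize() function implements all four forms. A small sketch of the equivalences described above:

  import unicodedata

  composed   = "\u00E1"        # á as a single code point
  decomposed = "\u0061\u0301"  # a + combining acute accent
  ligature   = "\uFB00"        # ﬀ, LATIN SMALL LIGATURE FF

  print(composed == decomposed)                                # False
  print(unicodedata.normalize("NFC", decomposed) == composed)  # True
  print(unicodedata.normalize("NFD", composed) == decomposed)  # True
  print(unicodedata.normalize("NFKC", ligature))               # ff (two separate letters)
  print(unicodedata.normalize("NFC", ligature) == ligature)    # True: NFC keeps the ligature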

String Length

Calculating the length of a string is one of the more confusing aspects of Unicode. As we have discussed, a string can be measured in several different units: bytes, code units, code points, graphemes, and so on. Depending on which unit you count, the length of the same string differs.

The length of a string in code units varies with the encoding (and can also change if normalization is applied). For instance, the emoji "🤦🏼‍♂️" is 17 code units (bytes) long in UTF-8, 7 code units long in UTF-16, and 5 code units long in UTF-32, which is also its length in code points.

The length of a string in graphemes also depends on the version of the Unicode standard in use, as the standard continues to refine what constitutes a grapheme cluster.
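
These differences are easy to observe in Python: len() counts code points, the encoded forms give byte and code-unit counts, and counting grapheme clusters requires a library that implements Unicode text segmentation (the third-party regex module is used here as one option):

  import regex  # third-party; supports \X for extended grapheme clusters

  s = "🤦🏼‍♂️"  # FACE PALM + skin tone modifier + ZWJ + MALE SIGN + variation selector

  print(len(s))                           # 5  code points
  print(len(s.encode("utf-8")))           # 17 UTF-8 code units (bytes)
  print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units
  print(len(s.encode("utf-32-le")) // 4)  # 5  UTF-32 code units
  print(len(regex.findall(r"\X", s)))     # 1  grapheme cluster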

References

https://hsivonen.fi/string-length/

https://www.unicode.org/reports/