Understanding Unicode Counters

Overview

Text can be counted at different levels of Unicode representation. Each counter type measures a different aspect of how text is stored and processed:

Counter Type    What It Measures            Example
Graphemes       User-perceived characters   "👋" = 1
Code Points     Unicode characters          "é" = 1 or 2 (depends on form)
UTF-16 Units    JavaScript string length    "👋" = 2
UTF-8 Bytes     Storage size                "👋" = 4

Grapheme Clusters

A grapheme cluster (or "extended grapheme cluster" per Unicode UAX #29) is what users perceive as a single character. One grapheme may consist of multiple code points.

Examples:

  • "👋" = 1 grapheme (one emoji)
  • "é" = 1 grapheme (whether composed as U+00E9 or decomposed as e + ◌́)
  • "👨‍👩‍👧" = 1 grapheme (family emoji: 3 people joined by zero-width joiners)
  • "🇸🇪" = 1 grapheme (flag emoji: 2 regional indicator code points)

Grapheme clusters are counted using the Intl.Segmenter API with granularity: 'grapheme', which correctly handles emojis, combining marks, and complex scripts like Thai or Devanagari.
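
The approach described above can be sketched as follows. This is a minimal illustration, assuming a runtime that provides Intl.Segmenter (recent browsers, Node.js 16 or later). The helper name countGraphemes is hypothetical, and exact cluster boundaries depend on the Unicode version the runtime implements.

```javascript
// Hypothetical helper: counts extended grapheme clusters in a string.
function countGraphemes(text) {
  // The locale is left undefined so the runtime default is used.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// Escape sequences keep the sample strings unambiguous in any editor.
const wave = "\u{1F44B}";                                  // waving hand
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";  // man + ZWJ + woman + ZWJ + girl
const flag = "\u{1F1F8}\u{1F1EA}";                         // regional indicators S + E
const decomposedE = "e\u0301";                             // e + combining acute

console.log(
  countGraphemes(wave),
  countGraphemes(family),
  countGraphemes(flag),
  countGraphemes(decomposedE)
);
```

Each of the four sample strings should report a single grapheme, matching the list above.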

Unicode Code Points

A code point is a unique number assigned to each character in the Unicode standard, written as U+XXXX in hexadecimal. The Unicode codespace ranges from U+0000 to U+10FFFF.

Examples:

  • "A" = 1 code point (U+0041)
  • "é" (composed) = 1 code point (U+00E9)
  • "é" (decomposed) = 2 code points (U+0065 + U+0301)
  • "👋" = 1 code point (U+1F44B)
  • "👨‍👩‍👧" = 5 code points (3 emoji + 2 zero-width joiners)

Code points are counted using Array.from(str).length. The string iterator steps through the text one code point at a time, so a surrogate pair is counted once rather than twice.
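
A short sketch of code point counting is shown below. The helper names countCodePoints and listCodePoints are illustrative, not part of any API. Note that an unpaired surrogate in a malformed string is yielded as its own element, so it still counts as one.

```javascript
// Hypothetical helper: counts Unicode code points rather than UTF-16 units.
function countCodePoints(text) {
  // Array.from uses the string iterator, which yields one element per
  // code point, so a surrogate pair becomes a single array entry.
  return Array.from(text).length;
}

// Hypothetical helper: lists code points in U+XXXX notation for inspection.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(countCodePoints("A"));                 // 1
console.log(listCodePoints("e\u0301"));            // [ 'U+0065', 'U+0301' ]
console.log(listCodePoints("\u{1F44B}"));          // [ 'U+1F44B' ]
console.log(countCodePoints("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}")); // 5
```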

UTF-16 Code Units

UTF-16 is a variable-width encoding using 16-bit code units. Code points in the Basic Multilingual Plane (U+0000–U+FFFF) use one unit, while code points above U+FFFF require a surrogate pair (two units).
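
To make the surrogate pair rule concrete, the sketch below splits a code point above U+FFFF into its two code units using the standard UTF-16 arithmetic: subtract 0x10000, then place the upper 10 bits in the high surrogate and the lower 10 bits in the low surrogate. The helper name toSurrogatePair is hypothetical.

```javascript
// Hypothetical helper: splits a supplementary code point into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is outside U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F44B);
console.log(high.toString(16), low.toString(16)); // d83d dc4b
console.log(String.fromCharCode(high, low) === "\u{1F44B}"); // true
```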

Examples:

  • "A" (U+0041) = 1 UTF-16 unit
  • "中" (U+4E2D) = 1 UTF-16 unit
  • "👋" (U+1F44B) = 2 UTF-16 units (surrogate pair)
  • "Hello" = 5 UTF-16 units

This is what str.length returns in JavaScript. Many systems use UTF-16 internally (JavaScript, Java, Windows APIs), so this count matters wherever a buffer size or API limit is expressed in UTF-16 code units.
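
The sketch below checks the examples above against str.length and shows one way a limit expressed in UTF-16 units might be enforced. The helper fitsUtf16Limit is a hypothetical name used only for illustration.

```javascript
// Each entry pairs a sample string with its expected UTF-16 length.
const samples = [
  ["A", 1],          // U+0041, in the BMP
  ["\u4E2D", 1],     // U+4E2D, in the BMP
  ["\u{1F44B}", 2],  // U+1F44B, needs a surrogate pair
  ["Hello", 5],
];

for (const [text, expected] of samples) {
  console.log(JSON.stringify(text), text.length, text.length === expected);
}

// Hypothetical helper: checks text against a limit expressed in UTF-16 units.
function fitsUtf16Limit(text, maxUnits) {
  return text.length <= maxUnits;
}

console.log(fitsUtf16Limit("Hello \u{1F44B}", 8)); // true: 6 BMP units + 2 for the emoji
console.log(fitsUtf16Limit("Hello \u{1F44B}", 7)); // false
```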

UTF-8 Bytes

UTF-8 is a variable-width encoding where each code point uses 1 to 4 bytes:

  • U+0000–U+007F (ASCII): 1 byte
  • U+0080–U+07FF: 2 bytes
  • U+0800–U+FFFF: 3 bytes
  • U+10000–U+10FFFF: 4 bytes
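
The ranges above translate directly into a lookup by code point value. The following is a sketch only, with hypothetical helper names, and it assumes well-formed strings; it does not produce encoded bytes, it only sums their lengths.

```javascript
// Hypothetical helper: UTF-8 byte count for a single code point, by range.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint <= 0x7F) return 1;    // U+0000..U+007F (ASCII)
  if (codePoint <= 0x7FF) return 2;   // U+0080..U+07FF
  if (codePoint <= 0xFFFF) return 3;  // U+0800..U+FFFF
  return 4;                           // U+10000..U+10FFFF
}

// Hypothetical helper: sums the per-code-point sizes over a whole string.
function utf8ByteLength(text) {
  let total = 0;
  for (const ch of text) {            // iterates by code point
    total += utf8BytesForCodePoint(ch.codePointAt(0));
  }
  return total;
}

console.log(utf8ByteLength("A"));          // 1
console.log(utf8ByteLength("\u00E9"));     // 2
console.log(utf8ByteLength("\u4E2D"));     // 3
console.log(utf8ByteLength("\u{1F44B}"));  // 4
```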

Examples:

  • "A" (U+0041) = 1 UTF-8 byte
  • "é" (U+00E9) = 2 UTF-8 bytes
  • "中" (U+4E2D) = 3 UTF-8 bytes
  • "👋" (U+1F44B) = 4 UTF-8 bytes

UTF-8 is the dominant encoding for the web, files, and network protocols. This count shows the storage or transfer size of your text when it is encoded as UTF-8.
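
One common way to obtain this count in JavaScript is to encode the string with TextEncoder, which always produces UTF-8, and read the length of the resulting byte array. The helper name countUtf8Bytes below is hypothetical, and this is a sketch rather than a description of how any particular tool computes the value.

```javascript
// Hypothetical helper: UTF-8 size of a string in bytes.
function countUtf8Bytes(text) {
  // TextEncoder.encode returns a Uint8Array holding the UTF-8 bytes.
  return new TextEncoder().encode(text).length;
}

console.log(countUtf8Bytes("A"));          // 1
console.log(countUtf8Bytes("\u00E9"));     // 2
console.log(countUtf8Bytes("\u4E2D"));     // 3
console.log(countUtf8Bytes("\u{1F44B}"));  // 4
```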

Normalization (NFC)

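Unicode normalization converts text to a standard form. NFC (Normalization Form C) applies canonical decomposition followed by canonical composition, so a base character and its combining marks are replaced by a single precomposed character where one exists.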

Example:

  • "é" (decomposed: e + ◌́) → "é" (composed: é)
  • Before normalization: 2 code points, 3 UTF-8 bytes
  • After normalization: 1 code point, 2 UTF-8 bytes

When the "Normalize to NFC" option is enabled, all counts are calculated on the normalized text. This can reduce code point and byte counts for text with combining marks.
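
The effect of NFC on counts can be reproduced with the built-in String.prototype.normalize method. The sketch below uses escape sequences so the decomposed and composed forms stay distinguishable in source; the variable names are illustrative.

```javascript
const decomposed = "e\u0301";                 // e + combining acute accent
const composed = decomposed.normalize("NFC"); // canonical composition

const encoder = new TextEncoder();

// Before normalization: 2 code points, 3 UTF-8 bytes.
console.log(Array.from(decomposed).length, encoder.encode(decomposed).length);

// After normalization: 1 code point, 2 UTF-8 bytes.
console.log(Array.from(composed).length, encoder.encode(composed).length);

// The result is the precomposed character U+00E9.
console.log(composed === "\u00E9"); // true
```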

Real-World Examples

Example 1: Simple text

Text: "Hello"

  • Graphemes: 5
  • Code Points: 5
  • UTF-16 Units: 5
  • UTF-8 Bytes: 5

Example 2: With emoji

Text: "Hello 👋"

  • Graphemes: 7 (H-e-l-l-o-space-👋)
  • Code Points: 7
  • UTF-16 Units: 8 (emoji uses surrogate pair)
  • UTF-8 Bytes: 10 (emoji uses 4 bytes)

Example 3: Flag emoji

Text: "🇸🇪" (Swedish flag)

  • Graphemes: 1 (one visible flag)
  • Code Points: 2 (Regional Indicator S + Regional Indicator E)
  • UTF-16 Units: 4 (each regional indicator needs surrogate pair)
  • UTF-8 Bytes: 8 (each regional indicator uses 4 bytes)

Example 4: With combining marks

Text: "café" (with decomposed é as e + combining acute)

  • Graphemes: 4 (c-a-f-é)
  • Code Points: 5 (c + a + f + e + ◌́)
  • UTF-16 Units: 5
  • UTF-8 Bytes: 6 (combining acute uses 2 bytes)

After NFC normalization:

  • Graphemes: 4
  • Code Points: 4 (é becomes single code point U+00E9)
  • UTF-16 Units: 4
  • UTF-8 Bytes: 5
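
The four examples above can be reproduced with a single function that gathers every counter. This is a sketch under stated assumptions: the name countAll and its normalize option are hypothetical, and the runtime is assumed to provide Intl.Segmenter and TextEncoder.

```javascript
// Hypothetical helper: returns all four counts for a string.
function countAll(text, { normalize = false } = {}) {
  // Optionally count on the NFC-normalized text instead of the raw input.
  const input = normalize ? text.normalize("NFC") : text;
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    graphemes: Array.from(segmenter.segment(input)).length,
    codePoints: Array.from(input).length,
    utf16Units: input.length,
    utf8Bytes: new TextEncoder().encode(input).length,
  };
}

const cafe = "cafe\u0301"; // "café" with a decomposed final letter

console.log(countAll("Hello"));
console.log(countAll("Hello \u{1F44B}"));      // 7 graphemes, 7 code points, 8 units, 10 bytes
console.log(countAll("\u{1F1F8}\u{1F1EA}"));   // 1 grapheme, 2 code points, 4 units, 8 bytes
console.log(countAll(cafe));                   // 4 graphemes, 5 code points, 5 units, 6 bytes
console.log(countAll(cafe, { normalize: true })); // 4 graphemes, 4 code points, 4 units, 5 bytes
```

Creating the segmenter inside the function keeps the sketch self-contained; code that counts many strings would more likely create it once and reuse it.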