Text can be counted at different levels of Unicode representation. Each counter type measures a different aspect of how text is stored and processed:
| Counter Type | What It Measures | Example |
|---|---|---|
| Graphemes | User-perceived characters | "😀" = 1 |
| Code Points | Unicode characters | "é" = 1 or 2 (depends on form) |
| UTF-16 Units | JavaScript string length | "😀" = 2 |
| UTF-8 Bytes | Storage size | "😀" = 4 |
A grapheme cluster (or "extended grapheme cluster" per Unicode UAX #29) is what users perceive as a single character. One grapheme may consist of multiple code points.
Examples:
Grapheme clusters are counted using the Intl.Segmenter API with granularity: 'grapheme', which correctly handles emojis, combining marks, and complex scripts like Thai or Devanagari.
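As a minimal sketch of grapheme counting with Intl.Segmenter (the function name `countGraphemes` is hypothetical, and passing `undefined` as the locale is an assumption that simply selects the runtime default):

```javascript
// Count user-perceived characters (extended grapheme clusters).
// `countGraphemes` is an illustrative name, not part of any API.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster.
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes('Hello'));              // 5
console.log(countGraphemes('e\u0301'));            // 1 (e + combining acute accent)
console.log(countGraphemes('\u{1F1F8}\u{1F1EA}')); // 1 (two regional indicators, one flag)
```

The escapes are used instead of literal characters so the input stays unambiguous regardless of how the source file is encoded.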
A code point is a unique number assigned to each character in the Unicode standard, written as U+XXXX in hexadecimal. The Unicode codespace ranges from U+0000 to U+10FFFF.
Examples:
Code points are counted using Array.from(str).length, which correctly iterates over surrogate pairs.
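A short sketch of code point counting, with an extra helper that prints each code point in U+XXXX notation (both function names are hypothetical):

```javascript
// Count Unicode code points. Array.from uses the string iterator,
// which yields one element per code point rather than per UTF-16 unit.
function countCodePoints(text) {
  return Array.from(text).length;
}

// List the code points of a string in U+XXXX notation (illustrative helper).
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(countCodePoints('\u00E9'));   // 1 (precomposed é)
console.log(countCodePoints('e\u0301'));  // 2 (e + combining acute accent)
console.log(listCodePoints('e\u0301'));   // [ 'U+0065', 'U+0301' ]
console.log(listCodePoints('\u{1F600}')); // [ 'U+1F600' ]
```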
UTF-16 is a variable-width encoding using 16-bit code units. Code points in the Basic Multilingual Plane (U+0000–U+FFFF) use one unit, while code points above U+FFFF require a surrogate pair (two units).
Examples:
This is what str.length returns in JavaScript. Many systems use UTF-16 internally (JavaScript, Java, Windows APIs), so this count is important for buffer sizes and API limits.
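The following sketch shows the UTF-16 count and, for illustration, how a code point above U+FFFF splits into a surrogate pair. The names `countUtf16Units` and `toSurrogatePair` are hypothetical helpers, not built-in functions:

```javascript
// UTF-16 code unit count: exactly what str.length reports.
function countUtf16Units(text) {
  return text.length;
}

// Split a supplementary code point (above U+FFFF) into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

console.log(countUtf16Units('A'));         // 1
console.log(countUtf16Units('\u{1F600}')); // 2
console.log(toSurrogatePair(0x1F600).map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
```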
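UTF-8 is a variable-width encoding where each code point uses 1 to 4 bytes: 1 byte for U+0000–U+007F, 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for U+10000–U+10FFFF.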
Examples:
UTF-8 is the dominant encoding for the web, files, and network protocols. This count shows the actual storage/transfer size of your text.
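One way to obtain the UTF-8 byte count in JavaScript is the standard TextEncoder API, which always encodes to UTF-8. This is a sketch under that assumption; the source does not say which method the counter itself uses, and `countUtf8Bytes` is a hypothetical name:

```javascript
// Count the bytes a string occupies when encoded as UTF-8.
// Note: TextEncoder replaces lone surrogates with U+FFFD (3 bytes).
function countUtf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

console.log(countUtf8Bytes('A'));         // 1 (U+0041)
console.log(countUtf8Bytes('\u00E9'));    // 2 (U+00E9, é)
console.log(countUtf8Bytes('\u20AC'));    // 3 (U+20AC, euro sign)
console.log(countUtf8Bytes('\u{1F600}')); // 4 (U+1F600)
```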
Unicode normalization converts text to a standard form. NFC (Canonical Composition) combines characters when possible.
Example:
When the "Normalize to NFC" option is enabled, all counts are calculated on the normalized text. This can reduce code point and byte counts for text with combining marks.
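The effect of NFC on the counts can be sketched with String.prototype.normalize. The function `applyNormalization` and its `normalizeToNfc` flag are hypothetical stand-ins for the "Normalize to NFC" option:

```javascript
// Return the text the counters should operate on.
// `normalizeToNfc` is an illustrative flag mirroring the option described above.
function applyNormalization(text, normalizeToNfc) {
  return normalizeToNfc ? text.normalize('NFC') : text;
}

const decomposed = 'cafe\u0301'; // "café" with e + combining acute accent
const composed = applyNormalization(decomposed, true);

console.log(composed === 'caf\u00E9');                      // true
console.log(Array.from(decomposed).length);                 // 5 code points
console.log(Array.from(composed).length);                   // 4 code points
console.log(new TextEncoder().encode(decomposed).length);   // 6 bytes
console.log(new TextEncoder().encode(composed).length);     // 5 bytes
```

The grapheme count is 4 in both forms, since normalization changes how the characters are stored rather than how many the reader sees.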
Example 1: Simple text
Text: "Hello"
Graphemes: 5, Code Points: 5, UTF-16 Units: 5, UTF-8 Bytes: 5
Example 2: With emoji
Text: "Hello 😀"
Graphemes: 7, Code Points: 7, UTF-16 Units: 8, UTF-8 Bytes: 10
Example 3: Flag emoji
Text: "🇸🇪" (Swedish flag)
Graphemes: 1, Code Points: 2, UTF-16 Units: 4, UTF-8 Bytes: 8
Example 4: With combining marks
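Text: "café" (with decomposed é as e + combining acute)
Graphemes: 4, Code Points: 5, UTF-16 Units: 5, UTF-8 Bytes: 6
After NFC normalization:
Graphemes: 4, Code Points: 4, UTF-16 Units: 4, UTF-8 Bytes: 5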
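The four examples can be reproduced with a small helper that computes every counter at once. This is a sketch, not the counter's actual implementation: `countAll` and its `normalizeToNfc` option are hypothetical names, and TextEncoder is an assumed choice for the byte count.

```javascript
// Compute all four counts for a string, optionally after NFC normalization.
function countAll(text, options = {}) {
  const input = options.normalizeToNfc ? text.normalize('NFC') : text;
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    graphemes: Array.from(segmenter.segment(input)).length,
    codePoints: Array.from(input).length,
    utf16Units: input.length,
    utf8Bytes: new TextEncoder().encode(input).length,
  };
}

console.log(countAll('Hello'));
// { graphemes: 5, codePoints: 5, utf16Units: 5, utf8Bytes: 5 }
console.log(countAll('Hello \u{1F600}'));
// { graphemes: 7, codePoints: 7, utf16Units: 8, utf8Bytes: 10 }
console.log(countAll('\u{1F1F8}\u{1F1EA}'));
// { graphemes: 1, codePoints: 2, utf16Units: 4, utf8Bytes: 8 }
console.log(countAll('cafe\u0301'));
// { graphemes: 4, codePoints: 5, utf16Units: 5, utf8Bytes: 6 }
console.log(countAll('cafe\u0301', { normalizeToNfc: true }));
// { graphemes: 4, codePoints: 4, utf16Units: 4, utf8Bytes: 5 }
```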