Unicode Ranges and cmap Tables
Understand how fonts map Unicode code points to glyphs, how unicode-range optimizes web font loading, and how subsetting affects character coverage.
Key Takeaways
- The cmap table maps Unicode code points to glyph IDs
- CSS unicode-range enables selective font loading by browsers
- Subsetting removes characters and updates cmap accordingly
- CJK fonts benefit greatly from unicode-range splitting
Every font needs a way to connect typed characters to visual glyphs. The cmap (character map) table is this bridge—it tells the text engine which glyph to display for each Unicode code point. Understanding cmap and Unicode ranges helps you optimize fonts for web delivery and troubleshoot rendering issues.
Unicode ranges also matter for web performance. The CSS unicode-range descriptor lets browsers download font files only when the page actually contains characters in that range. For multilingual sites or fonts with large character sets, this dramatically reduces unnecessary downloads.
The cmap Table Structure
The cmap table contains one or more subtables, each supporting a different platform/encoding combination. Modern fonts typically include:
cmap Table Structure
├── Header
│   ├── Version: 0
│   └── Number of subtables: 3 (typical)
│
├── Subtable 1: Platform 0, Encoding 3 (Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF
│
├── Subtable 2: Platform 3, Encoding 1 (Windows Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF (most common)
│
└── Subtable 3: Platform 3, Encoding 10 (Windows Unicode Full)
    ├── Format 12 (segmented coverage)
    └── Maps U+0000 to U+10FFFF (supplementary planes)
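The header and subtable list in the diagram can be read with a few lines of binary parsing. The following is a minimal sketch, assuming the layout of a 16-bit version, a 16-bit subtable count, and then one 8-byte encoding record (platform ID, encoding ID, 32-bit offset) per subtable. The function name `parse_cmap_header` and the sample bytes are made up for illustration; the offsets in the sample are placeholders with no subtable data behind them.

```python
import struct


def parse_cmap_header(data: bytes):
    """Return (version, records); each record is (platformID, encodingID, offset)."""
    # Header: two big-endian unsigned 16-bit values.
    version, num_tables = struct.unpack_from(">HH", data, 0)
    records = []
    for i in range(num_tables):
        # Each encoding record is 8 bytes and follows the 4-byte header.
        platform_id, encoding_id, offset = struct.unpack_from(">HHI", data, 4 + i * 8)
        records.append((platform_id, encoding_id, offset))
    return version, records


# Synthetic header mirroring the diagram (hypothetical offsets).
sample = struct.pack(">HH", 0, 3) + b"".join(
    struct.pack(">HHI", platform, encoding, offset)
    for platform, encoding, offset in [(0, 3, 28), (3, 1, 28), (3, 10, 300)]
)

version, records = parse_cmap_header(sample)
for platform, encoding, offset in records:
    print(f"Platform {platform}, Encoding {encoding}, subtable at byte {offset}")
```

Two records may share an offset, as in this sample, when the same subtable data serves more than one platform/encoding pair.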
Format 4 Example (Simplified):
Segments: [A-Z] → glyph 36-61, [a-z] → glyph 68-93
Entry: U+0041 ('A') → Glyph ID 36
Common cmap Formats
Format 4: Segment Mapping (BMP)
Most common format. Maps the Unicode BMP (U+0000–U+FFFF) as segments of contiguous code points, which keeps the table compact.
Format 6: Trimmed Table
A simple array mapping for a single contiguous range of code points. Less common.
Format 12: Segmented Coverage
Required for supplementary Unicode planes (emoji, historic scripts, etc.).
Format 4 vs Format 12: When Each Is Used
Format 4 uses a segment-based encoding with 16-bit character codes, so it is efficient for the Unicode Basic Multilingual Plane (BMP, U+0000–U+FFFF) but cannot address supplementary planes. A font that covers only BMP scripts such as Latin, Greek, Cyrillic, and Arabic can rely on Format 4 alone. Format 12 is needed whenever a font maps any character above U+FFFF, which includes emoji (for example the Emoticons block, U+1F600–U+1F64F), Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF), historic scripts, and the CJK extension blocks in the supplementary planes.
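To make the segment idea concrete, here is a simplified sketch of a Format 4-style lookup. It reuses the two segments from the example earlier in this section (A–Z → glyphs 36–61, a–z → glyphs 68–93). It is an illustration only: a real Format 4 subtable stores parallel arrays with per-segment deltas and offsets, and the names `SEGMENTS` and `lookup` are invented for this sketch.

```python
from bisect import bisect_left

# (start_code, end_code, first_glyph_id), sorted by end_code.
SEGMENTS = [
    (0x41, 0x5A, 36),  # A-Z
    (0x61, 0x7A, 68),  # a-z
]
END_CODES = [end for _, end, _ in SEGMENTS]


def lookup(code_point: int) -> int:
    """Return the glyph ID for code_point, or 0 (missing glyph) if unmapped."""
    if code_point > 0xFFFF:
        return 0  # 16-bit codes cannot represent supplementary planes
    # Find the first segment whose end code is >= the code point.
    i = bisect_left(END_CODES, code_point)
    if i == len(SEGMENTS):
        return 0
    start, _end, first_glyph = SEGMENTS[i]
    if code_point < start:
        return 0  # falls in a gap between segments
    return first_glyph + (code_point - start)


print(lookup(ord("A")))   # 36
print(lookup(ord("z")))   # 93
print(lookup(0x1F600))    # 0
```

Because a whole run of characters is described by one segment, the table stays small for scripts whose code points and glyph IDs are both contiguous.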
Supplementary Characters (require Format 12)
- Emoji: U+1F300–U+1FAFF (most emoji blocks)
- Mathematical Alphanumeric Symbols: U+1D400–U+1D7FF
- CJK Extensions B–F and CJK Compatibility Ideographs Supplement: U+20000–U+2FA1F
- Cuneiform: U+12000–U+123FF; Egyptian Hieroglyphs: U+13000–U+1342F
- Linear B Syllabary: U+10000–U+1007F; Linear A: U+10600–U+1077F
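A quick way to tell whether some content needs a font with supplementary-plane coverage is to scan it for code points above U+FFFF. The helper below is a small sketch; the function name `supplementary_code_points` is made up for this example.

```python
def supplementary_code_points(text: str) -> list[str]:
    """List the distinct code points above U+FFFF in text, in numeric order."""
    found = sorted({ord(ch) for ch in text if ord(ch) > 0xFFFF})
    return [f"U+{cp:04X}" for cp in found]


# U+1F600 is in the Emoticons block, U+1D400 in Mathematical Alphanumeric Symbols.
sample = "Hi \U0001F600 and \U0001D400"
print(supplementary_code_points(sample))
```

An empty result means every character in the text sits in the BMP.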
Dual Subtable Strategy
Fonts that cover supplementary characters typically include both Format 4 and Format 12 subtables:
- Format 4 covers BMP characters and keeps the font usable in older software that reads only a BMP subtable
- Format 12 covers supplementary characters
- Format 12 is a superset: it can map all of Format 4's range too, so a text engine that supports it can use it for every lookup
# Check cmap subtables in a font using fontTools
from fontTools.ttLib import TTFont

font = TTFont('myfont.ttf')
cmap = font['cmap']
for subtable in cmap.tables:
    print(f"Platform {subtable.platformID}, "
          f"Encoding {subtable.platEncID}, "
          f"Format {subtable.format}")

# Typical output for a modern font
# (the parenthesized labels are annotations, not part of the printed output):
# Platform 0, Encoding 3, Format 4 (Unicode BMP)
# Platform 0, Encoding 4, Format 12 (Unicode Full)
# Platform 3, Encoding 1, Format 4 (Windows Unicode BMP)
# Platform 3, Encoding 10, Format 12 (Windows Unicode Full)
CSS unicode-range Optimization
The unicode-range descriptor in @font-face tells browsers which code points a font file is intended to cover. Browsers download the file only when text using that font family includes characters in that range.
/* Split font into range-specific files */
@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin.woff2') format('woff2');
  unicode-range: U+0000-00FF; /* Basic Latin + Latin-1 Supplement */
}
@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin-ext.woff2') format('woff2');
  unicode-range: U+0100-024F; /* Latin Extended-A and Extended-B */
}
@font-face {
  font-family: 'MyFont';
  src: url('myfont-greek.woff2') format('woff2');
  unicode-range: U+0370-03FF; /* Greek and Coptic */
}
@font-face {
  font-family: 'MyFont';
  src: url('myfont-cyrillic.woff2') format('woff2');
  unicode-range: U+0400-04FF; /* Cyrillic */
}
/* Common Unicode range values */
U+0000-00FF /* Basic Latin + Latin-1 Supplement */
U+0100-017F /* Latin Extended-A */
U+0180-024F /* Latin Extended-B */
U+0250-02AF /* IPA Extensions */
U+0370-03FF /* Greek and Coptic */
U+0400-04FF /* Cyrillic */
U+4E00-9FFF /* CJK Unified Ideographs */
U+1F600-1F64F /* Emoticons (Emoji) */
Performance Benefit
An English-only page using a font with unicode-range declared will only download the Basic Latin subset. Greek, Cyrillic, and other files are never requested. This can reduce font downloads from hundreds of KB to under 30KB.
How Browsers Evaluate unicode-range
When a browser encounters multiple @font-face blocks with the same font-family and different unicode-range values, it looks at the text that uses that family and maps each character to its code point. For each character, it checks which block's unicode-range includes that code point, then downloads only the matching files. This check happens before any font is downloaded: it is a pure comparison of code points against the declared ranges.
Wildcard syntax: U+26?? matches U+2600–U+26FF (Miscellaneous Symbols block). The ? substitutes any hex digit.
Comma-separated ranges: U+0000-00FF, U+2000-206F combines multiple ranges in a single descriptor.
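No unicode-range = download whenever the family is used: Omitting the descriptor is equivalent to declaring the full range (U+0-10FFFF), so the browser downloads the font file whenever any text uses that family, regardless of which characters appear. This is the default behavior.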
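The matching logic described above can be approximated in a few lines. This is a rough sketch of the idea, not how any particular browser implements it: it handles single code points, ranges, `?` wildcards, and comma-separated lists, and ignores everything else a browser considers (which elements use the family, font weight and style, and so on). The function names and the `FACES` mapping are invented for the example; the file names are taken from the CSS shown earlier.

```python
def parse_unicode_range(descriptor: str) -> list[tuple[int, int]]:
    """Parse a CSS unicode-range value into inclusive (start, end) pairs."""
    ranges = []
    for token in descriptor.split(","):
        token = token.strip()
        if not token:
            continue
        body = token[2:]  # drop the "U+" prefix
        if "?" in body:
            # Each ? stands for any hex digit: lowest is 0, highest is F.
            start = int(body.replace("?", "0"), 16)
            end = int(body.replace("?", "F"), 16)
        elif "-" in body:
            low, high = body.split("-")
            start, end = int(low, 16), int(high, 16)
        else:
            start = end = int(body, 16)
        ranges.append((start, end))
    return ranges


def files_to_download(text: str, faces: dict[str, str]) -> list[str]:
    """Return the font files whose range contains at least one character of text."""
    code_points = {ord(ch) for ch in text}
    needed = []
    for url, descriptor in faces.items():
        ranges = parse_unicode_range(descriptor)
        if any(low <= cp <= high for cp in code_points for low, high in ranges):
            needed.append(url)
    return needed


FACES = {
    "myfont-latin.woff2": "U+0000-00FF",
    "myfont-greek.woff2": "U+0370-03FF",
    "myfont-cyrillic.woff2": "U+0400-04FF",
}

print(files_to_download("Hello", FACES))
print(files_to_download("Hello Привет", FACES))
```

With the sample mapping, English-only text selects just the Latin file, while mixed English and Russian text selects the Latin and Cyrillic files and still skips Greek.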
Unicode Ranges and Subsetting
Subsetting removes glyphs from a font, and the cmap table is updated to reflect only the characters that remain. Understanding this helps you subset correctly without losing needed characters.
Subsetting by Unicode Range
# Basic Latin + Latin-1 Supplement only
# (--flavor=woff2 is what makes the output WOFF2; it requires the brotli module)
pyftsubset font.ttf --unicodes="U+0000-00FF" --flavor=woff2 --output-file=latin.woff2

# Latin + Latin Extended
pyftsubset font.ttf --unicodes="U+0000-024F" --flavor=woff2 --output-file=latin-all.woff2

# Multiple specific ranges
pyftsubset font.ttf \
  --unicodes="U+0000-00FF,U+0100-017F,U+2000-206F" \
  --flavor=woff2 \
  --output-file=latin-plus.woff2

# From a text file (characters actually used)
pyftsubset font.ttf --text-file=content.txt --flavor=woff2 --output-file=custom.woff2
After subsetting, the cmap only includes mappings for retained glyphs. The font physically cannot render characters that were removed.
Common Subsetting Mistake
Don't forget punctuation and symbols when subsetting. Basic Latin (U+0020-007E) misses common characters like curly quotes (“ ”), em dashes (—), and the euro sign (€). Include U+2000-206F (General Punctuation) and U+20AC (Euro) for typical web content.
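Before committing to a set of ranges, it can help to check real content against them. The sketch below reports which characters in a sample string fall outside a list of ranges; the helper name `uncovered_characters` and the sample text are illustrative assumptions, and the ranges are the ones mentioned above.

```python
def uncovered_characters(text: str, ranges: list[tuple[int, int]]) -> list[str]:
    """Return the distinct characters of text not covered by any (start, end) range."""
    missing = {
        ch
        for ch in text
        if ch not in "\n\r\t"  # line breaks and tabs do not need glyphs
        and not any(low <= ord(ch) <= high for low, high in ranges)
    }
    return sorted(missing)


ASCII_ONLY = [(0x0020, 0x007E)]
WITH_PUNCTUATION = ASCII_ONLY + [(0x2000, 0x206F), (0x20AC, 0x20AC)]

# Curly quotes (U+201C, U+201D), an em dash (U+2014), and the euro sign (U+20AC).
sample = "\u201cPrice\u201d \u2014 \u20ac5"

print([f"U+{ord(ch):04X}" for ch in uncovered_characters(sample, ASCII_ONLY)])
print(uncovered_characters(sample, WITH_PUNCTUATION))
```

Running a check like this over representative pages is a cheap way to catch missing punctuation before a subset ships.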
Recommended Unicode Ranges for Common Use Cases
| Use Case | Recommended Ranges | Characters Covered |
|---|---|---|
| English-only site | U+0020-007E, U+2018-201D, U+2013-2014 | ASCII + curly quotes + dashes |
| Western European | U+0020-024F, U+2000-206F, U+20AC | Latin + Extended A/B + punctuation + euro |
| Pan-European | U+0020-04FF, U+2000-206F, U+20AC | Latin + Greek + Cyrillic + punctuation |
| E-commerce (numbers, currency) | U+0030-0039, U+0024, U+20AC, U+00A3, U+00A5 | Digits + $, €, £, ¥ currency symbols |
CJK Font Strategies
Chinese, Japanese, and Korean fonts present unique challenges due to their massive character sets (20,000+ glyphs). Unoptimized CJK fonts can exceed 10MB.
Strategy 1: Unicode Range Splitting
Split font into many small files by Unicode block. Browsers download only needed blocks.
Google Fonts uses ~100 subsets for Noto Sans CJK
Strategy 2: Dynamic Subsetting
Generate subsets on-demand containing only characters used on each page.
Requires server-side infrastructure
Strategy 3: System Font Fallback
Use web fonts for Latin, fall back to system fonts for CJK characters.
Most pragmatic for many sites
Strategy 4: Content-Based Subset
Analyze your content and create a custom subset of actually-used characters.
Good for static sites with known content
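For the content-based approach, the analysis step amounts to collecting the code points that actually appear and collapsing them into ranges. The following is a minimal sketch under that assumption; `to_unicode_ranges` is a made-up helper name. Its output uses the same U+XXXX-YYYY notation shown in the pyftsubset --unicodes examples and in CSS unicode-range, so one string can feed both the subsetting command and the matching @font-face rule.

```python
def to_unicode_ranges(text: str) -> str:
    """Collapse the code points used in text into a compact, comma-separated range list."""
    # Keep the ordinary space but skip other whitespace such as line breaks.
    code_points = sorted({ord(ch) for ch in text if ch == " " or not ch.isspace()})
    ranges: list[list[int]] = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return ",".join(
        f"U+{low:04X}" if low == high else f"U+{low:04X}-{high:04X}"
        for low, high in ranges
    )


print(to_unicode_ranges("abc 日本"))
```

For CJK text the result tends to be a long list of single code points, since the characters a page uses are usually scattered across the block rather than contiguous.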
Written by
Sarah Mitchell
Product Designer, Font Specialist
Verified by
Marcus Rodriguez
Lead Developer
