What is a cmap table and how does it work?

The cmap (character to glyph mapping) table maps Unicode code points to glyph IDs in the font. It contains multiple subtables for different platform/encoding combinations (Windows Unicode BMP, Windows Unicode full, Macintosh). When you type 'A' (U+0041), the cmap tells the renderer which glyph to use. Without proper cmap entries, characters won't display.

How does unicode-range work in CSS @font-face?

The unicode-range descriptor tells browsers which characters a font file covers. Browsers only download the font if text using that font family contains characters in that range. For example, unicode-range: U+0000-00FF covers Basic Latin and the Latin-1 Supplement. This enables efficient font loading for multilingual sites by splitting fonts into range-specific files that load on demand.

What happens to Unicode coverage when subsetting fonts?

Subsetting removes glyphs not in your specified character set, which also updates the cmap table to remove mappings for removed characters. The font's Unicode coverage shrinks to match retained characters. Always verify your subset includes all characters your content needs, including punctuation, symbols, and any special characters.

How do I handle CJK (Chinese, Japanese, Korean) fonts efficiently?

CJK fonts contain thousands of glyphs and can exceed 10MB. Strategies: 1) Use unicode-range to split into smaller files by character range. 2) Use dynamic subsetting services that generate subsets on-demand. 3) Subset to only the characters actually used on your site. 4) Consider system fonts for CJK text as fallback. Google Fonts automatically splits CJK fonts into ~100 subsets.

Why are some characters showing as boxes or question marks?

This indicates the font doesn't contain glyphs for those Unicode code points. Causes: 1) The font genuinely doesn't support those characters. 2) Subsetting removed needed characters. 3) The cmap table is missing entries. 4) Encoding mismatch between content and font. Check font coverage with FontDrop or similar tools and ensure your content encoding is UTF-8.

Unicode Ranges and cmap Tables

Understand how fonts map Unicode code points to glyphs, how unicode-range optimizes web font loading, and how subsetting affects character coverage.

Key Takeaways

• The cmap table maps Unicode code points to glyph IDs
• CSS unicode-range enables selective font loading by browsers
• Subsetting removes characters and updates cmap accordingly
• CJK fonts benefit greatly from unicode-range splitting

Every font needs a way to connect typed characters to visual glyphs. The cmap (character map) table is this bridge—it tells the text engine which glyph to display for each Unicode code point. Understanding cmap and Unicode ranges helps you optimize fonts for web delivery and troubleshoot rendering issues.

Unicode ranges also matter for web performance. The CSS unicode-range descriptor lets browsers download font files only when the page actually contains characters in that range. For multilingual sites or fonts with large character sets, this dramatically reduces unnecessary downloads.

The cmap Table Structure

The cmap table contains one or more subtables, each supporting a different platform/encoding combination. Modern fonts typically include:

cmap Table Structure
├── Header
│   ├── Version: 0
│   └── Number of subtables: 3 (typical)
│
├── Subtable 1: Platform 0, Encoding 3 (Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF
│
├── Subtable 2: Platform 3, Encoding 1 (Windows Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF (most common)
│
└── Subtable 3: Platform 3, Encoding 10 (Windows Unicode Full)
    ├── Format 12 (segmented coverage)
    └── Maps U+0000 to U+10FFFF (supplementary planes)

Format 4 Example (Simplified):
Segments: [A-Z] → glyph 36-61, [a-z] → glyph 68-93
Entry: U+0041 ('A') → Glyph ID 36
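
The segment idea can be sketched in a few lines of Python. This is an illustration only, reusing the simplified values above: a real Format 4 subtable stores parallel arrays (end codes, start codes, deltas, range offsets) in binary form, and the names SEGMENTS and lookup_glyph are hypothetical.

```python
# Simplified model of a Format 4 subtable: each segment maps a contiguous
# run of code points to a contiguous run of glyph IDs.
# The values mirror the simplified example above, not a real font.
SEGMENTS = [
    (0x0041, 0x005A, 36),  # 'A'-'Z' -> glyphs 36-61
    (0x0061, 0x007A, 68),  # 'a'-'z' -> glyphs 68-93
]


def lookup_glyph(code_point: int) -> int:
    """Return the glyph ID for a code point, or 0 (.notdef) if unmapped."""
    for start, end, first_glyph in SEGMENTS:
        if start <= code_point <= end:
            # Offset within the segment equals offset within the glyph run.
            return first_glyph + (code_point - start)
    return 0


print(lookup_glyph(ord("A")))  # 36
print(lookup_glyph(ord("z")))  # 93
print(lookup_glyph(ord("1")))  # 0 (no segment covers digits here)
```

Storing only the start, end, and first glyph of each run is what keeps the table small: 52 letters need two segments rather than 52 entries.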

Common cmap Formats

Format 4: Segment Mapping (BMP)

Most common format. Maps the Unicode BMP (U+0000–U+FFFF) using segments of contiguous code points for efficiency.

Format 6: Trimmed Table

Simple array mapping for a single contiguous character range. Less common.

Format 12: Segmented Coverage

Required for supplementary Unicode planes (emoji, historic scripts, etc.).

Format 4 vs Format 12: When Each Is Used

Format 4 uses a segment-based encoding that is efficient for the Unicode Basic Multilingual Plane (BMP, U+0000–U+FFFF) but cannot address supplementary planes, because it stores code points as 16-bit values. A font that only covers Latin, Greek, Cyrillic, Arabic, and most other modern scripts can rely on Format 4 alone. Format 12 is required whenever a font includes any character above U+FFFF. This includes emoji (for example the Emoticons block, U+1F600–U+1F64F), Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF), historic scripts, and the CJK extension blocks from Extension B onward.

Supplementary Characters (require Format 12)

• Emoji: U+1F300–U+1FAFF (most emoji blocks)
• Mathematical Alphanumeric Symbols: U+1D400–U+1D7FF
• CJK Extension B–F and CJK Compatibility Ideographs Supplement: U+20000–U+2FA1F
• Cuneiform, Egyptian Hieroglyphs: U+12000+
• Linear B Syllabary: U+10000–U+1007F (Linear A is at U+10600–U+1077F)
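
Whether a given piece of content needs a Format 12 subtable comes down to one comparison: does any code point exceed U+FFFF? The following is a minimal sketch of that check; the helper names are hypothetical and it inspects text only, not a font file.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def supplementary_code_points(text: str) -> list[str]:
    """Return U+XXXX labels for every character above the BMP."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > BMP_MAX]


def needs_format_12(text: str) -> bool:
    """True if the text contains any character a Format 4 subtable cannot map."""
    return bool(supplementary_code_points(text))


print(needs_format_12("Hello, \u03a9"))            # False: Greek is in the BMP
print(supplementary_code_points("Hi \U0001F600"))  # ['U+1F600']
```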

Dual Subtable Strategy

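Modern fonts that cover supplementary characters typically include both Format 4 and Format 12 subtables. How a text engine chooses between them is implementation-specific, but the division of labor is:

• Format 4 covers BMP characters only and keeps the font usable in software that does not read Format 12
• Format 12 is the only one of the two that can map supplementary characters
• Format 12 is a superset: it can map all of Format 4's range too, so an engine that supports it can use it for every lookup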

# Check cmap subtables in a font using fontTools
from fontTools.ttLib import TTFont
font = TTFont('myfont.ttf')
cmap = font['cmap']
for subtable in cmap.tables:
    print(f"Platform {subtable.platformID}, "
          f"Encoding {subtable.platEncID}, "
          f"Format {subtable.format}")
# Typical output for a modern font:
# Platform 0, Encoding 3, Format 4   (Unicode BMP)
# Platform 0, Encoding 4, Format 12  (Unicode Full)
# Platform 3, Encoding 1, Format 4   (Windows Unicode BMP)
# Platform 3, Encoding 10, Format 12 (Windows Unicode Full)

CSS unicode-range Optimization

The unicode-range descriptor in @font-face tells browsers which characters a font file contains. Browsers only download fonts when the page includes characters in that range.

/* Split font into range-specific files */
@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin.woff2') format('woff2');
  unicode-range: U+0000-00FF; /* Basic Latin + Latin-1 Supplement */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin-ext.woff2') format('woff2');
  unicode-range: U+0100-024F; /* Latin Extended-A and Extended-B */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-greek.woff2') format('woff2');
  unicode-range: U+0370-03FF; /* Greek */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-cyrillic.woff2') format('woff2');
  unicode-range: U+0400-04FF; /* Cyrillic */
}

/* Common Unicode range values */
U+0000-00FF    /* Basic Latin + Latin-1 Supplement */
U+0100-017F    /* Latin Extended-A */
U+0180-024F    /* Latin Extended-B */
U+0250-02AF    /* IPA Extensions */
U+0370-03FF    /* Greek and Coptic */
U+0400-04FF    /* Cyrillic */
U+4E00-9FFF    /* CJK Unified Ideographs */
U+1F600-1F64F  /* Emoticons (Emoji) */

Performance Benefit

An English-only page using a font with unicode-range declared will typically download only the Latin subset. Greek, Cyrillic, and other files are never requested as long as no text on the page falls in their ranges. Depending on the font, this can reduce font downloads from hundreds of KB to a few tens of KB.

How Browsers Evaluate unicode-range

When a browser encounters multiple @font-face blocks with the same font-family and different unicode-range values, it looks at the text that is styled with that family and maps each character to its code point. For each character, it checks which @font-face rule's unicode-range includes that code point, then downloads only the files for the rules that match. This check happens before any font is downloaded: it is a pure comparison of code points against the declared ranges, so it relies on the ranges in the CSS being accurate rather than on the contents of the font file.

• Wildcard syntax: U+26?? matches U+2600–U+26FF (the Miscellaneous Symbols block). Each ? stands for any hex digit and may only appear at the end of the value.
• Comma-separated ranges: U+0000-00FF, U+2000-206F combines multiple ranges in a single descriptor.
• No unicode-range: Omitting the descriptor is the same as declaring the default, U+0-10FFFF. The browser then downloads the font file whenever any text on the page uses that font family, whatever characters that text contains.
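
The matching logic described above can be modelled in Python. This is a rough sketch of the range comparison only, not of how any browser is implemented: it ignores which elements use the font family, and parse_unicode_range, files_needed, and FACES are hypothetical names. The file names and ranges are taken from the CSS example earlier.

```python
def parse_unicode_range(descriptor: str) -> list[tuple[int, int]]:
    """Parse a unicode-range value into inclusive (start, end) pairs."""
    ranges = []
    for part in descriptor.split(","):
        token = part.strip().upper()
        if not token.startswith("U+"):
            raise ValueError(f"not a unicode-range token: {part!r}")
        body = token[2:]
        if "?" in body:
            # Wildcard: each ? spans hex digits 0 through F.
            start = int(body.replace("?", "0"), 16)
            end = int(body.replace("?", "F"), 16)
        elif "-" in body:
            low, high = body.split("-", 1)
            start, end = int(low, 16), int(high, 16)
        else:
            start = end = int(body, 16)
        ranges.append((start, end))
    return ranges


# (file, unicode-range) pairs from the @font-face example above.
FACES = [
    ("myfont-latin.woff2", "U+0000-00FF"),
    ("myfont-latin-ext.woff2", "U+0100-024F"),
    ("myfont-greek.woff2", "U+0370-03FF"),
    ("myfont-cyrillic.woff2", "U+0400-04FF"),
]


def files_needed(text: str) -> list[str]:
    """Return the files whose declared range contains any character of text."""
    code_points = {ord(ch) for ch in text}
    needed = []
    for url, descriptor in FACES:
        ranges = parse_unicode_range(descriptor)
        if any(lo <= cp <= hi for cp in code_points for lo, hi in ranges):
            needed.append(url)
    return needed


print(parse_unicode_range("U+26??"))       # [(9728, 9983)], i.e. U+2600-26FF
print(files_needed("Hello"))               # ['myfont-latin.woff2']
print(files_needed("Caf\u00e9 \u03a9"))    # latin and greek files
```

Note that the function never opens a font file, which mirrors the point that the decision is made from the CSS alone.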

Unicode Ranges and Subsetting

Subsetting removes glyphs from a font, and the cmap table is updated to reflect only the characters that remain. Understanding this helps you subset correctly without losing needed characters.

Subsetting by Unicode Range

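# --flavor=woff2 makes pyftsubset write WOFF2 data (needs the brotli
# Python module); a .woff2 file extension alone does not compress the output.

# Basic Latin + Latin-1 Supplement
pyftsubset font.ttf --unicodes="U+0000-00FF" --flavor=woff2 --output-file=latin.woff2

# Latin + Latin Extended-A/B
pyftsubset font.ttf --unicodes="U+0000-024F" --flavor=woff2 --output-file=latin-all.woff2

# Multiple specific ranges
pyftsubset font.ttf \
  --unicodes="U+0000-00FF,U+0100-017F,U+2000-206F" \
  --flavor=woff2 \
  --output-file=latin-plus.woff2

# From a text file (characters actually used)
pyftsubset font.ttf --text-file=content.txt --flavor=woff2 --output-file=custom.woff2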

After subsetting, the cmap only includes mappings for retained glyphs. The font physically cannot render characters that were removed.

Common Subsetting Mistake

Don't forget punctuation and symbols when subsetting. Basic Latin (U+0020-007E) misses common characters like curly quotes (“ ”), em dashes (—), and the euro sign (€). Include U+2000-206F (General Punctuation) and U+20AC (Euro) for typical web content.
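
One way to catch this mistake before shipping is to compare sample content against the ranges you plan to keep. The sketch below does that with plain code point arithmetic; missing_characters is a hypothetical helper and the sample string is made up for illustration.

```python
def missing_characters(text: str, ranges: list[tuple[int, int]]) -> list[str]:
    """Return sorted U+XXXX labels for characters not covered by any range."""
    uncovered = {
        ord(ch)
        for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in ranges)
    }
    return [f"U+{cp:04X}" for cp in sorted(uncovered)]


# Curly quotes, an apostrophe, a euro sign and an em dash, written as escapes.
sample = "\u201cIt\u2019s \u20ac5\u201d \u2014 done"

ascii_only = [(0x0020, 0x007E)]
with_punctuation = [(0x0020, 0x007E), (0x2000, 0x206F), (0x20AC, 0x20AC)]

print(missing_characters(sample, ascii_only))
# ['U+2014', 'U+2019', 'U+201C', 'U+201D', 'U+20AC']
print(missing_characters(sample, with_punctuation))
# []
```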

Recommended Unicode Ranges for Common Use Cases

Use Case	Recommended Ranges	Characters Covered
English-only site	U+0020-007E, U+2018-201D, U+2013-2014	ASCII + curly quotes + dashes
Western European	U+0020-024F, U+2000-206F, U+20AC	Latin + Extended A/B + punctuation + euro
Pan-European	U+0020-04FF, U+2000-206F, U+20AC	Latin + Greek + Cyrillic + punctuation
E-commerce (numbers, currency)	U+0030-0039, U+0024, U+20AC, U+00A3, U+00A5	Digits + $, €, £, ¥ currency symbols

CJK Font Strategies

Chinese, Japanese, and Korean fonts present unique challenges due to their massive character sets (20,000+ glyphs). Unoptimized CJK fonts can exceed 10MB.

Strategy 1: Unicode Range Splitting

Split font into many small files by Unicode block. Browsers download only needed blocks.

Google Fonts uses ~100 subsets for Noto Sans CJK
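
As a rough illustration of range splitting, the sketch below cuts the CJK Unified Ideographs block into fixed-size slices and formats each as a unicode-range value. Fixed 256-code-point slices are an assumption made for simplicity, not a description of how Google Fonts chooses its subsets, and split_range is a hypothetical helper.

```python
def split_range(start: int, end: int, chunk_size: int) -> list[str]:
    """Split an inclusive code point range into unicode-range strings."""
    slices = []
    for low in range(start, end + 1, chunk_size):
        high = min(low + chunk_size - 1, end)
        slices.append(f"U+{low:04X}-{high:04X}")
    return slices


# CJK Unified Ideographs, U+4E00-9FFF, in slices of 256 code points.
cjk_slices = split_range(0x4E00, 0x9FFF, 256)

print(len(cjk_slices))               # 82
print(cjk_slices[0], cjk_slices[-1])  # U+4E00-4EFF U+9F00-9FFF
```

Each string can be used both as the unicode-range value of one @font-face rule and as the --unicodes argument when producing the matching subset file, which keeps the CSS and the font contents in agreement.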

Strategy 2: Dynamic Subsetting

Generate subsets on-demand containing only characters used on each page.

Requires server-side infrastructure

Strategy 3: System Font Fallback

Use web fonts for Latin, fall back to system fonts for CJK characters.

Most pragmatic for many sites

Strategy 4: Content-Based Subset

Analyze your content and create a custom subset of actually-used characters.

Good for static sites with known content
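
A content-based subset starts from the set of characters that actually appear. The sketch below collapses that set into a compact comma-separated list in the same style as the --unicodes values shown earlier. The function name is hypothetical, and reading real page content (HTML stripping, multiple files) is left out.

```python
def text_to_unicode_ranges(text: str) -> str:
    """Collapse the characters used in text into a comma-separated range list."""
    code_points = sorted({ord(ch) for ch in text})
    if not code_points:
        return ""
    runs = []
    start = prev = code_points[0]
    for cp in code_points[1:]:
        if cp == prev + 1:
            prev = cp  # extend the current run of consecutive code points
            continue
        runs.append((start, prev))
        start = prev = cp
    runs.append((start, prev))
    return ",".join(
        f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-{hi:04X}"
        for lo, hi in runs
    )


print(text_to_unicode_ranges("abc\u4e2d\u6587"))
# U+0061-0063,U+4E2D,U+6587
```

For content that is already in a file, the --text-file option shown earlier achieves the same result without this step; a range string is mainly useful when the same value also has to go into a unicode-range descriptor.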

Written by

Sarah Mitchell

Product Designer, Font Specialist

Verified by

Marcus Rodriguez

Lead Developer

Unicode Ranges FAQs

Common questions about cmap tables and character mapping