Unicode Ranges and cmap Tables

Understand how fonts map Unicode code points to glyphs, how unicode-range optimizes web font loading, and how subsetting affects character coverage.

Key Takeaways

  • The cmap table maps Unicode code points to glyph IDs
  • CSS unicode-range enables selective font loading by browsers
  • Subsetting removes characters and updates cmap accordingly
  • CJK fonts benefit greatly from unicode-range splitting

Every font needs a way to connect typed characters to visual glyphs. The cmap (character map) table is this bridge—it tells the text engine which glyph to display for each Unicode code point. Understanding cmap and Unicode ranges helps you optimize fonts for web delivery and troubleshoot rendering issues.

Unicode ranges also matter for web performance. The CSS unicode-range descriptor lets browsers download font files only when the page actually contains characters in that range. For multilingual sites or fonts with large character sets, this dramatically reduces unnecessary downloads.

The cmap Table Structure

The cmap table contains one or more subtables, each supporting a different platform/encoding combination. Modern fonts typically include:

cmap Table Structure
├── Header
│   ├── Version: 0
│   └── Number of subtables: 3 (typical)
│
├── Subtable 1: Platform 0, Encoding 3 (Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF
│
├── Subtable 2: Platform 3, Encoding 1 (Windows Unicode BMP)
│   ├── Format 4 (segment mapping)
│   └── Maps U+0000 to U+FFFF (most common)
│
└── Subtable 3: Platform 3, Encoding 10 (Windows Unicode Full)
    ├── Format 12 (segmented coverage)
    └── Maps U+0000 to U+10FFFF (supplementary planes)

Format 4 Example (Simplified):
Segments: [A-Z] → glyph 36-61, [a-z] → glyph 68-93
Entry: U+0041 ('A') → Glyph ID 36
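
The segment idea above can be sketched in a few lines of Python. This is a minimal illustration, not the binary layout: the SEGMENTS list and the lookup_glyph function are hypothetical names, and a real Format 4 subtable also carries idRangeOffset values and a terminating 0xFFFF segment. Each segment stores a start code, an end code, and a delta that is added to the code point to get the glyph ID.

```python
from bisect import bisect_left

# Hypothetical, simplified segments: (start_code, end_code, id_delta).
# The values reproduce the example above: A-Z -> 36-61, a-z -> 68-93.
SEGMENTS = [
    (0x0041, 0x005A, -29),  # 'A' (65) - 29 = glyph 36
    (0x0061, 0x007A, -29),  # 'a' (97) - 29 = glyph 68
]
END_CODES = [end for _, end, _ in SEGMENTS]


def lookup_glyph(code_point: int) -> int:
    """Return the glyph ID for a BMP code point, or 0 (.notdef) if unmapped."""
    # Binary search for the first segment whose end code is >= the code point.
    i = bisect_left(END_CODES, code_point)
    if i == len(SEGMENTS):
        return 0
    start, _end, id_delta = SEGMENTS[i]
    if code_point < start:
        return 0  # falls in the gap between two segments
    # Format 4 computes glyph IDs modulo 65536.
    return (code_point + id_delta) & 0xFFFF


print(lookup_glyph(ord("A")))  # 36
print(lookup_glyph(ord("z")))  # 93
print(lookup_glyph(ord("0")))  # 0, because digits are not in any segment
```

Storing one delta per contiguous run is what makes the format compact: 52 letters need only two segments here.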

Common cmap Formats

Format 4: Segment Mapping (BMP)

Most common format. Maps Unicode BMP (U+0000-FFFF) using segments for efficiency.

Format 6: Trimmed Table

Simple array mapping for a contiguous character range. Less common.

Format 12: Segmented Coverage

Required for supplementary Unicode planes (emoji, historic scripts, etc.)

Format 4 vs Format 12: When Each Is Used

Format 4 uses a segment-based encoding that is efficient for the Unicode Basic Multilingual Plane (BMP, U+0000–U+FFFF) but cannot address supplementary planes. A font that only covers Latin, Greek, Cyrillic, Arabic, and most other modern scripts can rely on Format 4 alone. Format 12 is required whenever a font includes any character above U+FFFF—this includes emoji (U+1F600–U+1F64F), mathematical symbols (U+1D400–U+1D7FF), historic scripts, and rare CJK extension blocks.

Supplementary Characters (require Format 12)

  • • Emoji: U+1F300–U+1FAFF (most emoji blocks)
  • • Mathematical Alphanumeric Symbols: U+1D400–U+1D7FF
  • CJK Extensions B–F and CJK Compatibility Ideographs Supplement: U+20000–U+2FA1F
  • Cuneiform, Egyptian Hieroglyphs: U+12000+
  • Linear B: U+10000–U+100FF; Linear A: U+10600–U+1077F
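
A quick way to tell whether a piece of content needs a Format 12 subtable is to look for any code point above U+FFFF. The helper names below are hypothetical; this is a small sketch of that check in plain Python, not part of any font library.

```python
def supplementary_code_points(text: str) -> list[str]:
    """Return the distinct code points above U+FFFF in text, as U+XXXX strings."""
    found = sorted({ord(ch) for ch in text if ord(ch) > 0xFFFF})
    return [f"U+{cp:04X}" for cp in found]


def needs_format_12(text: str) -> bool:
    """True if text contains at least one supplementary-plane character."""
    return bool(supplementary_code_points(text))


print(needs_format_12("Hello"))                       # False
print(supplementary_code_points("Hi \U0001F600"))     # ['U+1F600']
```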

Dual Subtable Strategy

Modern fonts that cover supplementary characters typically include both Format 4 and Format 12 subtables:

  • Format 4 covers BMP characters and is kept for compatibility with software that reads only BMP subtables
  • Format 12 covers supplementary characters
  • Format 12 is a superset: it can map all of Format 4's range too, so many text engines simply use the Format 12 subtable for every lookup when it is present

# Check cmap subtables in a font using fontTools
from fontTools.ttLib import TTFont
font = TTFont('myfont.ttf')
cmap = font['cmap']
for subtable in cmap.tables:
    print(f"Platform {subtable.platformID}, "
          f"Encoding {subtable.platEncID}, "
          f"Format {subtable.format}")
# Typical output for a modern font:
# Platform 0, Encoding 3, Format 4   (Unicode BMP)
# Platform 0, Encoding 4, Format 12  (Unicode Full)
# Platform 3, Encoding 1, Format 4   (Windows Unicode BMP)
# Platform 3, Encoding 10, Format 12 (Windows Unicode Full)

CSS unicode-range Optimization

The unicode-range descriptor in @font-face tells browsers which characters a font file contains. Browsers only download fonts when the page includes characters in that range.

/* Split font into range-specific files */
@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin.woff2') format('woff2');
  unicode-range: U+0000-00FF; /* Basic Latin + Latin-1 Supplement */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-latin-ext.woff2') format('woff2');
  unicode-range: U+0100-024F; /* Latin Extended */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-greek.woff2') format('woff2');
  unicode-range: U+0370-03FF; /* Greek */
}

@font-face {
  font-family: 'MyFont';
  src: url('myfont-cyrillic.woff2') format('woff2');
  unicode-range: U+0400-04FF; /* Cyrillic */
}

/* Common Unicode range values */
U+0000-00FF    /* Basic Latin + Latin-1 Supplement */
U+0100-017F    /* Latin Extended-A */
U+0180-024F    /* Latin Extended-B */
U+0250-02AF    /* IPA Extensions */
U+0370-03FF    /* Greek and Coptic */
U+0400-04FF    /* Cyrillic */
U+4E00-9FFF    /* CJK Unified Ideographs */
U+1F600-1F64F  /* Emoticons (Emoji) */

Performance Benefit

On an English-only page, a font split with unicode-range triggers a download of only the Latin file; the Greek, Cyrillic, and other files are never requested. Depending on the font, this can cut font downloads from hundreds of KB to under 30 KB.

How Browsers Evaluate unicode-range

When a browser encounters multiple @font-face blocks with the same font-family and different unicode-range values, it looks at the text that is actually rendered with that family and maps each character to its code point. For each character, it checks which @font-face rule's unicode-range includes that code point, then downloads only those files. The check needs no font data: it is a pure comparison of code points against the declared ranges, so it happens before any font is downloaded.

Wildcard syntax: U+26?? matches U+2600–U+26FF (Miscellaneous Symbols block). The ? substitutes any hex digit.

Comma-separated ranges: U+0000-00FF, U+2000-206F combines multiple ranges in a single descriptor.

No unicode-range = full range: Omitting the descriptor is equivalent to declaring U+0-10FFFF, so the browser downloads the font file whenever any text on the page uses that font-family, whatever characters the text contains. This is the default behavior.
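
The matching rules in this section can be modeled in Python. The sketch below is an approximation of the logic, not browser code: parse_unicode_range, files_needed, and the FACES mapping are hypothetical names, and the file names simply mirror the @font-face examples earlier in this article.

```python
def parse_unicode_range(descriptor: str) -> list[tuple[int, int]]:
    """Parse a unicode-range value into inclusive (start, end) pairs."""
    ranges = []
    for token in descriptor.split(","):
        token = token.strip()
        if not token:
            continue
        body = token[2:]  # drop the leading "U+"
        if "?" in body:
            # Each ? stands for any hex digit: 0 at the low end, F at the high end.
            start = int(body.replace("?", "0"), 16)
            end = int(body.replace("?", "F"), 16)
        elif "-" in body:
            low, high = body.split("-")
            start, end = int(low, 16), int(high, 16)
        else:
            start = end = int(body, 16)
        ranges.append((start, end))
    return ranges


# Hypothetical faces: font file -> its declared unicode-range.
FACES = {
    "myfont-latin.woff2": "U+0000-00FF",
    "myfont-greek.woff2": "U+0370-03FF",
    "myfont-cyrillic.woff2": "U+0400-04FF",
}


def files_needed(text: str) -> list[str]:
    """Return the font files whose range contains at least one character of text."""
    code_points = {ord(ch) for ch in text}
    needed = []
    for filename, descriptor in FACES.items():
        ranges = parse_unicode_range(descriptor)
        if any(lo <= cp <= hi for cp in code_points for lo, hi in ranges):
            needed.append(filename)
    return needed


print(parse_unicode_range("U+26??"))  # [(9728, 9983)], i.e. U+2600 to U+26FF
print(files_needed("Hello"))          # ['myfont-latin.woff2']
```

Note that no font data is consulted anywhere: the decision depends only on the text and the declared ranges.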

Unicode Ranges and Subsetting

Subsetting removes glyphs from a font, and the cmap table is updated to reflect only the characters that remain. Understanding this helps you subset correctly without losing needed characters.

Subsetting by Unicode Range

# --flavor=woff2 is needed for WOFF2 output (requires the brotli module);
# the .woff2 extension alone does not change the output format.

# Basic Latin + Latin-1 Supplement
pyftsubset font.ttf --unicodes="U+0000-00FF" --flavor=woff2 --output-file=latin.woff2

# Latin + Latin Extended
pyftsubset font.ttf --unicodes="U+0000-024F" --flavor=woff2 --output-file=latin-all.woff2

# Multiple specific ranges
pyftsubset font.ttf \
  --unicodes="U+0000-00FF,U+0100-017F,U+2000-206F" \
  --flavor=woff2 \
  --output-file=latin-plus.woff2

# From a text file (characters actually used)
pyftsubset font.ttf --text-file=content.txt --flavor=woff2 --output-file=custom.woff2

After subsetting, the cmap only includes mappings for retained glyphs. The font physically cannot render characters that were removed.

Common Subsetting Mistake

Don't forget punctuation and symbols when subsetting. Basic Latin (U+0020-007E) misses common characters like curly quotes (“ ”), em dashes (—), and the euro sign (€). Include U+2000-206F (General Punctuation) and U+20AC (Euro) for typical web content.
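
One way to catch this mistake before shipping is to compare sample content against the ranges you plan to keep. The following is a minimal sketch under that assumption; PLANNED and uncovered are hypothetical names, and the sample string is invented for illustration.

```python
# Hypothetical planned subset: printable ASCII only.
PLANNED = [(0x0020, 0x007E)]


def uncovered(text: str, ranges: list[tuple[int, int]]) -> list[str]:
    """List the characters in text that fall outside every planned range."""
    missing = sorted({
        ch for ch in text
        if ch not in "\n\r\t"
        and not any(lo <= ord(ch) <= hi for lo, hi in ranges)
    })
    return [f"U+{ord(ch):04X} {ch}" for ch in missing]


# Sample with a euro sign, an em dash, and curly quotes (written as escapes).
sample = "Price: \u20ac5 \u2014 \u201csale\u201d"
for entry in uncovered(sample, PLANNED):
    print(entry)

# Adding General Punctuation and the euro sign closes the gap.
extended = PLANNED + [(0x2000, 0x206F), (0x20AC, 0x20AC)]
print(uncovered(sample, extended))  # []
```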

Recommended Unicode Ranges for Common Use Cases

Use Case | Recommended Ranges | Characters Covered
English-only site | U+0020-007E, U+2018-201D, U+2013-2014 | ASCII + curly quotes + dashes
Western European | U+0020-024F, U+2000-206F, U+20AC | Latin + Extended A/B + punctuation + euro
Pan-European | U+0020-04FF, U+2000-206F, U+20AC | Latin + Greek + Cyrillic + punctuation
E-commerce (numbers, currency) | U+0030-0039, U+0024, U+20AC, U+00A3, U+00A5 | Digits + $, €, £, ¥ currency symbols

CJK Font Strategies

Chinese, Japanese, and Korean fonts present unique challenges due to their massive character sets (20,000+ glyphs). Unoptimized CJK fonts can exceed 10MB.

Strategy 1: Unicode Range Splitting

Split font into many small files by Unicode block. Browsers download only needed blocks.

Google Fonts uses ~100 subsets for Noto Sans CJK

Strategy 2: Dynamic Subsetting

Generate subsets on-demand containing only characters used on each page.

Requires server-side infrastructure

Strategy 3: System Font Fallback

Use web fonts for Latin, fall back to system fonts for CJK characters.

Most pragmatic for many sites

Strategy 4: Content-Based Subset

Analyze your content and create a custom subset of actually-used characters.

Good for static sites with known content
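
For a content-based subset, the list of used characters can be collapsed into a compact range string in the same U+XXXX-YYYY style that the pyftsubset examples in this article pass to --unicodes. This is a rough sketch, and unicodes_argument is a hypothetical helper name; passing the text through --text-file achieves the same result without this step.

```python
def unicodes_argument(text: str) -> str:
    """Collapse the code points used in text into a compact range list."""
    code_points = sorted({ord(ch) for ch in text if ch not in "\n\r\t"})
    parts = []
    i = 0
    while i < len(code_points):
        start = end = code_points[i]
        # Extend the run while the next code point is consecutive.
        while i + 1 < len(code_points) and code_points[i + 1] == end + 1:
            i += 1
            end = code_points[i]
        if start == end:
            parts.append(f"U+{start:04X}")
        else:
            parts.append(f"U+{start:04X}-{end:04X}")
        i += 1
    return ",".join(parts)


print(unicodes_argument("abcz"))  # U+0061-0063,U+007A
```

Seeing the ranges written out also makes it easier to review exactly what a subset will and will not contain.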

Written by Sarah Mitchell, Product Designer, Font Specialist

Verified by Marcus Rodriguez, Lead Developer
