l10n-expansion-data.

An open localization dataset that estimates text expansion ratios by language pair, source-length bucket, and statistical confidence level.

Role: Data Product / i18n / Tooling

Year: 2026

Platform: Dataset / Package

Status: Open data

Export formats: JSON, CSV, YAML

Data strategies

Corpus source: OPUS-100

License: CC0

Localized UI often fails because teams guess how much translated text will expand.

Buttons, navigation labels, dialogs, and dense product surfaces can break after translation if the design system has no quantitative expansion model.

l10n-expansion-data packages text expansion statistics as a practical engineering asset instead of a one-off spreadsheet or rule of thumb.

C1 Corpus source
OPUS-100 parallel sentence pairs provide the basis for language-pair expansion measurements.
C2 Length buckets
Source strings are grouped by length so short UI labels and longer body text are not forced into one average.
C3 Multi-format export
JSON, CSV, and YAML outputs make the data easy to consume from front-end code, backend services, and analysis tools.
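
The three components above can be sketched as a small pipeline. This is a minimal illustration rather than the project's actual code: the bucket edges, the field names (`n`, `mean`, `median`), and the helper names are all assumptions, and character count stands in for rendered width. Only aggregates are emitted, never the source sentences.

```python
import csv
import io
import json
import statistics

# Hypothetical bucket edges (source length in characters); the real dataset may differ.
BUCKETS = [(1, 10), (11, 20), (21, 50), (51, 10_000)]


def bucket_for(length):
    """Return a label such as "1-10", or None if the length fits no bucket."""
    for low, high in BUCKETS:
        if low <= length <= high:
            return f"{low}-{high}"
    return None


def expansion_stats(pairs):
    """Aggregate target/source length ratios per bucket from (source, target) pairs."""
    ratios = {}
    for source, target in pairs:
        label = bucket_for(len(source))
        if label is None:  # also skips empty sources, so no division by zero
            continue
        ratios.setdefault(label, []).append(len(target) / len(source))
    return {
        label: {
            "n": len(values),
            "mean": round(statistics.mean(values), 3),
            "median": round(statistics.median(values), 3),
        }
        for label, values in ratios.items()
    }


def to_json(stats):
    return json.dumps(stats, indent=2, sort_keys=True)


def to_csv(stats):
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["bucket", "n", "mean", "median"])
    for label, row in stats.items():
        writer.writerow([label, row["n"], row["mean"], row["median"]])
    return buffer.getvalue()


# Toy English-German pairs standing in for a parallel corpus.
pairs = [
    ("Save", "Speichern"),
    ("Open", "Öffnen"),
    ("Cancel the upload now", "Upload jetzt abbrechen"),
]
stats = expansion_stats(pairs)
print(to_csv(stats))
```

YAML export is left out of the sketch because it would need a third-party library; the same `stats` dictionary could be passed to any YAML serializer.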

Statistics only over raw corpus redistribution

Publishing aggregate metadata keeps the package useful while avoiding raw text redistribution concerns.

Bucketed ratios over one global multiplier

Short strings expand differently from long strings, so bucketed statistics better match UI risk.
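
To make the difference concrete, here is a rough sketch comparing a bucketed lookup with a single multiplier. Every number below is invented for illustration and is not taken from the dataset; `estimate` and `bucketed_ratio` are hypothetical helpers.

```python
import math

# Illustrative values only; not measurements from the dataset.
GLOBAL_MULTIPLIER = 1.25
BUCKETED = [  # (max source length in characters, expansion ratio)
    (10, 2.0),
    (20, 1.5),
    (50, 1.25),
]
DEFAULT_RATIO = 1.125  # used for sources longer than the last bucket


def bucketed_ratio(source_len):
    """Pick the ratio of the first bucket that contains source_len."""
    for max_len, ratio in BUCKETED:
        if source_len <= max_len:
            return ratio
    return DEFAULT_RATIO


def estimate(source_len, use_buckets=True):
    """Estimate translated length in characters, rounded up."""
    ratio = bucketed_ratio(source_len) if use_buckets else GLOBAL_MULTIPLIER
    return math.ceil(source_len * ratio)


print(estimate(4), estimate(4, use_buckets=False))
```

With these made-up numbers, a four-character label is budgeted 8 characters by the bucketed table but only 5 by the global multiplier, which is the kind of gap that causes truncated buttons.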

Simple and detailed outputs over one heavy schema

Runtime estimators can use small single-value maps, while design tooling can use richer percentile data.
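
One way to picture the two output levels: a detailed record per bucket, and a simple map derived from it by keeping a single field. The schema, the field names, and the values here are assumptions made for the sketch, not the package's published format.

```python
# Hypothetical detailed output: language pair -> bucket -> statistics.
# All values are invented for illustration.
DETAILED = {
    "en-de": {
        "1-10": {"mean": 1.9, "median": 1.8, "p90": 2.6, "min": 0.5, "max": 4.0},
        "11-20": {"mean": 1.5, "median": 1.4, "p90": 2.0, "min": 0.6, "max": 3.1},
    }
}


def to_simple(detailed, field="median"):
    """Collapse each bucket's record to one number for lightweight runtime use."""
    return {
        pair: {bucket: stats[field] for bucket, stats in buckets.items()}
        for pair, buckets in detailed.items()
    }


simple = to_simple(DETAILED)
print(simple)
```

A runtime estimator would ship only `simple`, while design tooling that needs percentiles or ranges would read the detailed records directly.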

Design systems can reserve space based on measured expansion behavior instead of intuition.

Localization tools can flag strings likely to overflow before they reach manual review.
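
A pre-review overflow check along these lines might look like the following. It is a sketch under stated assumptions: `UiString`, its `max_chars` budget, and the `ratio_for_length` callback are hypothetical, and character count is only a proxy for rendered width.

```python
import math
from dataclasses import dataclass


@dataclass
class UiString:
    key: str
    text: str
    max_chars: int  # hypothetical per-component character budget


def flag_overflow(strings, ratio_for_length):
    """Return (key, estimated_chars, max_chars) for strings likely to overflow."""
    flagged = []
    for item in strings:
        source_len = len(item.text)
        estimated = math.ceil(source_len * ratio_for_length(source_len))
        if estimated > item.max_chars:
            flagged.append((item.key, estimated, item.max_chars))
    return flagged


def demo_ratio(source_len):
    # Invented ratios: short strings are assumed to expand more.
    return 2.0 if source_len <= 10 else 1.5


strings = [
    UiString("btn.save", "Save", 10),
    UiString("btn.settings", "Settings", 12),
]
flagged = flag_overflow(strings, demo_ratio)
print(flagged)
```

In practice the callback would look up the ratio for the target language pair and source-length bucket, and the flagged keys would be surfaced to designers or translators before manual review.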

Mean, median, percentile, and range values let teams choose conservative or compact layout strategies.
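
As a rough illustration of how those summary values relate to layout strategy, the sketch below summarizes a list of ratios and maps a strategy name to one statistic. The strategy names, the choice of the 90th percentile, and the sample ratios are assumptions for the example.

```python
import statistics


def summarize(ratios):
    """Summarize expansion ratios; needs at least two values."""
    ordered = sorted(ratios)
    # quantiles(n=100) returns 99 cut points; index 89 is the 90th percentile.
    # method="inclusive" keeps results within the observed min/max.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": cuts[89],
        "min": ordered[0],
        "max": ordered[-1],
    }


# Hypothetical mapping from layout strategy to the statistic it relies on.
STRATEGY_FIELD = {"compact": "median", "balanced": "mean", "conservative": "p90"}


def ratio_for(summary, strategy):
    return summary[STRATEGY_FIELD[strategy]]


sample = [1.0, 1.2, 1.5, 2.0, 3.0]  # invented ratios for one bucket
summary = summarize(sample)
print(ratio_for(summary, "compact"), ratio_for(summary, "conservative"))
```

A compact layout accepts that roughly half of translations will exceed the median-based budget, while a conservative one reserves enough room for most of them at the cost of looser spacing.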

CC0 licensing makes the derived statistics easy to integrate into open source and commercial workflows.
