l10n-expansion-data.

An open localization dataset that estimates text expansion ratios by language pair, source-length bucket, and statistical confidence level.

Role: Data Product / i18n / Tooling
Year: 2026
Platform: Dataset / Package
Status: Open data
Export formats: 3
Data strategies: 2
Corpus source: OPUS-100
License: CC0

Localized UI often fails because teams guess how much translated text will expand.

Buttons, navigation labels, dialogs, and dense product surfaces can break after translation if the design system has no quantitative expansion model.

l10n-expansion-data packages text expansion statistics as a practical engineering asset instead of a one-off spreadsheet or rule of thumb.

The dataset turns parallel corpora into lightweight runtime and analysis artifacts.

  1. Corpus source

    OPUS-100 parallel sentence pairs provide the basis for language-pair expansion measurements.

  2. Length buckets

    Source strings are grouped by length so short UI labels and longer body text are not forced into one average.

  3. Multi-format export

    JSON, CSV, and YAML outputs make the data easy to consume from front-end code, backend services, and analysis tools.
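
As a rough illustration of the length-bucket step, the sketch below groups target/source length ratios by source-length bucket. The bucket boundaries, the use of character counts as the length measure, and the names `BUCKETS`, `bucket_label`, and `expansion_stats` are assumptions made for this example, not the dataset's actual pipeline. The sentence pairs are invented toy data, not OPUS-100 content.

```python
from statistics import mean, median

# Hypothetical bucket boundaries (source length in characters).
# The dataset's real boundaries may differ.
BUCKETS = [(1, 10), (11, 30), (31, 80), (81, None)]


def bucket_label(length):
    """Return the label of the bucket containing `length`, or None if empty."""
    for low, high in BUCKETS:
        if length >= low and (high is None or length <= high):
            return f"{low}+" if high is None else f"{low}-{high}"
    return None  # length 0: no ratio can be computed


def expansion_stats(pairs):
    """Group target/source length ratios by source-length bucket."""
    ratios = {}
    for source, target in pairs:
        label = bucket_label(len(source))
        if label is None:
            continue  # skip empty source strings
        ratios.setdefault(label, []).append(len(target) / len(source))
    return {
        label: {
            "count": len(values),
            "mean": round(mean(values), 3),
            "median": round(median(values), 3),
        }
        for label, values in ratios.items()
    }


# Toy English -> German pairs, made up for illustration only.
pairs = [
    ("Save", "Speichern"),
    ("Cancel", "Abbrechen"),
    ("Settings", "Einstellungen"),
    ("Please enter your password.", "Bitte geben Sie Ihr Passwort ein."),
]
stats = expansion_stats(pairs)
print(stats)
```

Even in this tiny sample the short labels expand far more than the full sentence, which is the effect bucketing is meant to keep visible.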

The project is designed as a reusable data product, not just a research artifact.

Statistics only over raw corpus redistribution

Publishing aggregate metadata keeps the package useful while avoiding raw text redistribution concerns.

Bucketed ratios over one global multiplier

Short strings expand differently from long strings, so bucketed statistics better match UI risk.

Simple and detailed outputs over one heavy schema

Runtime estimators can use small single-value maps, while design tooling can use richer percentile data.
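
One way to picture the relationship between the two outputs: a simple map is a projection of the detailed one onto a single statistic. The schema below (language-pair keys, bucket keys, and fields such as `median` and `p95`) and the helper `to_simple` are hypothetical, and every number is invented for illustration. The published files may be shaped differently.

```python
import json

# Hypothetical "detailed" export: percentile statistics per pair and bucket.
# All values are invented for this example.
DETAILED_JSON = """
{
  "en-de": {
    "1-10":  {"mean": 1.8, "median": 1.7, "p95": 3.0},
    "11-30": {"mean": 1.3, "median": 1.25, "p95": 1.9}
  }
}
"""


def to_simple(detailed, stat="median"):
    """Project the detailed schema onto one value per pair and bucket."""
    return {
        pair: {bucket: stats[stat] for bucket, stats in buckets.items()}
        for pair, buckets in detailed.items()
    }


detailed = json.loads(DETAILED_JSON)
simple = to_simple(detailed, stat="median")

# The simple map is small enough to ship with runtime code.
print(json.dumps(simple))
```

A runtime estimator only needs `simple[pair][bucket]`, while design tooling can keep the full `detailed` structure and read whichever percentile it needs.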

The dataset supports practical i18n decisions before layout bugs appear.

Layout planning

Design systems can reserve space based on measured expansion behavior instead of intuition.

QA integration

Localization tools can flag strings likely to overflow before they reach manual review.

Risk profiles

Mean, median, percentile, and range values let teams choose conservative or compact layout strategies.

Reusable license

CC0 licensing makes the derived statistics easy to integrate into open source and commercial workflows.
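
The QA and risk-profile uses described above could be combined along these lines. This is a minimal sketch under stated assumptions: the statistics in `STATS` are invented, the profile names in `PROFILES` are hypothetical, and length is measured in characters rather than rendered pixels.

```python
import math

# Hypothetical statistics for one language pair and bucket; values invented.
STATS = {"mean": 1.8, "median": 1.7, "p95": 3.0}

# Hypothetical mapping from a layout strategy to the statistic it relies on.
PROFILES = {"compact": "median", "balanced": "mean", "conservative": "p95"}


def flag_overflow(strings, max_chars, profile="conservative"):
    """Return (string, estimated_length) for strings likely to exceed max_chars."""
    ratio = STATS[PROFILES[profile]]
    flagged = []
    for text in strings:
        estimated = math.ceil(len(text) * ratio)
        if estimated > max_chars:
            flagged.append((text, estimated))
    return flagged


labels = ["OK", "Save", "Continue"]
print(flag_overflow(labels, max_chars=16, profile="conservative"))
print(flag_overflow(labels, max_chars=16, profile="compact"))
```

With these made-up numbers, the conservative profile flags "Continue" for a 16-character slot while the compact profile lets it through, which shows how the choice of statistic sets the risk tolerance.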

The natural extension is connecting the dataset directly to localization tooling.

  • Expose helper functions for common UI expansion estimates
  • Integrate with LexiSync QA checks
  • Add examples for design systems and front-end layout tests