l10n-expansion-data
An open localization dataset that estimates text expansion ratios by language pair, source-length bucket, and statistical confidence level.
Localized UI often fails because teams guess how much translated text will expand.
Buttons, navigation labels, dialogs, and dense product surfaces can break after translation if the design system has no quantitative expansion model.
l10n-expansion-data packages text expansion statistics as a practical engineering asset instead of a one-off spreadsheet or rule of thumb.
The dataset turns parallel corpora into lightweight runtime and analysis artifacts.
- Corpus source
OPUS-100 parallel sentence pairs provide the basis for language-pair expansion measurements.
- Length buckets
Source strings are grouped by length so short UI labels and longer body text are not forced into one average.
- Multi-format export
JSON, CSV, and YAML outputs make the data easy to consume from front-end code, backend services, and analysis tools.
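As a rough illustration of the pipeline described above, the following sketch computes per-bucket expansion ratios from a handful of parallel pairs and serializes them to JSON. It assumes that expansion is measured as target characters divided by source characters; the bucket boundaries, the field names, and the sample pairs are all hypothetical and are not taken from the published dataset or from OPUS-100.

```python
import json
import statistics
from collections import defaultdict

# Hypothetical bucket boundaries in source characters (inclusive).
# The dataset's actual boundaries may differ.
BUCKETS = [(1, 10), (11, 20), (21, 50), (51, 100)]


def bucket_for(length):
    """Return the label of the bucket that contains a source length."""
    for low, high in BUCKETS:
        if low <= length <= high:
            return f"{low}-{high}"
    return f"{BUCKETS[-1][1] + 1}+"


def expansion_stats(pairs):
    """Aggregate target/source character ratios per source-length bucket."""
    ratios = defaultdict(list)
    for source, target in pairs:
        if not source:
            continue  # skip empty sources to avoid division by zero
        ratios[bucket_for(len(source))].append(len(target) / len(source))
    return {
        bucket: {
            "count": len(values),
            "mean": round(statistics.fmean(values), 3),
            "median": round(statistics.median(values), 3),
            "min": round(min(values), 3),
            "max": round(max(values), 3),
        }
        for bucket, values in ratios.items()
    }


# Toy English -> German pairs written for this example only.
SAMPLE_PAIRS = [
    ("Save", "Speichern"),
    ("Cancel", "Abbrechen"),
    ("Settings", "Einstellungen"),
    ("Your changes have been saved.", "Ihre Änderungen wurden gespeichert."),
]

stats = expansion_stats(SAMPLE_PAIRS)
print(json.dumps(stats, indent=2, ensure_ascii=False))
```

Only the aggregated dictionary is written out, so the same structure could be dumped to CSV or YAML without carrying any source sentences along.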
The project is designed as a reusable data product, not just a research artifact.
Publishing only aggregate statistics, not the underlying sentence pairs, keeps the package useful while avoiding raw text redistribution concerns.
Short strings expand differently from long strings, so bucketed statistics better match UI risk.
Runtime estimators can use small single-value maps, while design tooling can use richer percentile data.
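A runtime estimator built on a single-value map might look like the sketch below. The map layout, the bucket labels, the function names, and the ratio values are placeholders chosen for illustration; they are not the dataset's actual schema or measurements.

```python
import math

# Hypothetical single-value map: language pair -> bucket -> expansion ratio.
# The numbers are illustrative placeholders, not values from the dataset.
RATIO_MAP = {
    "en-de": {"1-10": 2.0, "11-20": 1.6, "21+": 1.3},
}


def bucket_for(length):
    """Map a source length in characters to a hypothetical bucket label."""
    if length <= 10:
        return "1-10"
    if length <= 20:
        return "11-20"
    return "21+"


def estimate_length(source, pair, ratio_map=RATIO_MAP, default_ratio=1.0):
    """Estimate the translated length, rounded up to whole characters.

    Falls back to default_ratio when the pair or bucket is missing.
    """
    ratio = ratio_map.get(pair, {}).get(bucket_for(len(source)), default_ratio)
    return math.ceil(len(source) * ratio)


print(estimate_length("Save", "en-de"))
print(estimate_length("Settings saved", "en-de"))
```

Keeping one number per bucket makes the lookup cheap enough for front-end code, while design tooling can load the fuller percentile records instead.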
The dataset supports practical i18n decisions before layout bugs appear.
Layout planning
Design systems can reserve space based on measured expansion behavior instead of intuition.
QA integration
Localization tools can flag strings likely to overflow before they reach manual review.
Risk profiles
Mean, median, percentile, and range values let teams choose conservative or compact layout strategies.
Reusable license
CC0 licensing makes the derived statistics easy to integrate into open source and commercial workflows.
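The QA integration and risk profile use cases above can be combined in a small overflow check. This is a minimal sketch under stated assumptions: the profile names, the `median` and `p95` keys, the character budgets, and the ratio values are all hypothetical, and a single statistics record is used for every string rather than one record per length bucket.

```python
# Hypothetical percentile record for one language pair.
# The ratios are illustrative placeholders, not values from the dataset.
STATS = {"median": 1.4, "p95": 2.2}

# Hypothetical mapping from a layout strategy to the statistic it relies on.
PROFILE_TO_STAT = {"compact": "median", "conservative": "p95"}

# (string id, source text, character budget of the UI slot)
UI_STRINGS = [
    ("save_button", "Save", 10),
    ("delete_account", "Delete account", 20),
]


def flag_overflow(strings, stats, profile):
    """Return ids of strings whose estimated translation exceeds its budget."""
    ratio = stats[PROFILE_TO_STAT[profile]]
    return [
        string_id
        for string_id, source, max_chars in strings
        if len(source) * ratio > max_chars
    ]


print(flag_overflow(UI_STRINGS, STATS, "compact"))
print(flag_overflow(UI_STRINGS, STATS, "conservative"))
```

A conservative profile flags more strings because it budgets for the upper tail of the distribution, whereas a compact profile accepts that some translations will need manual adjustment.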