CLDR Specification Conformance

The implementation fully meets the TR35-8 PersonNames specification, but its results differ from the CLDR test data for a few locales.

This document describes cases where the CLDR person name test data produces results that are inconsistent with the specification text in TR35-8 PersonNames. These are filed as potential specification or test data bugs.

1. Format Selection Tiebreaker Inconsistency (es_US, es_MX, es_419)

Affected locales

es_US, es_MX, es_419

Summary

When two namePatterns have equal numbers of populated fields, the spec says to select the pattern with the fewest unpopulated fields. The CLDR test data expects the pattern with MORE unpopulated fields to be selected in certain cases, contradicting the spec.

Specification text

From Choose a namePattern:

  1. Find the set of patterns with the most populated fields.
  2. If there is just one element in that set, use it.
  3. Otherwise, among that set, find the set of patterns with the fewest unpopulated fields.
  4. If there is just one element in that set, use it.
  5. Otherwise, take the pattern that is alphabetically least.

How to reproduce

Locale es_MX, format parameters: order: sorting, length: long, usage: referring, formality: informal.

The locale provides two sorting patterns for this combination:

Pattern A: {surname} {surname2}, {given-informal}    (3 fields)
Pattern B: {surname}, {given-informal}               (2 fields)

Input PersonName: given: "Käthe", surname: "Müller" (no surname2).

Scoring:

Pattern  Total fields  Populated  Unpopulated
A        3             2          1
B        2             2          0

Per the spec, both have 2 populated fields (tied at step 1). At step 3, Pattern B has fewer unpopulated fields (0 vs 1), so Pattern B should be selected.
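
The five selection steps can be sketched in a few lines of Python. This is an illustrative sketch, not the library's code: `field_counts` and `choose_pattern` are hypothetical names, and a field is treated as populated whenever its base name (the part before the first `-`) is present in the input, which ignores the finer modifier matching a real implementation needs.

```python
import re

FIELD = re.compile(r"\{([^}]+)\}")


def field_counts(pattern, person):
    """Return (populated, unpopulated) field counts for one pattern."""
    fields = FIELD.findall(pattern)
    # Assumption: only the base name before "-" decides whether a field is populated.
    populated = sum(1 for f in fields if person.get(f.split("-")[0]))
    return populated, len(fields) - populated


def choose_pattern(patterns, person):
    counts = {p: field_counts(p, person) for p in patterns}
    # Steps 1-2: keep the patterns with the most populated fields.
    most = max(populated for populated, _ in counts.values())
    candidates = [p for p in patterns if counts[p][0] == most]
    if len(candidates) == 1:
        return candidates[0]
    # Steps 3-4: among those, keep the patterns with the fewest unpopulated fields.
    fewest = min(counts[p][1] for p in candidates)
    candidates = [p for p in candidates if counts[p][1] == fewest]
    if len(candidates) == 1:
        return candidates[0]
    # Step 5: fall back to the alphabetically least pattern.
    return min(candidates)


PATTERN_A = "{surname} {surname2}, {given-informal}"
PATTERN_B = "{surname}, {given-informal}"
person = {"given": "Käthe", "surname": "Müller"}

chosen = choose_pattern([PATTERN_A, PATTERN_B], person)
print(chosen)
```

Under these assumptions the sketch selects Pattern B, which agrees with the step-by-step reading of the spec given above.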

Pattern B produces: "Müller, Käthe" (with comma separator).

Expected result (from CLDR test data)

"Müller Käthe" (no comma)

This is the result of Pattern A: the empty {surname2} field and the comma that follows it are removed by the empty field processing rules, leaving "Müller Käthe".

Actual result (per specification algorithm)

"Müller, Käthe" (with comma)

Pattern B is selected because it has fewer unpopulated fields. Its comma separator is between two populated fields and is preserved.

Analysis

The CLDR test data expects Pattern A to be selected, despite Pattern B having strictly fewer unpopulated fields. This suggests the ICU reference implementation uses a different format selection algorithm than what the specification describes.

Possible interpretations:

  • The ICU implementation may prefer document order (first pattern) over the "fewest unpopulated" criterion when populated counts are equal.

  • The ICU implementation may not implement step 3 at all, effectively using only step 1 (most populated) followed by document order or alphabetical tiebreaker.

  • The spec text may need to be updated to reflect the actual ICU behavior.

The same pattern appears across 28 test cases in es_US, es_MX, and es_419, all involving sorting-order formats where one pattern includes {surname2} and the other does not.

Test data references

  • es_MX.txt line 645: parameters; sorting; long; referring; informal → expected "Müller Käthe"
  • es_US.txt line 124: parameters; sorting; medium; referring; formal → expected "García Pérez, Lucía"
  • es_419.txt line 134: parameters; sorting; long; referring; informal → expected "Jacobsen Christopher"

2. Empty Field Removal Drops Grouping Punctuation (cs, sk — now fixed)

Affected locales

cs, sk (previously failing, now fixed in our implementation)

Summary

The spec's empty field removal rules, as literally described, would drop grouping punctuation (opening parentheses) when a nil field precedes a parenthetical group. The spec describes removing a single empty field and then coalescing adjacent literals, but the coalescing rules can lose structural characters in certain literal combinations.

Specification text

From Process a namePattern:

  1. If there are two or more empty fields separated only by literals, the fields and the literals between them are removed.
  2. If there is a single empty field, it is removed.
  3. If the processing from steps 1 and 2 results in two adjacent literals (call them A and B), they are coalesced into one literal as follows:
    • If either is empty the result is the other one.
    • If B matches the end of A, then the result is A.
    • Otherwise the result is A + B, further modified by replacing any sequence of two or more white space characters by the first whitespace character.

How to reproduce

Locale cs, format parameters: order: sorting, length: long, usage: referring, formality: formal.

The format pattern is:

{surname-core}, {given} {given2} {surname-prefix} ({title}, {credentials})

Input PersonName: title: "paní", given: "Alexandra", given2: "Zuzana", surname: "Machová", credentials: "Ph.D." (no surname-prefix).

After field interpolation, the relevant segment is:

..., " ", nil, " (", {:field, "paní"}, ...

Where nil is the empty {surname-prefix} field, " " is the literal before it, and " (" is the literal after it.

Expected result (from CLDR test data)

"Machová, Alexandra Zuzana (paní, Ph.D.)"

The opening parenthesis is preserved.

Result from literal spec application

Applying the spec literally to " ", nil, " (":

  1. Single empty field (nil) → remove it.
  2. Adjacent literals " " (A) and " (" (B) → coalesce.
  3. B (" (") does not match the end of A (" ").
  4. Result: A + B = " " + " (" = "  (" (two spaces) → whitespace dedup → " (".

The coalescing rules do produce the correct result (" (") in this specific case. However, an implementation that handles [literal, nil, literal] by dropping one of the literals, which suits the common separator case (such as dropping ", " when the field it accompanies is absent), inadvertently drops the " (" as well.

Analysis

The spec's coalescing rules technically handle this case correctly, but the interaction between "remove separator literals for absent fields" and "preserve grouping punctuation" requires careful implementation. A naive implementation that drops the literal following a nil (to handle common separator patterns) will fail for grouping punctuation.

Our implementation resolves this by detecting grouping punctuation characters ((, ), [, ]) in the following literal and coalescing instead of dropping in those cases.
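
The difference between the two strategies can be sketched as follows. This is a minimal Python illustration, not the library's code: `coalesce` follows the three coalescing rules quoted from the spec, while `naive_resolve` and `guarded_resolve` are hypothetical names for the "drop the following literal" shortcut and the grouping-punctuation guard described above.

```python
import re

GROUPING = set("()[]")


def coalesce(a, b):
    """Coalesce two adjacent literals A and B per the quoted spec rules."""
    if not a or not b:
        return a or b
    if a.endswith(b):
        return a
    # Replace each run of two or more whitespace characters by its first character.
    return re.sub(r"(\s)\s+", r"\1", a + b)


def naive_resolve(before, after):
    """Resolve [before, nil, after] by dropping the literal after the nil field."""
    # Works for separators such as ", " but loses an opening parenthesis.
    return before


def guarded_resolve(before, after):
    """Hypothetical guard: coalesce when the following literal has grouping punctuation."""
    if GROUPING & set(after):
        return coalesce(before, after)
    return naive_resolve(before, after)


print(repr(naive_resolve(" ", " (")))    # the parenthesis is lost
print(repr(guarded_resolve(" ", " (")))  # the parenthesis survives
```

For the cs example, the guarded version keeps " (" in front of the title, while the naive version leaves only the space.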

The spec could be improved by explicitly addressing how grouping punctuation should be preserved during empty field removal, or by providing test cases that exercise this interaction.

Test data references

  • cs.txt line 316: parameters; sorting; long; referring; formal → expected "Machová, Alexandra Zuzana (paní, Ph.D.)"
  • sk.txt line 307: parameters; sorting; long; referring; formal → expected "Machová, Alexandra Zuzana (paní, Ph.D.)"

3. Initial Derivation Requires UAX #29 Grapheme Clusters, Not Default Unicode Grapheme Clusters

Affected locales

kn (Kannada), km (Khmer), ml (Malayalam), si (Sinhala), my (Myanmar) — and potentially any locale using a complex Brahmic script.

Summary

The spec says initials are derived by taking the "first grapheme cluster" of each word, but does not specify which grapheme cluster algorithm to use. The CLDR test data expects results consistent with UAX #29 (Unicode Text Segmentation) grapheme cluster boundaries, which differ from the default Unicode grapheme cluster boundaries defined in UAX #44 (Unicode Character Database) and implemented by most standard library string functions.

The distinction matters for Brahmic scripts (Devanagari, Kannada, Khmer, Malayalam, Sinhala, Myanmar, etc.) where a virama (halant) character joins consonants into conjuncts. The two algorithms produce different first-grapheme results for conjunct consonants, and the CLDR test data is consistent only with the UAX #29 definition.

Specification text

From Derive Initials:

To derive an initial from a name field value, the first grapheme cluster is extracted.

The spec does not qualify which grapheme cluster definition to use.

The two grapheme cluster standards

Default (legacy) grapheme clusters are defined in UAX #44, Section 5.15. This is the algorithm used by most programming language standard libraries, including Erlang/OTP's string module (which underlies Elixir's String.first/1 and String.graphemes/1). In this algorithm, a virama (U+0CCD in Kannada, U+17D2 in Khmer, etc.) is a combining mark that joins with the preceding AND following characters into a single cluster.

Extended grapheme clusters are defined in UAX #29, Section 3.1. This algorithm includes script-specific tailoring rules via the Grapheme_Cluster_Break property. For Indic scripts, UAX #29 defines InCB (Indic Conjunct Break) properties that cause the virama to break the cluster at the conjunct boundary, producing shorter clusters.

How to reproduce

Kannada word ಕ್ಯಾಥಿ (transliterated "Kyāthi"), composed of the codepoints:

Codepoint  Character  Name
U+0C95     ಕ          KANNADA LETTER KA
U+0CCD     ್          KANNADA SIGN VIRAMA
U+0CAF     ಯ          KANNADA LETTER YA
U+0CBE     ಾ          KANNADA VOWEL SIGN AA
U+0CA5     ಥ          KANNADA LETTER THA
U+0CBF     ಿ          KANNADA VOWEL SIGN I
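
The code point breakdown can be reproduced with Python's standard `unicodedata` module. This snippet only lists code points and their names. It does not perform grapheme segmentation, for which the Python standard library has no function.

```python
import unicodedata

# The Kannada word from the table, written as escapes so each code point is explicit.
word = "\u0c95\u0ccd\u0caf\u0cbe\u0ca5\u0cbf"

rows = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]
for codepoint, name in rows:
    print(codepoint, name)

# The two candidate initials compared in this section, taken by code point count.
short_initial = word[:2]  # KA + VIRAMA
long_initial = word[:4]   # KA + VIRAMA + YA + AA
```

The two slices correspond to the 2-codepoint and 4-codepoint first clusters that the two segmentation behaviours produce for this word.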

Default grapheme clusters (Erlang/Elixir String.graphemes/1):

["ಕ್ಯಾ", "ಥಿ"]    2 clusters

The virama (U+0CCD) joins KA + YA + AA into one large conjunct cluster.

UAX #29 extended grapheme clusters (ICU, Unicode.String.split(break: :grapheme)):

["ಕ್", "ಯಾ", "ಥಿ"]    3 clusters

The virama only joins with the preceding KA, and YA starts a new cluster.

Expected result (from CLDR test data)

The initial for ಕ್ಯಾಥಿ is ಕ್ (KA + VIRAMA, 2 codepoints) — the first UAX #29 extended grapheme cluster.

Actual result using default grapheme clusters

The initial is ಕ್ಯಾ (KA + VIRAMA + YA + AA, 4 codepoints) — the first default grapheme cluster, which includes the full conjunct.

Further examples across scripts

Script     Word         Default first cluster     UAX #29 first cluster    CLDR expected
Kannada    ಕ್ಯಾಥಿ        ಕ್ಯಾ (4 codepoints)        ಕ್ (2 codepoints)         ಕ್
Khmer      ហ្សាសាស្តូ    ហ្ (2 codepoints)         ហ្សា (4 codepoints)       ហ្សា
Malayalam  സ്‌റ്റോബർ    സ്‌റ്റോ (7 codepoints)    സ്‌ (3 codepoints)        സ്‌

Note that Khmer is the inverse case: the default cluster is too SHORT and the UAX #29 cluster is LONGER. This is because Khmer's coeng (U+17D2) has different Grapheme_Cluster_Break properties than Indic viramas, and UAX #29 correctly groups the subscript consonant with its base.

Analysis

The spec should explicitly state that initial derivation uses UAX #29 extended grapheme clusters, not default grapheme clusters. This distinction is critical because:

  1. Standard library string functions may not produce these clusters. Erlang/OTP's string module, and therefore Elixir's String.first/1 and String.graphemes/1, returns the default clusters shown above. Libraries that specifically implement the UAX #29 rules reflected in the test data (such as ICU, Rust's unicode-segmentation crate, and the Elixir unicode_string package) produce the expected results.

  2. The difference is significant for Brahmic scripts. This family includes many widely used writing systems (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, and others). An implementation using default grapheme clusters can produce incorrect initials for names written in these scripts.

  3. The two algorithms diverge specifically at virama/halant boundaries. The virama character (or its equivalent: coeng in Khmer, asat in Myanmar, al-lakuna in Sinhala) is the point where the algorithms differ. UAX #29's InCB (Indic Conjunct Break) property tables define script-specific rules for how these characters interact with cluster boundaries.

Suggested spec clarification

The phrase "first grapheme cluster" in the initial derivation section should be amended to read:

To derive an initial from a name field value, the first extended grapheme cluster (as defined by UAX #29) is extracted.

Test data references

  • kn.txt line 522: initial for ಕ್ಯಾಥಿ → expected ಕ್.
  • km.txt line 444: initial for ហ្សាសាស្តូ → expected ហ្សា.
  • ml.txt line 440: initial for സ്‌റ്റോബർ → expected സ്‌.

4. Test Data / Locale Data Mismatch for yo_BJ (Yoruba-Benin)

Affected locales

yo_BJ

Summary

The yo_BJ test data expects initials (e.g., "O. Adeboye") for short/referring/formal formats, but the yo_BJ locale data contains no -initial modifier in those format patterns. The format pattern {given} {given2} {surname} {credentials} outputs full names, not initials. The parent locale yo has identical format patterns and its test data correctly expects full names for the same format combinations.

How to reproduce

Locale yo_BJ, format parameters: order: givenFirst, length: short, usage: referring, formality: formal.

The locale format pattern is:

{given} {given2} {surname} {credentials}

Input PersonName: given: "Olabisi", surname: "Adeboye", locale: yo_BJ.

Expected result (from CLDR test data)

"O. Adeboye" — initial + surname.

Actual result (per locale format data)

"Olabisi Adeboye" — full given name + surname, because the format pattern has no -initial modifier on the {given} field.

Comparison with parent locale yo

The yo locale has the identical format pattern for givenFirst/short/referring/formal:

{given} {given2} {surname} {credentials}

The yo test data expects "Olabisi Adeboye" (full name) for this same combination — and this passes correctly.

Analysis

The yo_BJ test data appears to have been generated against a different version of the yo_BJ locale data that included -initial modifiers in the short/formal format patterns (e.g., {given-initial} {given2-initial} {surname}). The current locale data does not include these modifiers, making the test expectations inconsistent with the format patterns.

This affects 27 test cases in yo_BJ.txt, all involving short/referring or medium/referring format combinations where initials are expected but not produced.
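
The mismatch can be illustrated with a deliberately simplified formatter. This is a sketch under stated assumptions, not the library's code: `format_name` is a hypothetical helper, an initial is taken to be the first character plus a period (a Latin-script shortcut in place of the locale's initial pattern), and the gaps left by empty fields are collapsed as whitespace instead of going through the spec's full empty-field processing.

```python
import re

FIELD = re.compile(r"\{([^}]+)\}")


def format_name(pattern, person):
    """Fill a pattern; fields with the "-initial" modifier become a letter plus "."."""

    def fill(match):
        base, _, modifier = match.group(1).partition("-")
        value = person.get(base, "")
        if value and modifier == "initial":
            # Assumption: Latin-script shortcut, first character plus a period.
            return value[0] + "."
        return value

    filled = FIELD.sub(fill, pattern)
    # Simplification: collapse the gaps left by empty fields.
    return " ".join(filled.split())


person = {"given": "Olabisi", "surname": "Adeboye"}

# Pattern found in the current yo_BJ locale data.
current = format_name("{given} {given2} {surname} {credentials}", person)
# Pattern the test data appears to assume (the example from the analysis).
with_initials = format_name("{given-initial} {given2-initial} {surname}", person)

print(current)        # Olabisi Adeboye
print(with_initials)  # O. Adeboye
```

Only the pattern with -initial modifiers yields the string the yo_BJ test data expects, which is consistent with the test data having been generated from different locale data.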

Test data references

  • yo_BJ.txt line 152: parameters; givenFirst; short; referring; formal → expected "O. Adeboye", actual "Olabisi Adeboye"
  • yo_BJ.txt line 252: parameters; sorting; short; referring; formal → expected "Akintola, A. A.", actual "Akintola, Adeolu Adegboyega"
  • Compare yo.txt line 143: parameters; givenFirst; short; referring; formal → expected "Olabisi Adeboye" (full name, matches format data)