Why is there no Intl.Locale.prototype.variants? #900

Open
nnmrts opened this issue Jun 18, 2024 · 15 comments
Labels
c: locale Component: locale identifiers s: discuss Status: TG2 must discuss to move forward

Comments

@nnmrts

nnmrts commented Jun 18, 2024

Why is there no Intl.Locale.prototype.variants? There are getters for language, region and script but I saw no information about the reason variants is missing or shouldn't be there as well.

@sffc
Contributor

sffc commented Jul 18, 2024

@zbraniecki Thoughts on this?

@sffc sffc added s: discuss Status: TG2 must discuss to move forward c: locale Component: locale identifiers labels Jul 18, 2024
@sffc sffc moved this to Other Issues in ECMA-402 Meeting Topics Aug 22, 2024
@sffc
Contributor

sffc commented Nov 25, 2024

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-11-25.md#why-is-there-no-intllocaleprototypevariants-900

There were questions about motivation (most use cases for variants are better served by a corresponding Unicode extension keyword), as well as the shape of this getter (does it return a list? is the list sorted? or does it return a string with multiple subtags?)

@sffc sffc added s: comment Status: more info is needed to move forward and removed s: discuss Status: TG2 must discuss to move forward labels Nov 25, 2024
@nnmrts
Author

nnmrts commented Nov 26, 2024

I see, thanks for having the discussion!

Since when is the variants part in CLDR "deprecated" though?

I sadly can't remember my exact use-case and should have included it in my original post, but I think it was about two things:

  • Identifying different German orthographies (variants 1901, 1996)
  • Identifying Pe̍h-ōe-jī orthography/romanization (variant pehoeji)

The latter got added to the IANA language subtag registry in March this year. I know that isn't CLDR, but I was under the impression that this file is the "source of truth" for registered language subtags used in CLDR and everything else.

I also don't see any kind of "deprecation" of variants here: https://www.unicode.org/reports/tr35

Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.

I don't know what

> The Japanese one from one to two, it’s complicated

is referencing, and I don't see the difficulty of parsing variants. They can only ever be alphanumeric strings of 5 to 8 characters (or 4 characters when the first one is a digit, like 1996), and they can only be followed by extensions and private-use subtags, so what's wrong with .split("-")? 😆

While we're at it, I don't see a reason why extensions and private-uses also aren't getters, but I guess that's a different story.
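
The .split("-") idea above could be sketched roughly as follows. This is only an illustration: variantsOf is a hypothetical helper name, it assumes the input is already a well-formed language tag, and it does not handle grandfathered or private-use-only tags. The variant pattern follows the BCP 47 grammar (5 to 8 alphanumerics, or 4 characters starting with a digit).

```javascript
// Hypothetical helper (not part of any spec): collect variant subtags by position.
// Assumes `tag` is a well-formed BCP 47 language tag.
const EXTLANG = /^[a-z]{3}$/i;
const SCRIPT = /^[a-z]{4}$/i;
const REGION = /^(?:[a-z]{2}|[0-9]{3})$/i;
const VARIANT = /^(?:[0-9][a-z0-9]{3}|[a-z0-9]{5,8})$/i;

function variantsOf(tag) {
  const subtags = tag.split("-");
  let i = 1; // subtags[0] is the primary language subtag

  // Skip up to three extended language subtags, then optional script and region.
  for (let n = 0; n < 3 && i < subtags.length && EXTLANG.test(subtags[i]); n++) i++;
  if (i < subtags.length && SCRIPT.test(subtags[i])) i++;
  if (i < subtags.length && REGION.test(subtags[i])) i++;

  // Everything up to the first non-variant subtag (for example an extension
  // singleton such as "u" or "x") is a variant. Supplied order is preserved.
  const variants = [];
  while (i < subtags.length && VARIANT.test(subtags[i])) {
    variants.push(subtags[i].toLowerCase());
    i++;
  }
  return variants;
}

console.log(variantsOf("sl-IT-rozaj-biske-1994")); // [ 'rozaj', 'biske', '1994' ]
console.log(variantsOf("de-1996-u-co-phonebk"));   // [ '1996' ]
```

Note that this sketch keeps the order in which the variants were written, which is exactly the property under discussion in this thread.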

@sffc
Contributor

sffc commented Nov 28, 2024

A little more context: I think people on the call were referring to variants as "legacy" or "deprecated" because of the following issues:

  1. Some variants are better represented as extension keywords.
  2. LDML says that variants are supposed to be in alphabetical order, which doesn't make sense with certain IANA subtags
    • Example: "sl-IT-rozaj-biske-1994" would be canonicalized to something like "sl-IT-1994-biske-rozaj" even though the IANA subtag registry says it should be "rozaj-biske"

In other words, the comments from the discussion were based on the point of view that variants are basically a grab bag of things that would be better expressed as more specific locale extensions.

Personally, I still think variants are motivated because they remain the standard way of expressing orthographies. Something like el-polyton is a good, modern language identifier that I don't believe has another representation with extension keywords.

> Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.

Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).

@sffc sffc added s: discuss Status: TG2 must discuss to move forward and removed s: comment Status: more info is needed to move forward labels Nov 28, 2024
@sffc sffc moved this from Previously Discussed to Other Issues in ECMA-402 Meeting Topics Nov 28, 2024
@gibson042
Contributor

> Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).

I don't think it would, because ECMAScript Set instances are deterministically ordered.
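
To illustrate the point, here is a small sketch (variable names are made up for the example). A Set iterates in insertion order, so two sets with the same members can still differ observably:

```javascript
// Same members, inserted in IANA-style order vs. UTS 35 alphabetical order.
const ianaOrder = new Set(["rozaj", "biske", "1994"]);
const unicodeOrder = new Set(["1994", "biske", "rozaj"]);

// Iteration follows insertion order, so the choice of ordering is still visible.
const ianaJoined = [...ianaOrder].join("-");
const unicodeJoined = [...unicodeOrder].join("-");

console.log(ianaJoined);    // rozaj-biske-1994
console.log(unicodeJoined); // 1994-biske-rozaj
```

So whoever constructs the Set still has to pick an order.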

@nnmrts
Author

nnmrts commented Dec 4, 2024

Yeah, I mainly suggested the Set because it follows the other rule of variants (or any "multi-tag"): uniqueness. But honestly, for most users it would probably be unexpected to get a Set here in comparison to the rest of JS.

@nnmrts
Author

nnmrts commented Dec 4, 2024

Regarding the ordering of variants: I don't really think the array or set needs to be ordered in any specific way other than "the same as supplied". Both CLDR and IANA, if I understand correctly, just define a recommended or canonical way to order them in the context of language subtags, not in the context of JavaScript arrays. And AFAIK implementers need to be able to understand any ordering.

One could even argue that it's more expected if the ordering is the same as the user specified it, even if it's "wrong".

So in general, the ordering, of all things, shouldn't be the blocker here.

@aphillips

BCP47 suggests that the order in the original language tag carries meaning, with earlier subtags subordinating later ones. Specifically, item 6 under section 4.1 (Choice of Language Tag) says:

       Use variant subtags sparingly and in the correct order.  Most
       variant subtags have one or more 'Prefix' fields in the registry
       that express the list of subtags with which they are appropriate.
       Variants SHOULD only be used with subtags that appear in one of
       these 'Prefix' fields.  If a variant lists a second variant in
       one of its 'Prefix' fields, the first variant SHOULD appear
       directly after the second variant in any language tag where both
       occur.  General purpose variants (those with no 'Prefix' fields
       at all) SHOULD appear after any other variant subtags.  Order any
       remaining variants by placing the most significant subtag first.
       If none of the subtags is more significant or no relationship can
       be determined, alphabetize the subtags.  Because variants are
       very specialized, using many of them together generally makes the
       tag so narrow as to override the additional precision gained.
       Putting the subtags into another order interferes with
       interoperability, as well as the overall interpretation of the
       tag.

This means that the order should be preserved when there are two or more (and presuming, for the moment, that the tag's author has paid attention to the details in the registry as well as the text just above).

Unicode/CLDR says some different things about the ordering. In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.

In either case, the original order affects tag matching using one of the fallback schemes, and so should probably be preserved by Intl.Locale against possible future need (a canonicalization operation can be applied separately).
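
As a rough illustration of why order matters for fallback, the sketch below builds the truncation chain used by RFC 4647 "Lookup" matching (drop the last subtag, and drop a single-character subtag together with it). The function name lookupFallbackChain is hypothetical, and the sketch ignores ranges, wildcards, and default values.

```javascript
// Hypothetical helper: progressive truncation in the style of RFC 4647 Lookup.
function lookupFallbackChain(tag) {
  const chain = [];
  let subtags = tag.split("-");
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags = subtags.slice(0, -1);
    // A trailing single-character subtag (extension or private-use singleton)
    // is removed together with the subtag that followed it.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags = subtags.slice(0, -1);
    }
  }
  return chain;
}

console.log(lookupFallbackChain("sl-IT-rozaj-biske-1994"));
// [ 'sl-IT-rozaj-biske-1994', 'sl-IT-rozaj-biske', 'sl-IT-rozaj', 'sl-IT', 'sl' ]
console.log(lookupFallbackChain("sl-IT-1994-biske-rozaj"));
// [ 'sl-IT-1994-biske-rozaj', 'sl-IT-1994-biske', 'sl-IT-1994', 'sl-IT', 'sl' ]
```

With the registry order, the chain passes through sl-IT-rozaj (Resian in general); with the alphabetized order it passes through sl-IT-1994 instead, so the two tags fall back to different intermediate candidates.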

@sffc
Contributor

sffc commented Dec 4, 2024

Unfortunately I believe the ordering is one of the main issues that needs to be resolved. We have two specs, IETF and UTS 35, which disagree on the ordering (preserved or alphabetical). ECMA-402 mostly follows Unicode's reckoning of locale identifiers, so it would follow that variants should be alphabetical. However, variants are most useful in IETF's reckoning, where order is preserved.

What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?

@nnmrts
Author

nnmrts commented Dec 5, 2024

> What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?

The current reality in Chrome at least is this:

(new Intl.Locale("de-bcdefg-abcdefg-12345-1000000-1996")).toString() === "de-1000000-12345-1996-abcdefg-bcdefg"
(new Intl.Locale("sl-IT-rozaj-biske-1994")).toString() === "sl-IT-1994-biske-rozaj"

So it's basically just alphabetic with no special numeric handling ("1000000" ≺ "12345" ∧ "12345" ≺ "abcdefg" ∧ "abcdefg" ≺ "bcdefg").
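
For comparison, a plain Array.prototype.sort() with no comparator (which compares strings by UTF-16 code units) gives the same order for these subtags. This is only a sketch of the ordering rule, not of how any engine implements it:

```javascript
// Variant subtags in the order they were supplied in the first example.
const supplied = ["bcdefg", "abcdefg", "12345", "1000000", "1996"];

// Default sort: lexicographic by code unit, so digits sort before letters
// and "1000000" sorts before "12345" (no numeric comparison).
const sorted = [...supplied].sort();

console.log(sorted.join("-")); // 1000000-12345-1996-abcdefg-bcdefg
```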

I couldn't find anything about ordering here or anywhere else in ECMA402, so I guess Intl.Locale.prototype.toString does not (yet) define any ordering? Sorry if I have overlooked something.


Another resource that says basically the same as that BCP47 section and what @sffc has said: https://www.w3.org/International/questions/qa-choosing-language-tags#variants

Both that BCP47 section and that W3C page claim that the ordering of variants helps with interoperability, but neither gets more specific, so I'm really unsure whether this is actually the case. Is there any application out there that would completely break down if I gave it sl-IT-1994-biske-rozaj instead of sl-IT-rozaj-biske-1994?

Either way, I get that the ordering is important, within language tag strings. But this would be addressed by fixing Intl.Locale.prototype.toString, no? I personally still don't see how this is related to what Intl.Locale.prototype.variants in JavaScript should look like, since that would be ideally an array or set or whatever that one can then use to loop over or check for specific variants. The hierarchical nature of some variants doesn't and shouldn't, in my opinion, mean that any specific ordering is expected by the user in a JavaScript context.

const describeBookLanguage = (bookName, locale) => {
  const prefix = `${bookName} is written in`;

  let languageLabel = "Sanskrit";

  if (locale.language === "sa" && locale.variants.length !== 0) {

    // don't care about the order here
    if (locale.variants.includes("itihasa")) {
      languageLabel = `Epic ${languageLabel}`;
    }

    if (locale.variants.includes("bauddha")) {
      languageLabel = `Buddhist Hybrid ${languageLabel}`;

      // "Buddhist Hybrid Epic Sanskrit" is technically possible here but probably not real
    }
  }
  else if (locale.language === "cls") {
    languageLabel = `Classical ${languageLabel}`;
  }
  else if (locale.language === "vsn") {
    languageLabel = `Vedic ${languageLabel}`;
  }

  return `${prefix} ${languageLabel}.`;
}

Or is it expected that something like the below should also work?

const firstPart = "sl-IT";
const secondPart = "rozaj-biske-1994";

const localeString1 = `${firstPart}-${secondPart}`; // "sl-IT-rozaj-biske-1994";

const locale = new Intl.Locale(localeString1);

const localeString2 = locale.toString(); // "sl-IT-1994-biske-rozaj";

const variants = locale.variants; // ["1994", "biske", "rozaj"]

const localeString3 = `${firstPart}-${variants.join("-")}`; // "sl-IT-1994-biske-rozaj"

const allSame = localeString1 === localeString2 && localeString2 === localeString3; // false, but should this be true?

The below would also be an issue if one expects a specific ordering of variants, but again, I don't think that expectation exists.

const slovenianVariantDescriptionParts = new Map([
  ["rozaj", "Resian"],
  ["biske", ", San Giorgio dialect"],
  ["lipaw", ", Lipovaz dialect"],
  ["njiva", ", Gniva dialect"],
  ["osojs", ", Oseacco dialect"],
  ["solba", ", Stolvizza dialect"],
  ["1994", ", in standardized 1994 orthography"]
]);

const describeSlovenianLanguageUsed = (locale) => {
  if (locale.language !== "sl") {
    throw new Error("Not Slovenian");
  }

  if (locale.variants.length === 0 || !locale.variants.includes("rozaj")) {
    return "Slovenian";
  }

  return locale.variants
    .map(variant => slovenianVariantDescriptionParts.get(variant))
    .join("");

  // depending on the order of variants, this could result in:
  // - "Resian, San Giorgio dialect, in standardized 1994 orthography" ✅
  // - ", Gniva dialect, in standardized 1994 orthographyResian" ❌
  // - ", in standardized 1994 orthographyResian" ❌
  // - ", Stolvizza dialectResian" ❌
};

I also still don't agree with this sentiment:

> In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.

Again, maybe I'm misunderstanding something, but en-basiceng, de-1901, sgn-ase-blasl, sa-itihasa and el-polyton all seem like valid and not too niche uses of variants. And even if these are considered niche or "specific", I don't think the goal of i18n/l10n is to only consider commonly used and general cases. 😜

I'm not saying this is the most important thing ever, but I also wouldn't disregard variants as something "deprecated" or "only useful in very specific applications, most of which have nothing to do with locales".


So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it, or it absolutely needs to be specified, I think following suit with getTimeZones and getCollations would be fine, which would simply be alphabetic order (or "lexicographic code unit order" to be precise).

@nnmrts
Author

nnmrts commented Dec 5, 2024

For reference, how the ordering issue has been "solved" in a past issue: #330 (comment)

@nnmrts
Author

nnmrts commented Dec 5, 2024

Sorry for the triple comment, but here is a quote from UTS 35, which ECMA402 follows, as far as I understand now:

NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that lead to that decision:

  • The ordering in [BCP47] is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
  • Moreover, Section 4.5 states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
  • The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
  • Robust implementations will accept the variants in any order, just as they accept extensions in any order.
  • A canonical order allows for determination of identity of identifiers via string comparison.
  • The ordering in [BCP47] does not result in a determinant order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
  • Pure alphabetical order is determinant and simple to implement while the ordering in [BCP47] is indeterminant, more complex, and provides no significant benefit in modern applications.

@gibson042
Contributor

> So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it

For the record, it is defined:

  1. The Intl.Locale constructor returns a locale object whose [[Locale]] slot is set to the [[locale]] field of the Record returned from MakeLocaleRecord(tag, opt, localeExtensionKeys).
    1. MakeLocaleRecord returns a result Record whose [[locale]] field is the value returned from either CanonicalizeUnicodeLocaleId(locale) or InsertUnicodeExtensionAndCanonicalize(locale, attributes, keywords) (the latter of which always returns the result of an internal use of the former).
    2. CanonicalizeUnicodeLocaleId returns a String that starts with "the String value resulting from performing the algorithm to transform locale to canonical form per Unicode Technical Standard #35 Part 1 Core, Annex C LocaleId Canonicalization".
    3. UTS 35, as noted above, starts with a Canonicalizing Syntax step that includes "Put any variants into alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)".
  2. Intl.Locale.prototype.toString just returns the [canonicalized] contents of its receiver's [[Locale]] slot.
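
Given that chain, a userland stand-in for a variants accessor can read the subtags off the already-canonicalized base name. The sketch below is an assumption-laden illustration: canonicalVariants is a hypothetical name, and it relies only on the existing language, script, region, and baseName getters.

```javascript
// Hypothetical helper: derive variant subtags from Intl.Locale's canonical baseName.
function canonicalVariants(tag) {
  const locale = new Intl.Locale(tag);

  // baseName is language[-script][-region][-variant]*, so count how many
  // leading subtags are accounted for by the existing getters.
  const leading = [locale.language, locale.script, locale.region]
    .filter(Boolean).length;

  // Whatever remains after those subtags are the variants, in whatever
  // order the engine's canonicalization produced.
  return locale.baseName.split("-").slice(leading);
}

console.log(canonicalVariants("sl-IT-rozaj-biske-1994"));
console.log(canonicalVariants("en-US")); // []
```

In an engine that follows the steps listed above, the first call should come back in alphabetical order, matching what toString returns.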

@sffc
Contributor

sffc commented Dec 5, 2024

Ok, so .variants returning an array with the variants in UTS 35 order would be most consistent with the rest of the spec, and I think functions such as describeSlovenianLanguageUsed could reconstruct what they need from the variants, even if they are in UTS 35 order. Is that accurate?
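
One way such a reconstruction could look, as a sketch only: instead of mapping over the variants in the order received, iterate a lookup table in the desired display order and keep the entries that are present. describeResian and resianParts are made-up names, and the table reuses the description strings from the earlier example.

```javascript
// Display order is defined by the table (a Map iterates in insertion order),
// not by the order of the incoming variants.
const resianParts = new Map([
  ["rozaj", "Resian"],
  ["biske", ", San Giorgio dialect"],
  ["lipaw", ", Lipovaz dialect"],
  ["njiva", ", Gniva dialect"],
  ["osojs", ", Oseacco dialect"],
  ["solba", ", Stolvizza dialect"],
  ["1994", ", in standardized 1994 orthography"],
]);

// Hypothetical helper: `variants` is any iterable of variant subtags.
function describeResian(variants) {
  const present = new Set(variants);
  if (!present.has("rozaj")) {
    return "Slovenian";
  }
  return [...resianParts]
    .filter(([variant]) => present.has(variant))
    .map(([, text]) => text)
    .join("");
}

console.log(describeResian(["1994", "biske", "rozaj"]));
// Resian, San Giorgio dialect, in standardized 1994 orthography
```

The same string results whether the input is in UTS 35 order or in registry order.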

@nnmrts
Author

nnmrts commented Dec 6, 2024

@sffc That seems logical to me, with the only addition that it might need to be .getVariants instead.

Projects
Status: Other Issues
Development

No branches or pull requests

4 participants