Consider relaxing locale resolution for `Intl.Segmenter` #895

jedel1043 · 2024-06-01T01:09:25Z

Alternative name: Make %Intl.Segmenter%.[[AvailableLocales]] be the full set of all syntactically valid locales (with some caveats like canonicalization/extensions)

Rationale

While most Intl services do require a locale to behave correctly at runtime, the Segmenter service is in the odd position of supporting almost any locale you throw at it; the provided locale is only used as a hint for segmenting certain special cases.

This apparently makes it difficult for libraries such as ICU4X to determine whether a locale is in their [[AvailableLocales]] list. In ICU4X's case, only a handful of locales ("km", "lo", "my", "th") are "supported" in the sense that the data provider loads some amount of data for them. The rest of the locales are still very much "supported"; they just don't load locale-specific data at runtime. (asking for @sffc's help to add more context about this)

What then? Well, if virtually all locales are "supported" by Segmenter, why not just consider all (see alternative name) syntactically valid locales as supported locales for that service? This would mean making APIs such as Intl.Segmenter.supportedLocalesOf always return everything, which doesn't sound too bad for a service that is basically a low level text processing utility.
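
To make the proposal concrete, here is a minimal sketch of what a relaxed lookup could amount to. `relaxedSupportedLocalesOf` is a hypothetical helper, not an existing API, and it assumes the only checks left are syntactic validity and canonicalization, both of which `Intl.getCanonicalLocales` already performs.

```javascript
// Hypothetical helper: models the proposed behaviour, not what engines do today.
// Every syntactically valid tag is treated as supported. The only processing
// left is canonicalization (case fixes, de-duplication); malformed tags still
// throw a RangeError.
function relaxedSupportedLocalesOf(locales) {
  return Intl.getCanonicalLocales(locales);
}

console.log(relaxedSupportedLocalesOf(["TLH", "yue", "en-us"]));
// → [ 'tlh', 'yue', 'en-US' ]
```

Under this reading, the result no longer says anything about which data an implementation ships; it only filters out tags that are not well-formed.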

anba · 2024-06-01T08:00:59Z

Implementations return all locales supported by ICU4C, which seems like a reasonable thing to do, because there's at least some guarantee that segmentation works for these locales. Returning everything could give the false impression that any locale works here, including locales like Klingon (tlh), Egyptian (egy), Akkadian (akk), etc.

eemeli · 2024-06-01T08:15:39Z

One option would be to return an explicit und for locales that are supported, but for which no additional data is needed.
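
A rough sketch of one reading of this option, where the resolved locale falls back to "und" when no tailoring data is involved. The helper name `resolveSegmenterLocale` and the hard-coded language set (borrowed from the ICU4X example earlier in this thread) are illustrative assumptions, not part of any spec or implementation.

```javascript
// Assumption: languages for which an implementation loads segmentation data.
// The list comes from the ICU4X example in this thread and is illustrative only.
const DATA_BACKED_LANGUAGES = new Set(["km", "lo", "my", "th"]);

// Hypothetical resolution step: keep the language subtag when it has data,
// otherwise report "und" to signal "supported, algorithmic behaviour only".
function resolveSegmenterLocale(tag) {
  const { language } = new Intl.Locale(tag); // throws RangeError on malformed tags
  return DATA_BACKED_LANGUAGES.has(language) ? language : "und";
}

console.log(resolveSegmenterLocale("th-TH")); // "th"
console.log(resolveSegmenterLocale("en-US")); // "und"
```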

sffc · 2024-06-01T08:15:52Z

Text processing utilities, including Segmenter and Collator, work based on scripts and properties more than locales. It doesn't make a whole lot of sense to ask a Segmenter or a Collator "what locales do you support", because they support all locales written in scripts that are encoded in Unicode.

It's a known issue that Segmenter favors majority languages in scripts over minority languages written in the same script (such as Cantonese (yue)). However, CLDR has data for yue in other services, and both Firefox and Safari return that yue is supported in Segmenter, even though it is not really supported that well. (Chrome does not ship with yue.)

> Intl.DateTimeFormat.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

> Intl.Segmenter.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

It's not entirely clear to me why each component has its own list, especially since, as @anba notes, in practice they all just return the list of locales in ICU, even if they don't make sense for a particular component. If we were designing this from scratch, I feel like better behavior would be a single Intl.supportedLocalesOf and leave it at that.
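
For anyone who wants to check this in their own engine, a small sketch that asks several services about the same candidate tags. The candidate list is arbitrary, and the output depends on the engine and its bundled locale data, so no particular result is assumed.

```javascript
// Query several Intl services with the same candidate tags and collect what
// each one reports as supported. Results vary by engine and ICU data.
const candidates = ["yue", "zh", "tlh"];
const services = [Intl.Collator, Intl.DateTimeFormat, Intl.NumberFormat, Intl.Segmenter];

const report = Object.fromEntries(
  services.map((service) => [service.name, service.supportedLocalesOf(candidates)])
);

console.log(report);
```

If the per-service lists really all mirror the same underlying locale list, every entry in `report` should be identical.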

sffc · 2024-06-01T08:25:05Z

Additional context: unicode-org/icu4x#3284

The CLDR design group agreed earlier this year that type: "grapheme" segmenters should not take a locale parameter at all; they are purely algorithmic based on Unicode properties. The other types of segmenters may use the locale hint to tailor behavior, but it is only a hint, and the fallback is always algorithmic. This is very different from other components such as DateTimeFormat, which has an actual failure mode of falling back to the system locale if it can't find data in the requested locale.
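
As a small illustration of the grapheme case, the sketch below segments the same string under a few arbitrarily chosen locale tags. The helper `graphemes` is just for this example; the point is that the cluster boundaries come from Unicode properties rather than from the locale hint.

```javascript
// "e" followed by U+0301 (combining acute accent) forms a single grapheme
// cluster, and "a" forms a second one.
const text = "e\u0301a";

// Example helper: return the grapheme clusters of `input` as an array of strings.
function graphemes(locale, input) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(input), (s) => s.segment);
}

// The locale argument is only a hint here; these tags are arbitrary examples.
for (const locale of ["en", "th", "und"]) {
  console.log(locale, graphemes(locale, text).length);
}
```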

sffc · 2024-10-25T02:06:24Z

2024-10-24 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-10-24.md#consider-relaxing-locale-resolution-for-intlsegmenter-895

We established a use case, which should help guide implementations.

sffc · 2024-12-16T22:13:24Z

CLDR issue with some more notes: https://unicode-org.atlassian.net/browse/CLDR-18187

sffc added the s: discuss (Status: TG2 must discuss to move forward) and c: meta (Component: intl-wide issues) labels Jun 1, 2024

sffc added this to ECMA-402 Meeting Topics Aug 22, 2024

sffc moved this to Priority Issues in ECMA-402 Meeting Topics Aug 22, 2024

sffc added the s: blocked (Status: the issue is blocked on upstream) label and removed the s: discuss (Status: TG2 must discuss to move forward) label Dec 16, 2024

sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Dec 16, 2024

sffc self-assigned this Dec 16, 2024

sffc added this to the ES 2025 milestone Dec 16, 2024
