Consider relaxing locale resolution for Intl.Segmenter #895

Open
jedel1043 opened this issue Jun 1, 2024 · 6 comments
Labels: c: meta (Component: intl-wide issues); s: blocked (Status: the issue is blocked on upstream)

jedel1043 commented Jun 1, 2024

Alternative name: Make %Intl.Segmenter%.[[AvailableLocales]] be the full set of all syntactically valid locales (with some caveats like canonicalization/extensions)

Rationale

While most Intl services do require a locale for correct behaviour at runtime, the Segmenter service is in a weird position: it supports almost any locale you throw at it, and the provided locale is just used as a suggestion for segmenting certain special cases.

This apparently makes it difficult for libraries such as ICU4X to determine whether a locale is in their [[AvailableLocales]] list. In ICU4X, only a few locales ("km", "lo", "my", "th") are "supported" in the sense that the data provider loads some data for them. The rest of the locales are still very much "supported"; they just don't load locale-specific data at runtime. (Asking for @sffc's help to add more context about this.)

What then? Well, if virtually all locales are "supported" by Segmenter, why not just consider all syntactically valid locales (see alternative name) as supported locales for that service? This would mean that APIs such as Intl.Segmenter.supportedLocalesOf always return every locale they are given, which doesn't sound too bad for a service that is basically a low-level text processing utility.
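
To make the proposal concrete, here is a minimal sketch of what the relaxed behaviour could look like from user code. The helper name `relaxedSupportedLocalesOf` is hypothetical and not part of ECMA-402; it only assumes the existing `Intl.getCanonicalLocales` function, and it ignores the canonicalization/extension caveats mentioned in the alternative name.

```javascript
// Hypothetical helper, not part of ECMA-402: models a Segmenter whose
// [[AvailableLocales]] is "every syntactically valid locale".
function relaxedSupportedLocalesOf(locales) {
  // Intl.getCanonicalLocales throws a RangeError for malformed tags,
  // canonicalizes casing, and drops duplicates.
  return Intl.getCanonicalLocales(locales);
}

// Every well-formed tag comes back, including ones with no locale data.
console.log(relaxedSupportedLocalesOf(["tlh", "EN-us", "en-US"]));
```

Under this model the only way to be "unsupported" is to be malformed, in which case the call throws instead of silently filtering the tag out.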

sffc added the s: discuss (Status: TG2 must discuss to move forward) and c: meta (Component: intl-wide issues) labels on Jun 1, 2024

anba (Contributor) commented Jun 1, 2024

Implementations return all locales supported by ICU4C, which seems like a reasonable thing to do, because there's at least some guarantee that segmentation works for these locales. Returning everything could give the false impression that any locale works here, including locales like Klingon (tlh), Egyptian (egy), Akkadian (akk), etc.

eemeli (Member) commented Jun 1, 2024

One option would be to return an explicit und for locales that are supported, but for which no additional data is needed.
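
One way to read this suggestion, sketched below under stated assumptions: the resolver reports the requested language only when locale-specific data exists, and reports `und` otherwise. The function `resolveSegmenterLocale` and the `LOCALES_WITH_DATA` set are hypothetical; the set simply reuses the four locales mentioned earlier in this thread, and no engine is known to behave this way.

```javascript
// Assumption: the four locales this thread says ICU4X loads data for.
const LOCALES_WITH_DATA = new Set(["km", "lo", "my", "th"]);

// Hypothetical resolver, not an existing Intl API.
function resolveSegmenterLocale(requested) {
  // Throws a RangeError on malformed tags.
  const [tag] = Intl.getCanonicalLocales(requested);
  const { language } = new Intl.Locale(tag);
  // Supported either way; "und" signals that no tailoring data was needed.
  return LOCALES_WITH_DATA.has(language) ? language : "und";
}

console.log(resolveSegmenterLocale("th-TH"));
console.log(resolveSegmenterLocale("en-US"));
```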

sffc (Contributor) commented Jun 1, 2024

Text processing utilities, including Segmenter and Collator, work based on scripts and properties more than locales. It doesn't make a whole lot of sense to ask a Segmenter or a Collator "what locales do you support", because they support all locales written in scripts that are encoded in Unicode.

It's a known issue that Segmenter favors majority languages in a script over minority languages written in the same script (such as Cantonese (yue)). However, CLDR has data for yue in other services, and both Firefox and Safari report yue as supported in Segmenter, even though it is not really supported that well. (Chrome does not ship with yue.)

> Intl.DateTimeFormat.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

> Intl.Segmenter.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]

It's not entirely clear to me why each component has its own list, especially since, as @anba notes, in practice they all just return the list of locales in ICU, even if they don't make sense for a particular component. If we were designing this from scratch, I feel like better behavior would be a single Intl.supportedLocalesOf and leave it at that.
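
For anyone who wants to check how the per-component lists compare on their own engine, a small sketch follows. The helper `supportedAcrossServices` is hypothetical; it only calls the existing `supportedLocalesOf` static method on a handful of services. Results differ between engines and builds, so no expected output is shown.

```javascript
// Hypothetical helper for comparing per-service answers side by side.
function supportedAcrossServices(locales) {
  const services = {
    Collator: Intl.Collator,
    DateTimeFormat: Intl.DateTimeFormat,
    NumberFormat: Intl.NumberFormat,
    Segmenter: Intl.Segmenter,
  };
  const report = {};
  for (const [name, service] of Object.entries(services)) {
    // Each service answers from its own [[AvailableLocales]] list.
    report[name] = service.supportedLocalesOf(locales);
  }
  return report;
}

console.log(supportedAcrossServices(["yue", "zh"]));
```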

sffc (Contributor) commented Jun 1, 2024

Additional context: unicode-org/icu4x#3284

The CLDR design group agreed earlier this year that type: "grapheme" segmenters should not take a locale parameter at all; they are purely algorithmic based on Unicode properties. The other types of segmenters may use the locale hint to tailor behavior, but it is only a hint, and the fallback is always algorithmic. This is very different from other components such as DateTimeFormat, which has an actual failure mode of falling back to the system locale if it can't find data in the requested locale.
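
A quick way to see the "locale is only a hint" point for grapheme segmentation is to segment the same string under two different locales and compare. This is a sketch, not a conformance test: the `graphemes` helper and the sample string are illustrative choices, and note that the `Intl.Segmenter` option is spelled `granularity` rather than `type`.

```javascript
// Split text into grapheme clusters using the given locale.
function graphemes(locale, text) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// "e" + combining acute accent, a two-code-point flag emoji, then "a".
const sample = "e\u0301\u{1F1EF}\u{1F1F5}a";
const inEnglish = graphemes("en", sample);
const inThai = graphemes("th", sample);

// Same clusters regardless of the locale passed in.
console.log(inEnglish.length, inThai.length);
console.log(inEnglish.join("|") === inThai.join("|"));
```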

sffc moved this to Priority Issues in ECMA-402 Meeting Topics on Aug 22, 2024

sffc (Contributor) commented Oct 25, 2024

2024-10-24 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-10-24.md#consider-relaxing-locale-resolution-for-intlsegmenter-895

We established a use case, which should help guide implementations.

sffc (Contributor) commented Dec 16, 2024

CLDR issue with some more notes: https://unicode-org.atlassian.net/browse/CLDR-18187

sffc added the s: blocked label (Status: the issue is blocked on upstream) and removed the s: discuss label (Status: TG2 must discuss to move forward) on Dec 16, 2024
sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics on Dec 16, 2024
sffc self-assigned this on Dec 16, 2024
sffc added this to the ES 2025 milestone on Dec 16, 2024