Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Collating ranges are broken #5204

Open
muellerj2 opened this issue Dec 23, 2024 · 1 comment
Open

<regex>: Collating ranges are broken #5204

muellerj2 opened this issue Dec 23, 2024 · 1 comment

Comments

@muellerj2
Copy link
Contributor

muellerj2 commented Dec 23, 2024

When the regex_constants::collate flag is set, character ranges fail to match characters according to the locale's collation order.

Test case

#include <iostream>
#include <locale>
#include <regex>

using namespace std;

int main() {
    const locale de("de_DE");

    regex_traits<wchar_t> traits;
    traits.imbue(de);
    wregex re(L"[a-z]", regex_constants::collate);
    re.imbue(de);

    const wchar_t eszett = L'ß';
    const wchar_t s = L's';
    const wchar_t t = L't';
    const wchar_t ss[] = {L's', L's'};
    const wchar_t minus = L'-';
    const auto stransform = traits.transform(&s, &s + 1);
    const auto ttransform = traits.transform(&t, &t + 1);
    const auto eszetttransform = traits.transform(&eszett, &eszett + 1);
    const auto sstransform = traits.transform(begin(ss), end(ss));

    cout << regex_match(L"ß", re) << '\n';

    cout << (stransform <= eszetttransform
                && eszetttransform <= ttransform)
            << ' '
            << (sstransform == eszetttransform)
            << '\n';

    return 0;
}

Godbolt link

Expected behavior

The range should match ß, since ß collates like ss and should be sorted between s and t (and thus between a and z as well). Hence, the output of this program should be:

1
1 1

Actual behavior

The range does not match ß, so the actual output is:

0
1 1

Note that the output of the second line shows that the problem does not lie in the implementation of regex_traits<wchar_t>::transform, as it returns sort keys that are equal for ß and ss and place ß between s and t.

@muellerj2
Copy link
Contributor Author

Related to this, I was initially concerned about the implementation of regex_traits<_Elem>::translate():

STL/stl/inc/regex

Lines 295 to 298 in 9082000

_Elem translate(_Elem _Ch) const { // provide locale-sensitive mapping
string_type _Res = _Getcoll()->transform(_STD addressof(_Ch), _STD addressof(_Ch) + 1);
return _Res.size() == 1 ? _Res[0] : _Ch;
}

This replaces the character by the collation order sort key when the sort key has length 1 and otherwise returns the character itself.

But calling regex_traits<_Elem>::transform() on a sort key does not return the same sort key again, so I was worried that range matching would get broken even more in collate mode if the result of regex_traits<_Elem>::translate() were to be passed to regex_traits<_Elem>::transform() as the standard mandates. (This would even be a problem if we fixed translate() because there is no guarantee the fixed version would be picked up in a mix-and-match situation.)

But after some experimentation, it appears that LCMapStringA/W() always return sort keys with length greater 1. Hence, _Strxfrm()/_Wcsxfrm(), collate<_Elem>::transform() and regex_traits<_Elem>::transform() also return sort keys with length greater 1, except when the C locale is used, as then the input string (composed of argument _Ch only) is just returned as the sort key by_Strxfrm()/_Wcsxfrm().

So it seems that this function effectively just does return _Ch in a very complicated and expensive way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant