Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow decoding of JSON messages in non-standard character sets #2093

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

codehead
Copy link

Summary

This PR modifies Mojo::Message to allow JSON decoding when a charset is specified. Currently all messages are decoded as UTF-8, so JSON messages with other encodings may fail silently.

Motivation

Mojo::UserAgent always tries to decode messages in UTF-8 ignoring the message character set. When a message does not meet the UTF-8 specification, Mojo::JSON::json_decode() fails silently and Mojo::UserAgent::json() returns undef. In character sets with a large overlap with UTF-8 such as ISO-8859-1, message decoding fails only when accented characters are present, so JSON messages migh seem empty at random.
This is critical to interface with legacy systems that expose JSON messages in charsets different from UTF-8.

References

No public issues or PRs for Mojolicious as far as I'm aware.
Other Perl frameworks allow non-UTF-8 charsets for JSON messages from very early on -- e.g. Catalyst::View::JSON.
Some Java frameworks defaulted to ISO-8859-1 encoding as recently as 2020: Content Type being append with charset=ISO-8859-1 #1428.
Even though RFC4627 tried to stardardize UTF-8 encoding for JSON messages as far back as 2006 (at around the same time some frameworks implemented charset specification to allow non-ascii characters in JSON) RFC8259's language leaves enough slack for systems that are part of a closed ecosystem.

@@ -132,7 +132,8 @@ sub is_limit_exceeded { !!shift->{limit} }
sub json {
my ($self, $pointer) = @_;
return undef if $self->content->is_multipart;
my $data = $self->{json} //= j($self->body);
my $charset = $self->body_params->charset || $self->default_charset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the performance cost of this is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, $self->content->charset is the way to go.

@@ -132,7 +132,8 @@ sub is_limit_exceeded { !!shift->{limit} }
sub json {
my ($self, $pointer) = @_;
return undef if $self->content->is_multipart;
my $data = $self->{json} //= j($self->body);
my $charset = $self->body_params->charset || $self->default_charset;
my $data = $self->{json} //= j($charset eq 'UTF-8' ? $self->body : encode('UTF-8', decode($charset, $self->body)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

j(encode(decode())) seems super inefficient.

Copy link
Author

@codehead codehead Aug 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Tried this before:

my $data = $charset eq 'UTF-8' ? j($self->body) : from_json(decode($charset,$self->body));

which seems more natural but breaks the last test in t/mojo/response.t subtest "Parse response and extract JSON data"

t/mojo/response.t .......................... 1/?
    #   Failed test 'right result'
    #   at t/mojo/response.t line 1075.
    #     Structures begin differing at:
    #          $got->[1] = '2'
    #     $expected->[1] = '4'
    # Looks like you failed 1 test of 9.

#   Failed test 'Parse response and extract JSON data'
#   at t/mojo/response.t line 1076.
# Looks like you failed 1 test of 56.

i.e. this one:

  is_deeply $res->json('/baz'), [1, 2, 3], 'right result';
  $res->json->{baz}[1] = 4;
  is_deeply $res->json('/baz'), [1, 4, 3], 'right result'; # <-- This one here

j(encode(decode())) is awkward but avoids that subtle breakage.

I might be missing something, so all suggestions are welcome.

@kraih kraih requested review from a team, marcusramberg, kraih, jberger and Grinnz August 13, 2023 12:02
@jberger
Copy link
Member

jberger commented Aug 13, 2023

I'm out of town and reading on my phone so I'm not going to do a full review, but I think I agree with @kraih there are things in there that don't pass the smell test

@Grinnz
Copy link
Contributor

Grinnz commented Aug 13, 2023

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

@codehead
Copy link
Author

codehead commented Aug 13, 2023

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

Thank you @Grinnz . With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

The thing is, you don't know you are dealing with this issue (I still refuse it to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than a silent failure.

@Grinnz
Copy link
Contributor

Grinnz commented Aug 13, 2023

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

Thank you @Grinnz . With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

The thing is, you don't know you are dealing with this issue (I still refuse it to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than a silent fail.

There are plenty of issues one may run into in real world use cases, the framework has to balance predicting these with making the common and compliant case simple and efficient, in this case calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive JSON.

@jberger
Copy link
Member

jberger commented Aug 16, 2023

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

Thank you @Grinnz . With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

The thing is, you don't know you are dealing with this issue (I still refuse it to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than a silent fail.

There are plenty of issues one may run into in real world use cases, the framework has to balance predicting these with making the common and compliant case simple and efficient, in this case calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive JSON.

I concur with @Grinnz, there is a difference in that "text" is specified to be in a known (configured) charset, but JSON is by definition utf-8 encoded. As such, bytes that are json-like but encoded as some other charset are technically not actually JSON. I know that's not a very satisfactory answer, and it sounds dismissive, but given the difference there it seems not unreasonable to make the spec-compliant mechanism as simple and fast as possible while still making your use-case possible, if not quite as convenient. It seems that that is already the case.

@kraih
Copy link
Member

kraih commented Aug 26, 2023

I'm afraid it looks like this PR will not pass the vote.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants