Amount of bytes to sniff for encoding detection #102
I think Firefox uses 1024 bytes and I think that is what we should aim for anywhere we need to resort to sniffing (it seems like that discussion isn't over yet either: when we need to sniff and for which encodings). The more bytes you wait for, the longer it takes for the user to see anything, especially on slower connections. (Although I guess you could still start decoding ASCII bytes in some generic decoder, as long as you don't get special ISO-2022-JP bytes wrong, since we should not sniff for UTF-16. (Or should we sniff for UTF-16? I believe WebKit does it, but only for the first couple of bytes, so that would not matter here.))
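To make the "keep handling ASCII while you wait" idea concrete, here is a minimal sketch; the function name is mine and nothing from an actual engine, and it assumes UTF-16 is excluded from sniffing, since otherwise even ASCII-looking bytes could not be streamed ahead of the decision:

```python
def safe_ascii_prefix(buffer: bytes) -> bytes:
    """Return the prefix of `buffer` that decodes identically under any
    ASCII-compatible encoding the sniffer might eventually pick.

    Stops at the first non-ASCII byte (whose meaning depends on the final
    encoding) and at ESC (0x1B), which starts ISO-2022-JP escape sequences
    and changes the meaning of the ASCII bytes that follow it.
    """
    for i, b in enumerate(buffer):
        if b >= 0x80 or b == 0x1B:
            return buffer[:i]
    return buffer


# The ASCII markup before the escape sequence could already be handed to the
# parser while sniffing continues on the full buffer.
chunk = b"<!doctype html><title>\x1b$B$3\x1b(B</title>"
print(safe_ascii_prefix(chunk))  # b'<!doctype html><title>'
```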
For what it's worth, the spec kind of documents this: https://html.spec.whatwg.org/multipage/syntax.html#encoding-sniffing-algorithm
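For readers following along, the broad shape of that algorithm is roughly the sketch below; this is a simplified, non-normative Python rendering, and the prescan helper is a toy regex rather than the spec's real prescan:

```python
import re

def prescan_meta(head: bytes):
    """Toy stand-in for the spec's <meta> prescan; callers pass at most the
    first 1024 bytes."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([-\w:]+)', head, re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None

def sniff_encoding(first_bytes, user_override=None, transport_charset=None,
                   likely_encoding=None, default_encoding="windows-1252"):
    """Precedence order, roughly as in the linked algorithm: BOM, user
    override, transport-layer charset, <meta> prescan of the first 1024
    bytes, then autodetection / other hints, then a locale default."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16be"
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16le"
    if user_override:
        return user_override
    if transport_charset:
        return transport_charset
    prescanned = prescan_meta(first_bytes[:1024])
    if prescanned:
        return prescanned
    if likely_encoding:  # detector guess, parent frame, history, etc.
        return likely_encoding
    return default_encoding

print(sniff_encoding(b'<meta charset="Shift_JIS"><p>...'))  # shift_jis
```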
I think we should do 1024 bytes for consistency with Firefox.
Firefox had the 1024-byte limit for content-based sniffing during the Firefox 4 development cycle. However, that was known to break one site: Japanese Planet Debian (EUC-JP, but these days properly declared). Decision-makers postulated that the one site would be only the tip of an iceberg, so I had to change the detector to run on more bytes. So today, if the Japanese detector is enabled, it runs on the first 1024 bytes before the parse starts. Then it continues to run, and if it revises its guess during the parse, the page is re-navigated to with the newly-guessed encoding. I guess we should gather telemetry to see how often this happens.
Why would consistency with Firefox be important?
This sounds complex and bug-prone, so I don't think we would be willing to introduce similar behavior in Chromium. I would much rather increase the constant than resort to this.
Yes, I would be very interested in hearing the results of telemetry.
To rely on the Firefox experience that waiting for 1024 without a timeout doesn't break the Web. (I.e. there have been only a couple of complaints since Firefox 4 about scripts that intentionally send content piecemeal but unintentionally don't declare the encoding early.) Also, if the buffer size for byte pattern detection were greater than the buffer size for the meta prescan, there would be pressure to increase the prescan buffer to match. (On the topic of assessing necessity, it would be good to hear from Edge devs what exactly, if anything, they do about byte pattern sniffing, considering that they don't have a manual override menu as a recourse, either.)
Increasing the constant makes the probability of breakage go up for pages that intentionally stall the TCP stream. Adding a timeout to address that concern has the problem of the relationship between the input bytes and the resulting DOM tree becoming timing-dependent.
OK, this is the argument that makes the most sense to me for holding to 1024. Do you know why some pages do this and what signal they are waiting for to unstall the stream?
I'm planning to get some stats to understand the impact of switching the size to 1024 (i.e. documents that give different encoding guesses depending on the input size: 1024 vs. 2048, 4096, etc.).
Ran the detector over a substantial amount of unlabelled documents and got the following stat: 84.36% of the unlabelled HTML documents return the correct text encoding when their first 1K is fed to the detector (meaning the guessed encoding remained the same even if we gave a bigger chunk like 2K ~ 4K); 1.40% are exceptions. Quite big, but it doesn't look like a decisive number either.
One sixth of detections failing seems like a high error rate, so I think this data indicates we should standardize on 4096. As for the slippery-slope argument that Chromium might eventually increase the meta tag scanning limit to 4096 as well, we'd be happy to add a unit test (perhaps a web platform test?) verifying that the meta tag scan fails if the tag is past 1024 bytes.
I'd like some more detail on this detector. I assume the goal is to avoid reloading, so would we essentially stop the parser if we hit non-ASCII in the first 4k and then wait for the 4k or end-of-file to determine the encoding for the remainder of the bytes? Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible? Does it only run as a last resort? I also know that Henri has been exploring alternative strategies for locale-specific fallback: rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.
You can see the API to the CED detector in https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.h.
My thinking (and perhaps we can spec this) is that the algorithm would go:
I think you meant to say UTF-7? We have the choice, we can set
I agree TLD is an important signal. The CED detector already takes a "url" parameter, of which it primarily looks at the TLD, and uses it as a tiebreaker in cases where the content encoding is ambiguous. The percentages in the data above already take the URL into consideration, so the 84% success rate represents the best we can do given the first 1024 bytes plus the TLD.
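A toy illustration of what "TLD as a tiebreaker" means in practice; the hint table, threshold, and function are invented for this sketch and say nothing about CED's actual scoring:

```python
from urllib.parse import urlsplit

# Made-up hints: the point is only that the TLD acts as a tiebreaker,
# not as the primary signal. CED's real model is statistical.
TLD_HINTS = {"jp": "shift_jis", "ru": "windows-1251", "gr": "iso-8859-7"}

def pick_encoding(candidates, url):
    """`candidates` is a list of (encoding, score) pairs from content
    analysis, highest score first. The TLD hint only matters when the top
    two scores are effectively tied."""
    best, runner_up = candidates[0], candidates[1]
    if best[1] - runner_up[1] > 0.05:   # clear winner: ignore the TLD
        return best[0]
    tld = urlsplit(url).hostname.rsplit(".", 1)[-1]
    hinted = TLD_HINTS.get(tld)
    return hinted if hinted in (best[0], runner_up[0]) else best[0]

print(pick_encoding([("shift_jis", 0.51), ("euc-jp", 0.49)],
                    "http://example.jp/page"))  # shift_jis
```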
Earlier in the sentence you said you wouldn't feed things into the parser so this wouldn't apply, right?
No, UTF-7 must obviously not be detected. It's not an encoding to begin with per the Encoding Standard. It's still unclear to me what motivated the CED detector, as other browsers don't have a similarly complex thing. Also, looking through the code of the CED detector, it doesn't seem aligned with the Encoding Standard at all. https://github.com/google/compact_enc_det/blob/master/util/encodings/encodings.cc lists very different encodings and labels.
That's correct: there won't be reparsing. Feeding will be delayed until the encoding gets detected.
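A minimal sketch of that delayed-feed approach, assuming the 4096-byte cap proposed earlier; `prescan` and `detect` are hypothetical callables standing in for the meta prescan and the detector, not Blink APIs:

```python
import io

DETECTION_LIMIT = 4096  # the buffer size under discussion (4K vs. 1K)

def decide_encoding(stream, prescan, detect, declared=None):
    """Sketch of the delayed-feed idea: if nothing declares the encoding,
    buffer up to DETECTION_LIMIT bytes (or EOF), run the detector once on
    that buffer, and only then start feeding the parser, so no reparsing or
    reloading is needed. Returns (encoding, buffered_bytes_to_replay)."""
    if declared:                      # BOM / HTTP charset / user override
        return declared, b""
    buffered = bytearray()
    while len(buffered) < DETECTION_LIMIT:
        chunk = stream.read(DETECTION_LIMIT - len(buffered))
        if not chunk:                 # EOF before the limit
            break
        buffered += chunk
        found = prescan(bytes(buffered[:1024]))   # <meta> prescan still wins
        if found:
            return found, bytes(buffered)
    return detect(bytes(buffered)), bytes(buffered)

# Toy usage with stand-in prescan/detect callables.
enc, replay = decide_encoding(io.BytesIO(b"<p>\x82\xb1\x82\xf1" * 600),
                              prescan=lambda b: None,
                              detect=lambda b: "shift_jis")
print(enc, len(replay))   # shift_jis 4096
```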
CED can be (and is) configured to conform to the standard (HTML5 mode). Users have to switch the encoding manually via a menu or plugin when viewing legacy web pages without an encoding label. Many of them have no idea what the concept of an encoding is, how to get the page right, or which encoding to choose. CED tries to address that by making the best guess for those pages.
There is no standard for encoding detection though. And the description at https://github.com/google/compact_enc_det/blob/master/util/encodings/encodings.cc#L835 for encoding labels goes against https://encoding.spec.whatwg.org/ and therefore also against HTML. It's not clear to me how much of that ends up being exposed in Chrome though; maybe none of it is? But we'd really need a much more detailed description or set of test cases to know what exactly Chrome has unilaterally added to the web platform.
This will give you an idea of how CED works in a web-standard-compliant way:
I think the better thread for this question is #68 |
The encoding name strings in CED aren't web-exposed in Chrome. They didn't become valid meta tag encoding names or anything.
We are currently detecting ISO-2022-JP and UTF-16. We'll make another triage pass and come up with a minimal whitelist of necessary ones (which almost certainly shouldn't contain UTF-16 at least).
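For illustration, enforcing such a whitelist on top of whatever the detector returns could be as simple as the following; the set shown is a guess for the sketch, not Chromium's actual list:

```python
# Hypothetical post-filter: only let the detector's guess through if it is on
# an agreed whitelist; anything else falls back to the default.
DETECTABLE = {"shift_jis", "euc-jp", "big5", "gbk", "euc-kr",
              "windows-1251", "koi8-r", "iso-8859-7", "windows-1256"}

def filter_guess(guess: str, fallback: str) -> str:
    guess = guess.lower()
    return guess if guess in DETECTABLE else fallback

print(filter_guess("UTF-16LE", "windows-1252"))  # windows-1252
print(filter_guess("EUC-KR", "windows-1252"))    # euc-kr
```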
What motivated it is that the encoding selection menu other browsers have isn't a workable solution on phones, where UI space is limited. Also, Firefox, IE and older versions of Chrome do have autodetectors; they're just not enabled by default.
AFAICT, Edge has no such menu. What detection, if any, does Edge have? Does Safari on iOS have a menu? (I don't have an iDevice at hand to test with.) What detection does Safari (Mac and iOS) have?
Safari on iOS doesn't have a menu as far as I know. It appears they use the ICU detector, though I haven't written tests to verify: https://github.com/WebKit/webkit/blob/master/Source/WebCore/platform/text/TextEncodingDetectorICU.cpp. (It seems that's a contribution from Google...)
In the non-file: URL case for unlabeled content, since Firefox 73, when the meta prescan fails at 1024 bytes, Firefox feeds those 1024 bytes to chardetng, asks chardetng to make a guess, and then starts parsing HTML based on that guess. All the rest of the bytes are still fed to chardetng in addition to being fed to the HTML parser. When EOF is reached, chardetng guesses again. If the guess differs from the previous guess, the page is reloaded with this second guess applied. chardetng does not run on the .in, .lk, and .jp TLDs. The former two don't trigger a detector at all, and .jp uses a special-purpose detector discussed in another issue. As for Safari, I didn't find any evidence of the ICU-based detector mentioned in the previous comment being in use in Safari on Mac. However, the Japanese-specific detector in WebKit appears to be in effect when Safari's UI language is set to Japanese.
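Restating that flow as a sketch (all names are hypothetical stand-ins rather than Gecko APIs, and the .in/.lk/.jp TLD exceptions are omitted):

```python
def load_unlabeled_html(stream, prescan, chardetng, parser, reload_with):
    """Restatement of the flow described above. `prescan`, `chardetng`,
    `parser` and `reload_with` are hypothetical stand-ins, not Gecko APIs."""
    first = stream.read(1024)
    encoding = prescan(first)                 # <meta> prescan on 1024 bytes
    detecting = encoding is None
    if detecting:
        chardetng.feed(first)
        encoding = chardetng.guess()          # initial guess before parsing
    parser.start(encoding)
    parser.feed(first)
    while chunk := stream.read(65536):
        if detecting:
            chardetng.feed(chunk)             # detector keeps running
        parser.feed(chunk)
    if detecting and chardetng.guess() != encoding:
        reload_with(chardetng.guess())        # re-navigate with the EOF guess
```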
This issue is kind of spun off of #68 to have a bug dedicated to the specific question raised in that thread: the data size to be used for encoding detection.
Currently, Blink feeds the encoding detector the first chunk of data it gets from the network to guess the text encoding used in a given document. The size varies depending on external conditions like network configuration, so it is possible that the detected encoding also varies. I'm experimenting with the idea discussed in https://bugs.chromium.org/p/chromium/issues/detail?id=691985#c16 so that the same amount of data for a document is always fed to the encoding detector, in order to get a consistent result regardless of other conditions.
The size I'm thinking of is 4K, though 1K was initially suggested in the Chromium bug entry. The size of the entire document will be used if it is smaller than that. Are there any values used in other browsers for reference? I prefer more than 1K for encoding detection since, in some anecdotal observations I made, documents have ASCII-only tags and scripts in the first 1K, which don't give the detector enough clues to make the right guess.
FWIW, Blink looks at the entire <head> for a charset meta tag, or scans up to the first 1024 bytes to find one even past <head>. There must have been a historical reason behind that I'm not aware of.