Improve performance of maxBytes() #749
base: main
Conversation
Thanks for the catch! Why would you prefer to use UTF-8?
It just tends to be how we specify things. JavaScript strings are UTF-16 encoded, but many languages default to UTF-8 (Rust, Swift, Go, Ruby, etc.). If you don't check these in an encoding-aware way, you end up with strings that are allowed to be different lengths depending on the language you are checking them in. Aside: I've also opened an issue in the WHATWG Encoding Standard to suggest a faster method for this: whatwg/encoding#333
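For instance (a small illustration, not from the thread), the same string measures differently depending on which units you count:

```ts
// The same string has different "lengths" per encoding.
const s = "héllo 👋";

console.log(s.length);                           // 8 UTF-16 code units
console.log(new TextEncoder().encode(s).length); // 11 UTF-8 bytes
console.log([...s].length);                      // 7 Unicode code points
```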
You can actually make this even faster by avoiding calculating the specific number of bytes when possible:

```ts
if (dataset.value.length > maxBytes) {
  // The minimum possible number of bytes is already too long
}

if (dataset.value.length * 3 <= maxBytes) {
  // The maximum possible number of bytes is already small enough
}
```
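To make the shortcut above concrete (the limit and strings here are invented for illustration):

```ts
const maxBytes = 12;

// 20 code units need at least 20 UTF-8 bytes -> reject without encoding
console.log("a".repeat(20).length > maxBytes);  // true

// 3 code units need at most 9 UTF-8 bytes -> accept without encoding
console.log("abc".length * 3 <= maxBytes);      // true

// A 10-code-unit string could be anywhere from 10 to 30 bytes: only
// this middle case actually requires encoding to decide.
```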
Can you provide more details? When does this work? In general, how should I proceed with this PR? What is your recommendation?
Sure, let me give you some context:

**Encoding is expensive**

Strings in JavaScript are sequences of UTF-16 code units. Converting a sequence of UTF-16 code units to UTF-8 is a relatively expensive operation that involves a fair bit of math, but right now it's the only way to calculate the UTF-8 byte length of a string in browsers. It's much faster to calculate just the number of bytes in a string, because you can skip the work of converting code units into their specific encoded values and just match UTF-16 ranges to byte counts. This is the primary reason why dedicated byte-length functions exist in many other environments. But since that's not an option on the web, you can at least avoid encoding the entire string and only check whether it's too long, which is what this PR does with `TextEncoder.encodeInto()`.
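As an illustration of that range-matching idea (a hedged sketch, not code from the PR or the thread), the UTF-8 byte length can be counted without producing a single encoded byte:

```ts
// Sketch: count UTF-8 bytes by classifying UTF-16 code units into byte
// widths instead of actually encoding them.
function utf8ByteLength(input: string): number {
  let bytes = 0;
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit < 0x80) bytes += 1;       // ASCII
    else if (unit < 0x800) bytes += 2; // two-byte range
    else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < input.length &&
      input.charCodeAt(i + 1) >= 0xdc00 && input.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // valid surrogate pair -> one 4-byte sequence
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP (lone surrogates count as U+FFFD)
    }
  }
  return bytes;
}
```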
**Skipping encoding when possible**

You can optimize this even further by avoiding encoding at all, with a little bit of knowledge about how UTF-16 code units get converted to UTF-8 bytes. The conversion ratio of a UTF-16 code unit encoded in UTF-8 is 1–3 bytes. So without having to encode anything, we can know the minimum and maximum possible byte length of any JavaScript string just by doing:

```js
let MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT = 1
let MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT = 3

let min = string.length * MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT
let max = string.length * MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT
```

Not needing to encode anything will speed up the vast majority of usage of `maxBytes()`.
**Optimized solution**

This is a slightly updated version of the current PR, which is currently the fastest option for asserting that a string is under a certain UTF-8 byte length:

```ts
let encoder: TextEncoder

function maxBytes(bytes: number) {
  let array: Uint8Array
  return function check(input: string): boolean {
    // The minimum possible byte length is already too long
    if (input.length > bytes) return false
    // The maximum possible byte length is already small enough
    if (input.length * 3 <= bytes) return true
    encoder ??= new TextEncoder()
    array ??= new Uint8Array(bytes)
    // encodeInto() stops once the buffer is full, so the whole input was
    // consumed if and only if it fits within `bytes` bytes
    let read = encoder.encodeInto(input, array).read
    return read === input.length
  }
}
```
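A quick usage sketch of the check above (the limit and strings are just examples):

```ts
const check = maxBytes(8);

console.log(check("hi"));           // true: 2 * 3 = 6 <= 8, no encoding needed
console.log(check("a".repeat(20))); // false: 20 code units need at least 20 bytes
console.log(check("héllo 👋"));     // neither bound decides, so it encodes:
                                    // 11 bytes don't fit in 8 -> false
```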
**Problem**

The only problem is that this doesn't give you a `received` byte length to report. If you're okay with dropping the `received` value, this is the fastest option.
Thanks for the details! I now understand the problem and the possible solutions much better. I am not sure what to do. We use the expected and received pattern everywhere, so I am not sure if we should make an exception here. On the other hand, of course, I see the downside that this could be abused by sending extremely long strings to the server.
Some options:
I think at the moment I prefer to wait until more developers encounter this problem, to get more feedback on how to proceed. In the meantime, as a workaround, you could implement a custom check yourself.
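Such a workaround might look roughly like this (a sketch assuming Valibot's `pipe`/`check` API; `fastMaxBytes` and the schema name are illustrative, not part of the library):

```ts
import * as v from "valibot";

// Hypothetical helper reusing the optimized strategy from the thread.
const encoder = new TextEncoder();

function fastMaxBytes(limit: number): (input: string) => boolean {
  const buffer = new Uint8Array(limit);
  return (input) => {
    if (input.length > limit) return false;     // min bytes already too long
    if (input.length * 3 <= limit) return true; // max bytes already small enough
    return encoder.encodeInto(input, buffer).read === input.length;
  };
}

// Wire the fast byte check in as a custom validation; note it reports a
// fixed message instead of the usual expected/received pair.
const MessageSchema = v.pipe(
  v.string(),
  v.check(fastMaxBytes(1024), "Expected at most 1024 bytes")
);
```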
I'd like to use `maxBytes()` to help prevent certain types of abuse that send large amounts of data to clients in order to overload them. This makes it fairly important that `maxBytes()` is as fast as possible and is not itself a bottleneck if someone attempts to send hundreds of megabytes of data.

Before: since the current implementation will always read all of the bytes of the string, there is no significant difference between the size limits being enforced.

After (ops/sec): caching the `Uint8Array` speeds up 1B inputs by about 30x, and because this new implementation stops reading/writing at `N + 4` bytes (where `N` is the maxBytes `requirement`), the performance is more dependent on the `requirement` than on the input itself.
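As a note on that `N + 4` bound (a sketch of the idea, not the PR's actual code): 4 bytes is the longest UTF-8 sequence, so giving the destination buffer that much slack means `encodeInto()` can always write at least one byte past the limit whenever the input is too long:

```ts
// Any input whose UTF-8 length exceeds `requirement` writes past it,
// because the 4 bytes of slack always fit the next code point. So a
// single capped encodeInto() call decides validity without a full encode.
const requirement = 16;
const buffer = new Uint8Array(requirement + 4);
const encoder = new TextEncoder();

function isUnderLimit(input: string): boolean {
  const written = encoder.encodeInto(input, buffer).written;
  return written <= requirement;
}

console.log(isUnderLimit("short"));               // true: 5 bytes
console.log(isUnderLimit("x".repeat(1_000_000))); // false: stops after 20 bytes
```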