Improve performance of maxBytes() #749
base: main
Conversation
Thanks for the catch! Why would you prefer to use UTF-8?
It just tends to be how we specify things. JavaScript strings are UTF-16 encoded, but many languages default to UTF-8 (Rust, Swift, Go, Ruby, etc.). If you don't check these in an encoding-aware way, you end up with strings that are allowed to be different lengths depending on the language you are checking them in. Aside: I've also opened an issue in the WHATWG Encoding Standard to suggest a faster method for this: whatwg/encoding#333
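For instance (a small illustration, not from the thread), the same string measures differently depending on which units you count:

```ts
// The same string has different "lengths" per encoding.
const s = "héllo 👋";

console.log(s.length);                           // 8 UTF-16 code units
console.log(new TextEncoder().encode(s).length); // 11 UTF-8 bytes
console.log([...s].length);                      // 7 Unicode code points
```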
You can actually make this even faster by avoiding calculating the specific number of bytes when possible:

```ts
if (dataset.value.length > maxBytes) {
  // The minimum possible number of bytes is already too long
}

if (dataset.value.length * 3 <= maxBytes) {
  // The maximum possible number of bytes is already small enough
}
```
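To make the shortcut above concrete (the limit and strings here are invented for illustration):

```ts
const maxBytes = 12;

// 20 code units need at least 20 UTF-8 bytes -> reject without encoding
console.log("a".repeat(20).length > maxBytes);  // true

// 3 code units need at most 9 UTF-8 bytes -> accept without encoding
console.log("abc".length * 3 <= maxBytes);      // true

// A 10-code-unit string could be anywhere from 10 to 30 bytes: only
// this middle case actually requires encoding to decide.
```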
Can you provide more details? When does this work? In general, how should I proceed with this PR? What is your recommendation?
Sure, let me give you some context:

**Encoding is expensive**

Strings in JavaScript are sequences of UTF-16 code units. Converting a sequence of UTF-16 code units to UTF-8 is a relatively expensive operation that involves a fair bit of math, but right now it's the only way to calculate the UTF-8 byte length of a string in browsers. It's much faster to calculate just the number of bytes in a string, because you can skip the work of converting code units into their specific encoded values and just match UTF-16 ranges to byte counts. This is the primary reason why dedicated byte-length functions exist in many other environments. But since that's not an option on the web, you can at least avoid encoding the entire string and only check whether it's too long, which is what this PR does with `TextEncoder.encodeInto()`.
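As an illustration of that range-matching idea (a hedged sketch, not code from the PR or the thread), the UTF-8 byte length can be counted without producing a single encoded byte:

```ts
// Sketch: count UTF-8 bytes by classifying UTF-16 code units into byte
// widths instead of actually encoding them.
function utf8ByteLength(input: string): number {
  let bytes = 0;
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit < 0x80) bytes += 1;       // ASCII
    else if (unit < 0x800) bytes += 2; // two-byte range
    else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < input.length &&
      input.charCodeAt(i + 1) >= 0xdc00 && input.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // valid surrogate pair -> one 4-byte sequence
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP (lone surrogates count as U+FFFD)
    }
  }
  return bytes;
}
```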
**Skipping encoding when possible**

You can optimize this even further by avoiding encoding at all, with a little bit of knowledge about how UTF-16 code units get converted to UTF-8 bytes. The conversion ratio of a UTF-16 code unit encoded in UTF-8 is 1–3 bytes. So without having to encode anything, we can know the minimum and maximum possible byte length of any JavaScript string just by doing:

```js
let MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT = 1
let MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT = 3

let min = string.length * MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT
let max = string.length * MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT
```

Not needing to encode anything will speed up the vast majority of usage of `maxBytes()`.
**Optimized solution**

This is a slightly updated version of the current PR, which is currently the fastest option for asserting that a string is under a certain UTF-8 byte length:

```ts
let encoder: TextEncoder

function maxBytes(bytes: number) {
  let array: Uint8Array
  return function check(input: string): boolean {
    // The minimum possible byte length is already too long
    if (input.length > bytes) return false
    // The maximum possible byte length is already small enough
    if (input.length * 3 <= bytes) return true
    encoder ??= new TextEncoder()
    array ??= new Uint8Array(bytes)
    // encodeInto() stops once the buffer is full, so the whole input was
    // consumed if and only if it fits within `bytes` bytes
    let read = encoder.encodeInto(input, array).read
    return read === input.length
  }
}
```
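A quick usage sketch of the check above (the limit and strings are just examples):

```ts
const check = maxBytes(8);

console.log(check("hi"));           // true: 2 * 3 = 6 <= 8, no encoding needed
console.log(check("a".repeat(20))); // false: 20 code units need at least 20 bytes
console.log(check("héllo 👋"));     // neither bound decides, so it encodes:
                                    // 11 bytes don't fit in 8 -> false
```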
**Problem**

The only problem is that this doesn't give you a `received` byte length to report. If you're okay with dropping the `received` value, this is the fastest option.
Thanks for the details! I now understand the problem and the possible solutions much better. I am not sure what to do. We use the expected and received pattern everywhere, so I am not sure if we should make an exception here. On the other hand, of course, I see the downside that this could be abused by sending extremely long strings to the server.
Some options:
I think at the moment I prefer to wait until more developers encounter this problem, to get more feedback on how to proceed. In the meantime, as a workaround, you could implement a custom check yourself.
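Such a workaround might look roughly like this (a sketch assuming Valibot's `pipe`/`check` API; `fastMaxBytes` and the schema name are illustrative, not part of the library):

```ts
import * as v from "valibot";

// Hypothetical helper reusing the optimized strategy from the thread.
const encoder = new TextEncoder();

function fastMaxBytes(limit: number): (input: string) => boolean {
  const buffer = new Uint8Array(limit);
  return (input) => {
    if (input.length > limit) return false;     // min bytes already too long
    if (input.length * 3 <= limit) return true; // max bytes already small enough
    return encoder.encodeInto(input, buffer).read === input.length;
  };
}

// Wire the fast byte check in as a custom validation; note it reports a
// fixed message instead of the usual expected/received pair.
const MessageSchema = v.pipe(
  v.string(),
  v.check(fastMaxBytes(1024), "Expected at most 1024 bytes")
);
```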
I'd like to use `maxBytes()` to help prevent certain types of abuse that send large amounts of data to clients in order to overload them. This makes it fairly important that `maxBytes()` is as fast as possible and is not itself a bottleneck if someone attempts to send hundreds of megabytes of data.

Before: since the current implementation will always read all of the bytes of the string, there is no significant difference between the size limits being enforced.

After (ops/sec): caching the `Uint8Array` speeds up 1B inputs by about 30x, and because this new implementation stops reading/writing at `N + 4` bytes (where `N` is the maxBytes `requirement`), the performance is more dependent on the `requirement` than on the input itself.
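As a note on that `N + 4` bound (a sketch of the idea, not the PR's actual code): 4 bytes is the longest UTF-8 sequence, so giving the destination buffer that much slack means `encodeInto()` can always write at least one byte past the limit whenever the input is too long:

```ts
// Any input whose UTF-8 length exceeds `requirement` writes past it,
// because the 4 bytes of slack always fit the next code point. So a
// single capped encodeInto() call decides validity without a full encode.
const requirement = 16;
const buffer = new Uint8Array(requirement + 4);
const encoder = new TextEncoder();

function isUnderLimit(input: string): boolean {
  const written = encoder.encodeInto(input, buffer).written;
  return written <= requirement;
}

console.log(isUnderLimit("short"));               // true: 5 bytes
console.log(isUnderLimit("x".repeat(1_000_000))); // false: stops after 20 bytes
```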