Improve performance of maxBytes() #749

Open · wants to merge 2 commits into main

Conversation

@jamiebuilds commented Jul 26, 2024

I'd like to use maxBytes() to help prevent certain types of abuse where large amounts of data are sent to clients in order to overload them. This makes it fairly important that maxBytes() is as fast as possible and does not itself become a bottleneck if someone attempts to send hundreds of megabytes of data.

Before:

Since the current implementation always reads all of the bytes of the string, there is no significant difference between the size limits being enforced.
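Roughly speaking, the existing check has to encode the entire input just to learn its byte length (a paraphrased sketch, not the exact source; input and requirement are placeholder names):

// Paraphrased: every byte of the input gets encoded, regardless of the limit
const length = new TextEncoder().encode(input).length
const tooLong = length > requirement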

| Input / limit | Any size |
| --- | --- |
| 1B | 1’894’455.1 |
| 5MB | 1’507.1 |
| 50MB | 150.1 |
| 500MB | 13.4 |
| max | 12.5 |

All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.

After: (ops/sec)

Caching the Uint8Array speeds up 1B inputs by about 30x, and because the new implementation stops reading/writing at N+4 bytes (the maxBytes requirement plus 4), performance depends more on the requirement than on the input itself.

Note: it needs to read past the requirement to test whether the string is too long; the +4 bytes is because the next character in the string could be anywhere from 1 to 4 bytes, and encodeInto() will drop it if there's not enough room in the array.
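As a small illustration of that dropping behavior (standalone example, not code from this PR):

const encoder = new TextEncoder()
const buffer = new Uint8Array(5)

// 'héllo' is 6 UTF-8 bytes: h(1) + é(2) + l(1) + l(1) + o(1)
const { read, written } = encoder.encodeInto('héllo', buffer)

// read === 4 and written === 5: the trailing 'o' did not fit and was dropped,
// so read < 'héllo'.length tells us the string exceeds the buffer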

| Input / limit | <=5MB | <=50MB | <=500MB |
| --- | --- | --- | --- |
| 1B | 30’240’582.3 | 31’011’981.5 | 30’389’671.3 |
| 5MB | 6’611.1 | 7’095.6 | 7’071.8 |
| 50MB | 6’736.9 | 551.4 | 498.7 |
| 500MB | 6’685.8 | 547.4 | 51.6 |
| max | 6’318.7 | 538.0 | 51.5 |

All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.

@fabian-hiller self-assigned this Jul 27, 2024
@fabian-hiller added the enhancement label Jul 27, 2024
@fabian-hiller (Owner) commented

Thanks for the catch! Why would you prefer to use maxBytes instead of maxLength?

@jamiebuilds (Author) commented

It just tends to be how we specify things. JavaScript strings are UTF-16 encoded, but many languages default to UTF-8 (Rust, Swift, Go, Ruby, etc.). If you don't check these limits in an encoding-aware way, you end up with strings that are allowed to be different lengths depending on the language you are checking them in.
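A quick illustration of the mismatch (plain JavaScript, nothing project-specific assumed):

const s = '💩' // one emoji, U+1F4A9

s.length                            // 2 UTF-16 code units
new TextEncoder().encode(s).length  // 4 UTF-8 bytes

So a maxLength(3) check passes in JavaScript, while a 3-byte UTF-8 limit enforced in Rust or Go would reject the same value.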

Aside: I've also opened an issue in the whatwg/encoding standard to suggest a faster method for this whatwg/encoding#333

@jamiebuilds (Author) commented Jul 29, 2024

You can actually make this even faster by avoiding calculating the specific number of bytes when possible:

if (dataset.value.length > maxBytes) {
  // The minimum possible number of UTF-8 bytes already exceeds the limit
}

if (dataset.value.length * 3 <= maxBytes) {
  // The maximum possible number of UTF-8 bytes is already within the limit
}

@fabian-hiller (Owner) commented

You can actually make this even faster by avoiding calculating the specific number of bytes when possible:

Can you provide more details? When does this work? In general, how should I proceed with this PR? What is your recommendation?

@jamiebuilds (Author) commented

Sure, let me give you some context:

Encoding is expensive

Strings in JavaScript are sequences of UTF-16 code units.

  • In UTF-16, each character is 1 or 2 code units (2 or 4 bytes)
  • In UTF-8, each character is 1 to 4 code units (1 to 4 bytes)

Converting a sequence of UTF-16 code units to UTF-8 code units is a relatively expensive operation that involves a fair bit of math, but right now it's the only way to calculate the UTF-8 byte length of a string in browsers.

It's much faster to calculate just the number of bytes in a string, because you can skip the work of converting the code units into their specific byte values and simply map UTF-16 ranges to byte counts. This is the primary reason why functions like Buffer.byteLength(input) are much faster than new TextEncoder().encode(input) (besides not needing to allocate more memory).

But since that's not an option on the web, you can at least avoid encoding the entire string and only check whether it's too long, which is what this PR does with encodeInto().
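For comparison, a minimal sketch of the two ways to get a UTF-8 byte length (the Node.js Buffer API is not available in browsers, which is why this PR has to rely on TextEncoder):

// Node.js: counts UTF-8 bytes without materializing the encoded data
import { Buffer } from 'node:buffer'
const nodeBytes = Buffer.byteLength('héllo 𝄢', 'utf8')

// Web: encodes the whole string into a new Uint8Array just to read its length
const webBytes = new TextEncoder().encode('héllo 𝄢').length

console.log(nodeBytes === webBytes) // true, but the web version did far more work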

Skipping encoding when possible

You can optimize this even further by avoiding encoding at all with a little bit of knowledge about how UTF-16 code units get converted to UTF-8 bytes.

| Char | UTF-16 | UTF-8 | UTF-16 units → UTF-8 bytes |
| --- | --- | --- | --- |
| 'a' | 0061 (1 unit, 2 bytes) | 41 (1 byte) | 1 → 1 (min) |
| '¢' | 00A2 (1 unit, 2 bytes) | C2 A2 (2 bytes) | 1 → 2 |
| 'ก' | 0E01 (1 unit, 2 bytes) | E0 B8 81 (3 bytes) | 1 → 3 (max) |
| '𝄢' | D834 DD22 (2 units, 4 bytes) | F0 9D 84 A2 (4 bytes) | 2 → 4 |

In other words, each UTF-16 code unit turns into somewhere between 1 and 3 UTF-8 bytes (a surrogate pair is 2 units → 4 bytes, which stays within that range).

So without encoding anything, we can compute the minimum and maximum possible UTF-8 byte length of any JavaScript string just by doing:

let MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT = 1
let MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT = 3

let min = string.length * MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT
let max = string.length * MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT

Not needing to encode anything will speed up the vast majority of uses of maxBytes().

Optimized Solution

This is a slightly updated version of the code in this PR, and it's currently the fastest option for asserting that a string is under a certain UTF-8 byte length:

let encoder: TextEncoder
function maxBytes(bytes: number) {
  let array: Uint8Array
  return function check(input: string): boolean {
    // Each UTF-16 code unit is at least 1 UTF-8 byte,
    // so more code units than allowed bytes can never fit
    if (input.length > bytes) return false
    // Each UTF-16 code unit is at most 3 UTF-8 bytes,
    // so this always fits and encoding can be skipped entirely
    if (input.length * 3 <= bytes) return true
    encoder ??= new TextEncoder()
    array ??= new Uint8Array(bytes)
    // encodeInto() stops once the next character no longer fits into the array,
    // so the string is within the limit only if every code unit was read
    let read = encoder.encodeInto(input, array).read
    return read === input.length
  }
}
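
For illustration, a hypothetical usage of the sketch above (the names and the 1024-byte limit are made up):

const under1KiB = maxBytes(1024)

under1KiB('hello')           // true via the fast path: at most 5 * 3 = 15 bytes
under1KiB('a'.repeat(2000))  // false via the fast path: 2000 code units can never fit in 1024 bytes
under1KiB('€'.repeat(400))   // falls through to encodeInto(): 400 * 3 = 1200 bytes, so false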

Problem

The only problem is that this doesn't give you a received value with a specific byte length. The only reason this is fast is that it gives up computing the exact byte length of the string once it's past the limit.

If you're okay with dropping the received value from the issue this adds, you could use this more optimized version.

@fabian-hiller (Owner) commented

Thanks for the details! I now understand the problem and the possible solutions much better. I am not sure what to do. We use the expected and received pattern everywhere, so I am not sure if we should make an exception here. On the other hand, I of course see the downside that this could be abused by sending extremely long strings to the server.

@jamiebuilds (Author) commented

Some options:

  • Drop received
  • Make received an estimate `${string.length} to ${string.length * 3}`
  • Make received an estimate only at larger sizes
  • Make received the UTF-16 bytes `${string.length * 2} UTF-16 bytes`
  • Add an option to disable received

@fabian-hiller (Owner) commented

I think at the moment I prefer to wait until more developers encounter this problem to get more feedback on how to proceed. In the meantime, as a workaround, you could implement a fastMaxBytes action yourself for use in your own schemas.
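
A rough sketch of such a workaround, assuming Valibot's v.pipe/v.check API (the fastMaxBytes name, limit, and message wording are placeholders, not part of this PR):

import * as v from 'valibot'

let encoder: TextEncoder

// Hypothetical user-land action: a fast "is it under the byte limit?" check
// that skips encoding whenever the UTF-16 length alone already decides the answer
function fastMaxBytes(requirement: number) {
  let array: Uint8Array
  return v.check((input: string) => {
    if (input.length > requirement) return false
    if (input.length * 3 <= requirement) return true
    encoder ??= new TextEncoder()
    array ??= new Uint8Array(requirement)
    return encoder.encodeInto(input, array).read === input.length
  }, `Expected a string of at most ${requirement} UTF-8 bytes`)
}

const Schema = v.pipe(v.string(), fastMaxBytes(1024))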

@fabian-hiller added the workaround label Aug 10, 2024
@fabian-hiller force-pushed the main branch 5 times, most recently from 41ae798 to 4a7ee57 on September 1, 2024 at 21:07
@fabian-hiller force-pushed the main branch 2 times, most recently from 7a97dd0 to b5f7f4d on October 3, 2024 at 17:45