Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote-ext: fix state download stall on slow connections and reduce memory usage #1295

Merged
merged 16 commits into from
Oct 10, 2023

Conversation

liamaharon
Copy link
Contributor

@liamaharon liamaharon commented Aug 30, 2023

Original PR paritytech/substrate#14746


Fixing stall

Introduction

I experienced an apparent stall downloading state from https://rococo-try-runtime-node.parity-chains.parity.io:443 which was having networking difficulties only responding to my JSONRPC requests with 50-200KB/s of bandwidth.

This PR fixes the issue causing the stall, and generally improves performance remote-ext when it downloads state by greatly reducing the chances of a timeout occuring.

Description

Introduces a new REQUEST_DURATION_TARGET constant and modifies get_storage_data_dynamic_batch_size to

  • Increase or decrease the batch size of the next request depending on whether the elapsed time of the last request was gt or lt the target
  • Reset the batch size to 1 if the request times out

This fixes an issue on slow connections that can otherwise cause multiple timeouts and a stalled download when:

  1. The batch size increases rapidly as remote-ext downloads keys with small associated storage values
  2. remote-ext tries to process a large series of subsequent keys all with extremely large associated storage values (Rococo has a series of keys 1-5MB large)
  3. The huge storage values download for 5 minutes until the request times out
  4. The partially downloaded keys are thrown out and remote-ext tries again with a smaller batch size, but the batch size is still far too large and takes 5 minutes to be reduced again
  5. The download will be essentially stalled for many hours while the above step cycles

After this PR, the request size will

  • Not grow as large to begin with, as it is regulated downwards as the request duration exceeds the target
  • Drop immediately to 1 if the request times out. A timeout indicates the keys next in line to download have extremely large storage values compared to previously downloaded keys, and we need to reset the batch size to figure out what our new ideal batch size is. By not resetting down to 1, we risk the next request timing out again.

Reducing memory

As suggested by @bkchr, I adjusted get_storage_data_dynamic_batch_size from being recursive to a loop which allows removing a bunch of clones that were chewing through a lot of memory. I noticed actually it was using up to 50GB swap previously when downloading Polkadot keys on a slow connection, because it needed to recurse and clone a lot.

After this change it uses only ~1.5GB memory.

@liamaharon liamaharon added the T0-node This PR/Issue is related to the topic “node”. label Aug 30, 2023
@liamaharon liamaharon requested review from bkchr and niklasad1 August 30, 2023 03:25
@liamaharon liamaharon changed the title remote-ext: improve state download performance on slow connections remote-ext: fix state download stall on slow connections Aug 30, 2023
@liamaharon liamaharon changed the title remote-ext: fix state download stall on slow connections remote-ext: fix state download stall on slow connections and reduce memory usage Sep 21, 2023
@liamaharon liamaharon marked this pull request as draft September 21, 2023 14:43
@liamaharon liamaharon marked this pull request as ready for review September 25, 2023 15:02
@liamaharon liamaharon requested a review from bkchr September 25, 2023 15:02
@liamaharon
Copy link
Contributor Author

@bkchr thanks for your suggestions. I've refactored the recursive function to be iterative and implemented your suggestions, and the memory usage is orders of magnitude lower now.

retries = 0;
batch_response
},
Err(e) => {
Copy link
Member

@niklasad1 niklasad1 Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally you should check the kind of error you get back here...

If it' just Errror::Transport(Transport::RequestTooBig) or Error::RequestTimeout then it shouldn't be a problem to continue but unfortunately it's not a concrete type you would need to downcast which is PITA... https://docs.rs/jsonrpsee-http-client/0.20.1/src/jsonrpsee_http_client/transport.rs.html#212

@liamaharon liamaharon merged commit 55f3544 into master Oct 10, 2023
8 checks passed
@liamaharon liamaharon deleted the liam/remote-ext-target-request-time branch October 10, 2023 14:42
ordian added a commit that referenced this pull request Oct 12, 2023
* master: (33 commits)
  ci: set CI_IMAGE back to (now updated) .ci-unified (#1854)
  ci: bump ci image to rust 1.73.0 (#1830)
  Refactor Identity to benchmark v2 (#1838)
  PVF worker: bump landlock, update ABI docs (#1850)
  Xcm emulator nits (#1649)
  Fixes path issue in derive-impl (#1823)
  upgrade to macro_magic 0.4.3 (#1832)
  Use safe math when pruning statuses (#1835)
  remote-ext: fix state download stall on slow connections and reduce memory usage (#1295)
  Update testnet bootnode dns name (#1712)
  [FRAME] Warn on unchecked weight witness (#1818)
  [xcm] Use `Weight::MAX` for `reserve_asset_deposited`, `receive_teleported_asset` benchmarks (#1726)
  Update bridges subtree (#1803)
  Check for parent of first ready block being on chain (#1812)
  Make CheckNonce refuse transactions signed by accounts with no providers (#1578)
  Fix Asset Hub collator crashing when starting from genesis (#1788)
  Mixnet integration (#1346)
  [xcm-emulator] Decouple the `AccountId` type from `AccountId32` (#1458)
  Treasury spends various asset kinds (#1333)
  chore: bump zombienter version (#1806)
  ...
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
…emory usage (paritytech#1295)

Original PR paritytech/substrate#14746

---

## Fixing stall

### Introduction
I experienced an apparent stall downloading state from
`https://rococo-try-runtime-node.parity-chains.parity.io:443` which was
having networking difficulties only responding to my JSONRPC requests
with 50-200KB/s of bandwidth.

This PR fixes the issue causing the stall, and generally improves
performance remote-ext when it downloads state by greatly reducing the
chances of a timeout occuring.

### Description
Introduces a new `REQUEST_DURATION_TARGET` constant and modifies
`get_storage_data_dynamic_batch_size` to

- Increase or decrease the batch size of the next request depending on
whether the elapsed time of the last request was gt or lt the target
- Reset the batch size to 1 if the request times out

This fixes an issue on slow connections that can otherwise cause
multiple timeouts and a stalled download when:

1. The batch size increases rapidly as remote-ext downloads keys with
small associated storage values
2. remote-ext tries to process a large series of subsequent keys all
with extremely large associated storage values (Rococo has a series of
keys 1-5MB large)
3. The huge storage values download for 5 minutes until the request
times out
4. The partially downloaded keys are thrown out and remote-ext tries
again with a smaller batch size, but the batch size is still far too
large and takes 5 minutes to be reduced again
5. The download will be essentially stalled for many hours while the
above step cycles


After this PR, the request size will

- Not grow as large to begin with, as it is regulated downwards as the
request duration exceeds the target
- Drop immediately to 1 if the request times out. A timeout indicates
the keys next in line to download have extremely large storage values
compared to previously downloaded keys, and we need to reset the batch
size to figure out what our new ideal batch size is. By not resetting
down to 1, we risk the next request timing out again.

## Reducing memory

As suggested by @bkchr, I adjusted `get_storage_data_dynamic_batch_size`
from being recursive to a loop which allows removing a bunch of clones
that were chewing through a lot of memory. I noticed actually it was
using up to 50GB swap previously when downloading Polkadot keys on a
slow connection, because it needed to recurse and clone a lot.

After this change it uses only ~1.5GB memory.
bkchr pushed a commit that referenced this pull request Apr 10, 2024
Substrate: 31d90c2
Polkadot: 60df3c5
Cumulus: a9630551c2cd877952ab769c862af4c81b0ccd3c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants