
The way of generating the genesis state root for the domain instance is not deterministic #1798

Closed
NingLin-P opened this issue Aug 10, 2023 · 16 comments · Fixed by #1824
Labels: execution (Subspace execution)

@NingLin-P
Member

When a domain is instantiated on the consensus chain runtime, we will generate the genesis state root of the domain instance through a host function and use that to construct and import the genesis ER into the block tree. The host function generates the genesis state root by building the genesis block with the RuntimeGenesisConfig and getting the state root from the header of the genesis block:
https://github.com/subspace/subspace/blob/ff48b6e5bc1a6b24c3d89069de2769638cf4c317/crates/subspace-node/src/domain.rs#L108-L118

After the domain is instantiated, when someone later starts an operator node, the node will also construct and import the genesis block in the same way while constructing its client. The RuntimeGenesisConfig it uses is fetched from the consensus chain state and is also the same:
https://github.com/paritytech/substrate/blob/884a288caf2dfb29333ad2676c2281d31d543e49/client/service/src/builder.rs#L147-L152
https://github.com/paritytech/substrate/blob/884a288caf2dfb29333ad2676c2281d31d543e49/client/service/src/client/client.rs#L399

These two genesis state roots are expected to be the same; otherwise, when the operator tries to submit its local genesis ER, it will fail due to a mismatch with the genesis state root in the consensus chain state. Unfortunately, the way we derive the genesis state root is not deterministic, on both the host function side and the operator client side.

This branch adds a minor change to the evm domain test runtime (bumping the spec_version from 0 to 1) and adds logging to demonstrate the issue. Running any domain test on this branch will fail due to a genesis ER mismatch (e.g. cargo test --package domain-client-operator --lib -- tests::test_domain_block_production --exact --nocapture), and from the log you can see that even with the same RuntimeGenesisConfig the resulting genesis storage is different.

My guess at the root cause is that the call to <RuntimeGenesisConfig as BuildStorage>::build_storage is not deterministic. This trait function is implemented for the RuntimeGenesisConfig inside the construct_runtime macro, so the implementation can differ from runtime to runtime, but it is not part of the runtime itself and is called outside of it. Which implementation is used is determined by which runtime is compiled into the host function/operator node.

In the example branch above, the host function is compiled with the production evm runtime (spec_version = 0) while the operator is compiled with the test evm runtime (spec_version = 1), so they produce different state roots. If we also change the spec_version of the production evm runtime to 1, all the tests pass again.

This issue will cause problems when we try to upgrade the runtime version, as the example above shows. One solution is to store the genesis storage directly in the consensus runtime instead of the RuntimeGenesisConfig, so the operator won't need to call <RuntimeGenesisConfig as BuildStorage>::build_storage and can just use the genesis storage to construct the genesis block.
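The non-determinism can be sketched with a toy model (these are hypothetical stand-ins, not the real Substrate types): two differently compiled "runtimes" carry different build_storage implementations, so the same config yields different storage and hence different roots, whereas re-hashing a stored raw storage map is always stable.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for `RuntimeGenesisConfig`.
struct GenesisConfig {
    balance: u64,
}

type Storage = BTreeMap<Vec<u8>, Vec<u8>>;

// "spec_version = 0" build: stores only the balance.
fn build_storage_v0(cfg: &GenesisConfig) -> Storage {
    let mut s = Storage::new();
    s.insert(b"balance".to_vec(), cfg.balance.to_le_bytes().to_vec());
    s
}

// "spec_version = 1" build: same config, but the genesis-build code changed
// between releases (mimicking a modified `#[pallet::genesis_build]`).
fn build_storage_v1(cfg: &GenesisConfig) -> Storage {
    let mut s = build_storage_v0(cfg);
    s.insert(b"spec_version".to_vec(), 1u32.to_le_bytes().to_vec());
    s
}

// Toy "state root": a hash over the sorted key-value pairs (the real root is
// a trie root, but the determinism argument is the same).
fn state_root(storage: &Storage) -> u64 {
    let mut h = DefaultHasher::new();
    for (k, v) in storage {
        k.hash(&mut h);
        v.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let cfg = GenesisConfig { balance: 100 };
    // Same input config, different `build_storage` implementations -> different roots.
    assert_ne!(
        state_root(&build_storage_v0(&cfg)),
        state_root(&build_storage_v1(&cfg))
    );
    // A stored raw storage map, by contrast, always re-hashes to the same root.
    let raw = build_storage_v0(&cfg);
    assert_eq!(state_root(&raw), state_root(&raw.clone()));
}
```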

cc @vedhavyas

@vedhavyas
Member

In the example branch above, the host function is compiled with the production evm runtime (spec_version = 0) while the operator is compiled with the test evm runtime (spec_version = 1), so they produce different state roots. If we also change the spec_version of the production evm runtime to 1, all the tests pass again.

Isn't this what we actually want?
The consensus chain has Runtime#1, and it uses that runtime to generate the state root.
So the domain node operator should use this runtime to generate the state root instead of whatever they might actually have, since they may carry an updated runtime version.

The solution would be to use the same runtime in both scenarios @NingLin-P

@NingLin-P
Member Author

The solution would be to use the same runtime on both scenarios

The runtime I am referring to in "compiled with the production evm runtime" is the native runtime, not the wasm runtime. We are always using the same wasm runtime, but as I mentioned above, the <RuntimeGenesisConfig as BuildStorage>::build_storage implementation is determined by the native runtime the node is compiled with, and this function is not something like a runtime API that we can call from the wasm runtime.

If we want to use the same native runtime, that literally means using the same release snapshot forever.

@nazar-pc
Member

I'm not sure why anything would use the native runtime. Not only do we specify to use the WASM runtime in the CLI, native execution is going away entirely in the future. Also, Substrate ensures the native runtime is only ever used if it matches the WASM runtime version (including spec version); otherwise it automatically falls back to WASM. I also don't understand what the consensus runtime has to do with this.

For context, in the latest Substrate we have imported, the genesis config can already be created within the runtime itself; it no longer requires the std feature. So given a wasm blob that generates its own genesis config, the genesis state should be 100% deterministic. If it weren't, it would be an issue on all Substrate-based chains too, but it is not.

Actually, I think we might even want to upload the genesis state as a blob during domain registration rather than introduce a host function. A host function should always be the last resort.

@NingLin-P
Member Author

I'm not sure why would anything use native runtime.

We are not directly using the native runtime in the sense of executing an extrinsic or invoking a runtime API. What I mean above is that the GenesisRuntimeConfig and its implementations of the BuildStorage and Default traits are defined within the runtime (with std) and used on the consensus chain client side; the return values of GenesisRuntimeConfig::build_storage and GenesisRuntimeConfig::default are determined by the domain runtime that is compiled into the consensus chain client.

For context, genesis config is already possible to create within runtime itself in latest Substrate we have imported, it no longer requires std feature. So given wasm blob that generates genesis config itself, genesis state should be 100% deterministic. If it wasn't, it would be an issue on all Substrate-based chains too, but it is not.

This is something new to me, but it seems the BuildStorage trait implementation still requires std. To generate the genesis state with the given runtime wasm blob and GenesisRuntimeConfig, we may need something like block execution that takes the GenesisRuntimeConfig as input, executes it in the runtime, and outputs the Storage.

Actually, I think we might even want to upload the genesis state as a blob during domain registration rather than introducing a host function. Host function should always be the last resort.

That would require the user to generate the genesis state manually, and we would be unable to check whether it is well formed.

@nazar-pc
Member

We are not directly using the native runtime in the sense of executing an extrinsic or invoking a runtime API. What I mean above is that the GenesisRuntimeConfig and its implementations of the BuildStorage and Default traits are defined within the runtime (with std) and used on the consensus chain client side; the return values of GenesisRuntimeConfig::build_storage and GenesisRuntimeConfig::default are determined by the domain runtime that is compiled into the consensus chain client.

This makes no sense to me so far. I don't see GenesisRuntimeConfig in the Substrate codebase and I don't see it in our codebase. Can you clarify what you are talking about?

This is something new to me, but it seems the BuildStorage trait implementation still requires std. To generate the genesis state with the given runtime wasm blob and GenesisRuntimeConfig, we may need something like block execution that takes the GenesisRuntimeConfig as input, executes it in the runtime, and outputs the Storage.

Storage building is a deterministic process given the same input. I don't understand why you think it is not; that would break, as I mentioned, all Substrate-based chains.

That will require the user to generate the genesis state manually and we are unable to check if it is well formatted.

It must decode properly; as long as it does and the user pays the fee, we shouldn't care.

@NingLin-P
Member Author

This makes no sense to me so far. I don't see GenesisRuntimeConfig in the Substrate codebase and I don't see it in our codebase. Can you clarify what you are talking about?

Take the following as an entry point for how GenesisRuntimeConfig is used to build the genesis state:

https://github.com/paritytech/substrate/blob/master/client/service/src/builder.rs#L144
https://github.com/subspace/subspace/blob/main/crates/subspace-node/src/domain.rs#L108

Storage building is a deterministic process given the same input. I don't understand why you think it is not; that would break, as I mentioned, all Substrate-based chains.

Indeed, and the problem is that we can't guarantee the same input: the wasm runtime is not involved during the storage-building process, so the input is effectively the native runtime compiled into the consensus chain client. Guaranteeing the same input would mean using the same release snapshot forever.

That is what I mean by introducing some kind of block execution mechanism to get the wasm runtime involved in the storage-building process.

@vedhavyas
Member

vedhavyas commented Aug 17, 2023

It must decode properly; as long as it does and the user pays the fee, we shouldn't care.

We went with a host function so that the runtime knows what's in the genesis config, instead of relying on the domain creator not to add, for example, sudo to the genesis state of the domain. Using the host function makes it easy to create a domain instance without having to deal with misbehavior and a potential fraud proof.

Indeed, and the problem is that we can't guarantee the same input: the wasm runtime is not involved during the storage-building process, so the input is effectively the native runtime compiled into the consensus chain client. Guaranteeing the same input would mean using the same release snapshot forever.

IIUC, given a genesis config and a runtime, we should be able to deterministically generate the state root. If, as mentioned above, the spec_version is incremented and recompiled, and state root generation relies on std, then I believe this is an issue with Substrate.

It would be great to know what exactly is causing the change in the state root @NingLin-P

@nazar-pc
Member

Take the following as an entry point for how GenesisRuntimeConfig is used to build the genesis state:

https://github.com/paritytech/substrate/blob/master/client/service/src/builder.rs#L144
https://github.com/subspace/subspace/blob/main/crates/subspace-node/src/domain.rs#L108

Okay, so it is RuntimeGenesisConfig and not GenesisRuntimeConfig; that is why I was not able to find it. Of course evm_domain_runtime::RuntimeGenesisConfig::default() will return a freshly compiled config, but I do not know what raw_genesis_config is, why it is optional, and why it is patched with different code and a domain ID afterwards. It all looks confusing to me so far; I do not have the context.

Indeed, and the problem is that we can't guarantee the same input: the wasm runtime is not involved during the storage-building process, so the input is effectively the native runtime compiled into the consensus chain client. Guaranteeing the same input would mean using the same release snapshot forever.

Why is WASM not involved? The WASM included in the chain spec is the only thing from which the chain must always start. If a domain doesn't have a chain spec, it must have something equivalent. It makes no sense to me that it just blindly calls into whatever native runtime is part of the current binary.

@NingLin-P
Member Author

.. state root generation relies on std ..

This can be confirmed in the code here https://github.com/paritytech/substrate/blob/master/frame/support/procedural/src/pallet/expand/genesis_build.rs#L38-L41

I guess Substrate has some way to handle a mismatched genesis state root during sync; perhaps @nazar-pc would know.

but I do not know what raw_genesis_config is

Sorry for the typo @nazar-pc; raw_genesis_config is the RuntimeGenesisConfig encoded in JSON format. It is set to Some for the genesis domain instance and None for other domain instances created by users (RuntimeGenesisConfig::default will be used in that case, and we may allow a user-provided RuntimeGenesisConfig in the future).

why to patch it with different code and domain ID afterwards. It all looks confusing to me so far, I do not have the context.

The runtime code and domain id of a domain instance are already determined and maintained in the consensus chain state; we don't want the RuntimeGenesisConfig to contain a different runtime or domain id, so we just reset them with the runtime code and domain id from the consensus chain state. Another, more important reason is that we don't want to store the domain runtime code in the consensus chain state twice, so the runtime code is removed from raw_genesis_config and thus we need to reset it.

Why is WASM not involved? WASM that is included in chain spec is the only thing from which chain must always start.

Okay, it is involved, but as an argument of build_storage; the implementation of build_storage is still provided by the native runtime compiled into the consensus chain client.

@nazar-pc
Member

This can be confirmed in the code here https://github.com/paritytech/substrate/blob/master/frame/support/procedural/src/pallet/expand/genesis_build.rs#L38-L41

I guess Substrate has some way to handle a mismatched genesis state root during sync; perhaps @nazar-pc would know.

No one is questioning that that part is done by the client. But no, there is nothing that handles a genesis root mismatch; a different genesis root is by definition a completely different chain. The fact that the operation is done in the client doesn't mean it is not deterministic. It is, given the same input, which is the case given the same chain spec, as I mentioned above.

raw_genesis_config is the RuntimeGenesisConfig encoded in JSON format. It is set to Some for the genesis domain instance and None for other domain instances created by users (RuntimeGenesisConfig::default will be used in that case, and we may allow a user-provided RuntimeGenesisConfig in the future).

This is where I believe things go wrong. You should create the genesis config with the runtime that corresponds to the domain you're planning to use. And as mentioned above, the latest Substrate version is able to do that from a no_std environment, so domains can expose a function that you can call from the host after instantiating the WASM VM and get the correct genesis state that matches the runtime of the domain.

Calling into an arbitrary native runtime of arbitrary version is why you see mismatches; fix that, and the rest should fall into place, I think.

The runtime code and domain id of a domain instance are already determined and maintained in the consensus chain state; we don't want the RuntimeGenesisConfig to contain a different runtime or domain id, so we just reset them with the runtime code and domain id from the consensus chain state. Another, more important reason is that we don't want to store the domain runtime code in the consensus chain state twice, so the runtime code is removed from raw_genesis_config and thus we need to reset it.

Well, I think it should be the opposite: you take the runtime, and you generate the genesis config with it. Or at least patch it with the runtime, since the runtime will probably not be able to return itself as part of the genesis config (unless you pass it as an argument alongside the domain ID, which would be awkward, but possible).


I think the root issue is that you need a genesis config, but a genesis config is inherently tied to the runtime. So you should really upload a genesis config or a chain spec of sorts (maybe simplified, but still) rather than just a runtime. Or the runtime must contain everything necessary to generate its own chain spec. The fact that the client then takes that, builds storage, and writes things into the database on disk is irrelevant and not a source of non-determinism.

@NingLin-P
Member Author

The fact that the operation is done in the client doesn't mean it is not deterministic. It is, given the same input, which is the case given the same chain spec, as I mentioned above.

To build the genesis state there are two inputs: the first is the RuntimeGenesisConfig (this is what the chain spec uses underneath), and the second is the build_storage implementation. My point is that we can't guarantee build_storage to be the same, since it does not come from the WASM runtime. I know it is kind of counterintuitive, so I have created this branch to demonstrate the issue; see the issue body for how.

I think the root issue is that you need genesis config, but genesis config is inherently tied to the runtime. So you should really upload genesis config or chain spec of sorts (maybe simplified, but still) rather than just a runtime.

The RuntimeGenesisConfig is specified in the chain spec; it is not generated from the runtime, and we do upload the RuntimeGenesisConfig to ensure we are using it to build the genesis state.

The fact that client then takes that and builds storage and writes things into database on disk is irrelevant and not a source of non-determinism.

The build_storage implementation consists of all the #[pallet::genesis_build] blocks of the runtime's pallets, so when the runtime changes, build_storage changes, and because it happens in the client and involves the native runtime, we can't guarantee the same build_storage implementation.

#1824 tries to fix this by building the genesis state when the domain is instantiated on the consensus node, so the operator node can use the pre-built genesis state instead of building its own.

I do realize it is still problematic, since it involves the native runtime and different versions of the consensus node may produce different genesis states. So the solution should be to make build_storage some kind of runtime API that we can invoke with the domain WASM runtime; as long as we ensure the RuntimeGenesisConfig and the runtime code are the same, the resulting genesis state must be the same.

@nazar-pc
Member

my point is that we can't guarantee build_storage to be the same, since it does not come from the WASM runtime

It is not coming from the native runtime either, if I read the code correctly; it doesn't care about the runtime at all.

6c325b3#diff-3d14087dd782f18801a80ab0168ce425161be6f42fc25dac99bc3a93116062a9R112-R115

I am not sure this is actually correct. This is probably one way to do it, but not the one we need. We should already have all storage items pre-created as key-value pairs (the way they are in the raw chain spec); then you should be able to create an empty storage and fill it with the data you actually want. Ideally, that key-value map would be part of the runtime, probably as a separate section, so then we wouldn't need to instantiate and run WASM to extract it.

The RuntimeGenesisConfig is specified in the chain spec, it is not generated from the runtime, and we do upload the RuntimeGenesisConfig to ensure we are using the RuntimeGenesisConfig to build the genesis state.

I think the problem is that you're uploading the genesis config, not the raw chain spec. They are not the same.

#1824 Try to fix this by building the genesis state when domain is instantiated on the consensus node and the operator node can use the pre-build genesis state instead of building its own.

Why does it need to build anything? Why is it not uploaded, just like a chain spec is uploaded to the repo?

I do realize it is still problematic, since it involves the native runtime and different versions of the consensus node may produce different genesis states. So the solution should be to make build_storage some kind of runtime API that we can invoke with the domain WASM runtime; as long as we ensure the RuntimeGenesisConfig and the runtime code are the same, the resulting genesis state must be the same.

We should not use the native runtime for this at all. And the version of the consensus node shouldn't matter: Polkadot was upgraded many times and they still have a chain spec that is imported and results in the same genesis state just fine. I think the solution, like with everything else, is to upload an equivalent of the genesis chain spec when the domain is created.

@NingLin-P
Member Author

It is not coming from the native runtime either, if I read the code correctly; it doesn't care about the runtime at all

RuntimeGenesisConfig is imported from the domain runtime and used in the consensus chain client, so I would say it does care about the runtime, and as I mentioned above, "We are not directly using the native runtime like executing an extrinsic or invoking a runtime API".

I am not sure this is actually correct.

The code is just used to demonstrate the issue; as you can see, the value is only used to compute a hash and print it out, to show that with the same RuntimeGenesisConfig as input we can get a different genesis state.

I think the problem is that you're uploading genesis config, not raw chain spec. They are not the same.
Why does it need to build something, why is it not uploaded just like chain spec is uploaded to the repo?

In the case of building the genesis state they are the same: if you take a look at Substrate, <ChainSpec as BuildStorage>::build_storage just invokes <RuntimeGenesisConfig as BuildStorage>::build_storage internally. #1824 does try to upload the genesis state, though.

Polkadot was upgraded many times and they still have chain spec that is imported and results in the same genesis state just fine

That is the only part I haven't figured out; I will try to demonstrate the issue with Polkadot.

@nazar-pc
Member

In the case of building the genesis state they are the same: if you take a look at Substrate, <ChainSpec as BuildStorage>::build_storage just invokes <RuntimeGenesisConfig as BuildStorage>::build_storage internally. #1824 does try to upload the genesis state, though.

But it does that only once, on a reproducible machine, and then it is stored in raw form in the raw chain spec. If we do the same, we bypass the issue in exactly the same way. That way it doesn't care about the runtime, since storage is just key-value pairs that can be easily read and written to the database.
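The "build once, store raw" flow can be sketched as follows (toy types and a toy hash instead of the real trie root, purely for illustration): the raw pairs are produced once, and any later node, whatever binary version it runs, just loads the pairs and recomputes the identical root with no build_storage logic involved.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type RawStorage = BTreeMap<Vec<u8>, Vec<u8>>;

// Toy "state root": a hash over the sorted key-value pairs (the real root is
// a merkle trie root, but the determinism argument is identical).
fn state_root(storage: &RawStorage) -> u64 {
    let mut h = DefaultHasher::new();
    for (k, v) in storage {
        k.hash(&mut h);
        v.hash(&mut h);
    }
    h.finish()
}

// Done once, e.g. when producing the raw chain spec; the resulting pairs and
// root are what get stored and distributed.
fn build_and_freeze() -> (RawStorage, u64) {
    let mut s = RawStorage::new();
    s.insert(b":code".to_vec(), vec![0xAA, 0xBB]); // placeholder wasm blob
    s.insert(b"balances:alice".to_vec(), 100u64.to_le_bytes().to_vec());
    let root = state_root(&s);
    (s, root)
}

// Any node, any binary version: import the raw pairs and recompute the root.
fn import_raw(raw: &RawStorage) -> u64 {
    state_root(raw)
}

fn main() {
    let (raw, stored_root) = build_and_freeze();
    // The recomputed root always matches the one frozen at build time.
    assert_eq!(import_raw(&raw), stored_root);
}
```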

@NingLin-P
Member Author

NingLin-P commented Aug 17, 2023

But it does that only once, on a reproducible machine, and then it is stored in raw form in the raw chain spec. If we do the same, we bypass the issue in exactly the same way.

Okay, this answers why Polkadot doesn't have the issue: Polkadot only builds the genesis state once and stores it in the raw chain spec, and as long as it uses the same raw chain spec it will use the same genesis state.

So back to our issue: the genesis state root mismatch still exists in the main branch. #1824 tried to fix it by building the genesis state on the consensus node and storing it on-chain so the operator won't need to build it again, but because every consensus node needs to build the genesis state, and the building process still invokes the native runtime, it is unreproducible, so it is still an issue. The solutions we have so far:

  • Let the user provide the genesis state when instantiating a domain
    • Bad UX, but simplest for the developer; we can even get rid of the GenesisReceiptExtension.
    • A malicious user may provide a genesis state that sets their account as the sudo account or gives it an arbitrary balance.
  • Stick to the current approach of building the domain genesis state on the consensus node when the domain is instantiated
    • Build it in no_std with the domain runtime code and RuntimeGenesisConfig in a deterministic way.
    • Could be complicated; feasible, but we don't know how yet.
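If the user-provided route were taken, the sudo concern in the first option could be mitigated by validating the uploaded raw genesis state on-chain. A minimal sketch, using a hypothetical readable key prefix (real Substrate storage keys start with twox_128 of the pallet name, not a plain string):

```rust
use std::collections::BTreeMap;

type RawGenesis = BTreeMap<Vec<u8>, Vec<u8>>;

// Hypothetical placeholder prefix for the Sudo pallet's storage; in a real
// chain this would be the twox_128 hash of the pallet name.
const SUDO_PREFIX: &[u8] = b"sudo:";

// Reject a user-supplied raw genesis state that smuggles in a Sudo key.
fn validate_raw_genesis(raw: &RawGenesis) -> Result<(), &'static str> {
    if raw.keys().any(|k| k.starts_with(SUDO_PREFIX)) {
        return Err("raw genesis contains a forbidden Sudo storage key");
    }
    Ok(())
}

fn main() {
    let mut ok = RawGenesis::new();
    ok.insert(b"balances:alice".to_vec(), vec![1]);
    assert!(validate_raw_genesis(&ok).is_ok());

    let mut bad = ok.clone();
    bad.insert(b"sudo:key".to_vec(), vec![2]);
    assert!(validate_raw_genesis(&bad).is_err());
}
```

This only catches known-bad prefixes, of course; arbitrary balances would need a richer policy, which is part of why the host-function approach was preferred.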

@nazar-pc
Member

Why not use something like a chain spec that you upload, instead of a raw runtime blob?

@NingLin-P NingLin-P added the execution Subspace execution label Sep 1, 2023
@NingLin-P NingLin-P moved this to In Progress in Execution Domains Sep 1, 2023
@NingLin-P NingLin-P moved this from In Progress to In Review in Execution Domains Sep 5, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Execution Domains Oct 3, 2023