Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky zombienet test zombienet-cumulus-0002-pov_recovery #6573

Closed
Tracked by #5852
pepoviola opened this issue Nov 21, 2024 · 12 comments · Fixed by #7011
Closed
Tracked by #5852

Fix flaky zombienet test zombienet-cumulus-0002-pov_recovery #6573

pepoviola opened this issue Nov 21, 2024 · 12 comments · Fixed by #7011
Assignees
Labels
T10-tests This PR/Issue is related to tests.

Comments

@pepoviola
Copy link
Contributor

pepoviola commented Nov 21, 2024

Some examples from Tuesday

Fail to start (node three)
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7786180
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785532
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784688
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784356
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784158
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7778904
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7778369

https://grafana.teleport.parity.io/goto/YZ5Wuv7Ng?orgId=1


2024-11-19 23:27:57.675 | 0: rust_begin_unwind |  
-- | -- | --
  |   | 2024-11-19 23:27:57.676 | 1: core::panicking::panic_fmt |  
  |   | 2024-11-19 23:27:57.676 | 2: core::panicking::panic |  
  |   | 2024-11-19 23:27:57.676 | 3: futures_util::future::future::FutureExt::poll_unpin |  
  |   | 2024-11-19 23:27:57.677 | 4: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.682 | 5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.683 | 6: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.683 | 7: polkadot_overseer::spawn::{{closure}} |  
  |   | 2024-11-19 23:27:57.684 | 8: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.684 | 9: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.685 | 10: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.691 | 11: tokio::runtime::context::runtime::enter_runtime |  
  |   | 2024-11-19 23:27:57.694 | 12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll |  
  |   | 2024-11-19 23:27:57.695 | 13: tokio::runtime::task::core::Core<T,S>::poll |  
  |   | 2024-11-19 23:27:57.695 | 14: tokio::runtime::task::harness::Harness<T,S>::poll |  
  |   | 2024-11-19 23:27:57.696 | 15: tokio::runtime::blocking::pool::Inner::run |  
  |   | 2024-11-19 23:27:57.696 | note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. |  
  |   | 2024-11-19 23:27:57.697 | 2024-11-19 22:27:57.697  WARN tokio-runtime-worker polkadot_overseer: [Relaychain] Terminating due to subsystem exit subsystem="network-bridge-rx" |  
  |   | 2024-11-19 23:27:57.697 | 2024-11-19 22:27:57.697  WARN tokio-runtime-worker polkadot_overseer: [Relaychain] termination error detected, dropping but terminating the execution subsystem="network-bridge-rx" err=Canceled |  
  |   | 2024-11-19 23:27:57.697 | 2024-11-19 22:27:57.697 ERROR tokio-runtime-worker overseer: [Relaychain] subsystem finished unexpectedly subsystem=Ok(())


Job timeout (1hr)
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7786118
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785725
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785632
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785631
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785467
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785047
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7785046
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784814
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784395
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784394
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7784326
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7783865
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7783800
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7783336
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782963
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782782
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782653
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782364
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782290
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782160
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782095
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7782030
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7781962
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7781897
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7780855
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7780523
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7780495
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7780314
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7780056
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779863
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779772
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779765
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779680
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779553
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779490
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779425
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779332
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779298
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779102
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779102
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7779036
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7778876
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7778239

cc: @paritytech/sdk-node

@pepoviola pepoviola added the T10-tests This PR/Issue is related to tests. label Nov 21, 2024
@pepoviola pepoviola mentioned this issue Nov 21, 2024
16 tasks
@pepoviola
Copy link
Contributor Author

pepoviola commented Dec 5, 2024

Failures from yesterday

https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866513
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866450
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866294
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866258
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866195
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866102
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866101
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866100
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866099
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866067
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7866004
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865941
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865878
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865748
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865684
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865620
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865414
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865250
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865187
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865159
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7865123
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864913
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864786
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864595
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864567
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864531
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864396
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864333
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864270
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864142
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7864078
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7863817
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7863632
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7863432
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7863309
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7863238
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862986
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862953
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862888
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862860
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862856
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862836
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862801
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862707
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862483
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7862396
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7861851
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7861715
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7861585
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7861165
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7861035
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7860901
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7860101
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7860072
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7860038
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859908
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859845
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859782
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859716
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859688
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859648
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859620
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859581
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859516
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859488
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859487
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859449
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859382
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859317
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859288
https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7859053

cc: @paritytech/sdk-node

@skunert
Copy link
Contributor

skunert commented Dec 16, 2024

This zombienet test also fails locally consistently at registering the parachain. The tx sent here is not finalized.

validator-0: js-script ./register-para.js with "2000" within 240 seconds

This is really strange. I found the issue was introduced with #6400. Not sure why removing the network starter causes grandpa issue. Reverting that PR makes the test pass again.

@paritytech/networking Please take a deeper look into this.

@bkchr
Copy link
Member

bkchr commented Dec 16, 2024

@skunert can you please more logs? What you mean by the "tx is not finalized"? Only the tx never makes it into a block or does finality doesn't work at all?

@pepoviola
Copy link
Contributor Author

Finalization doesn't work at all when we use the reserved flags
#6773 (comment)

Thx!

@skunert
Copy link
Contributor

skunert commented Dec 16, 2024

@bkchr assembled some more logs:

All node logs:
working.zip

broken.zip

@dmitry-markin
Copy link
Contributor

Let me clarify, the finalization is stalled if and only if both is true:

  1. Remove network starter that is no longer needed #6400 is in place
  2. --reserved-only is passed?

@skunert
Copy link
Contributor

skunert commented Dec 16, 2024

Let me clarify, the finalization is stalled if and only if both is true:

  1. Remove network starter that is no longer needed #6400 is in place

  2. --reserved-only is passed?

That is the working assumption. In my tests I focused on finding the commit that broke it. Did not verify the involvement of the reserved only flag.

@pepoviola
Copy link
Contributor Author

pepoviola commented Dec 16, 2024

I just deployed to networks ( from latest image of master), with --reserved-only the finalization doesn't work at all, here are the logs https://grafana.teleport.parity.io/goto/S1zBJdIHR?orgId=1

And without the flag finalization works as expected, here are the logs https://grafana.teleport.parity.io/goto/ss2DJdIHg?orgId=1

I will deploy other networks (without #6400) to check if the behavior is the same.

Thx!

@pepoviola
Copy link
Contributor Author

Small update, I deployed the same config with the previous commit from master (05ad5475dec748a8a30685bc29a7caba6e63c7ab) and both cases works as expected. Here are the losg:
With reserved https://grafana.teleport.parity.io/goto/rF45nFINg?orgId=1
Without https://grafana.teleport.parity.io/goto/bvu5nKINR?orgId=1

Thx!

@dmitry-markin
Copy link
Contributor

dmitry-markin commented Dec 24, 2024

Why are we passing --reserved-only in the first place? Even with the network starter present finalization is lagging behind. On average only every second/third block is finalized: #0, #3, #5, #8, #10, and so on. This is likely because the validators are only syncing blocks through the bootnode, because they are only connected to it on the sync protocol.

Without --reserved-only all the 14 nodes (bootnode + 13 validators) are connected to each other on the sync protocol, and in this case every block is finalized: #0, #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, and so on.

@dmitry-markin
Copy link
Contributor

What happens here is that due to removed network starter networking starts first and connects to reserved peers on default (sync) protocol before grandpa-voter task finishes initialization, and the latter misses SyncEvent::PeerConnected(peer_id) event.

Because we abuse the reserved nodes mechanism on non-default protocols to track sync peers, GossipEngine in grandpa listens to SyncEvent::PeerConnected(peer_id) events to add such peers as reserved on /grandpa/1 notifications protocol. And because grandpa-voter task is not ready yet to receive sync events, it misses this event and does not add the peer as reserved effectively never connecting to it on /grandpa/1 notifications protocol.

When --reserved-only is not specified, networking also discovers peers later via Kademlia, and for those peers grandpa's GossipEngine is ready to receive sync events.

This issue can potentially manifest in the wild if a node is configured to connect only to --reserved-nodes, and connections are established fast enough for additional tasks to not finish the initialization.

@dmitry-markin
Copy link
Contributor

The issue is fixed by #7011.

github-merge-queue bot pushed a commit that referenced this issue Dec 30, 2024
To simplify debugging of peerset related issues like
#6573 (comment).

---------

Co-authored-by: command-bot <>
github-merge-queue bot pushed a commit that referenced this issue Dec 30, 2024
Introduce `SyncEvent::InitialPeers` message sent to new subscribers to
allow them correctly tracking sync peers. This resolves a race condition
described in
#6573 (comment).

Fixes #6573.

---------

Co-authored-by: command-bot <>
Co-authored-by: Bastian Köcher <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T10-tests This PR/Issue is related to tests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants