Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent potentially misleading error message about "no heartbeat" #2180

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

okurz
Copy link
Contributor

@okurz okurz commented May 16, 2024

In the context of
github.com/os-autoinst/openQA
we observed error messages coming from Mojo::Server::Prefork
when otherwise there is no harmful effect observed. As the underlying
prefork module should only inform about a potentially erroneous
situation but can not know about the effect on any component that uses
the prefork module the log message instead should not claim an error but
rather only inform. Hence this commit reduces the log level from error
to info accordingly.

Related progress issue: https://progress.opensuse.org/issues/138536

In the context of
github.com/os-autoinst/openQA
we observed error messages coming from Mojo::Server::Prefork
when otherwise there is no harmful effect observed. As the underlying
prefork module should only inform about a potentially erroneous
situation but can not know about the effect on any component that uses
the prefork module the log message instead should not claim an error but
rather only inform. Hence this commit reduces the log level from error
to info accordingly.

Related progress issue: https://progress.opensuse.org/issues/138536
@guest20
Copy link

guest20 commented May 16, 2024

Is there a reason it's not possible for y'all to extended the timeout it with the attribute "heartbeat_timeout" in Mojo::Server::Prefork since your application requires it?

@okurz
Copy link
Contributor Author

okurz commented May 16, 2024

Is there a reason it's not possible for y'all to extended the timeout it with the attribute "heartbeat_timeout" in Mojo::Server::Prefork since your application requires it?

yes, we already use 900s. As far as I understood the evaluation in https://progress.opensuse.org/issues/106759 and https://progress.opensuse.org/issues/138536 regarding how this is used within the context of github.com/os-autoinst/openQA/ we will run into the timeout regardless. Probably something in the implementation is sub-optimal but anyway I still think the message with a log level of "error" is too strict

@kraih
Copy link
Member

kraih commented May 16, 2024

Why info over warn?

@kraih kraih requested review from a team, marcusramberg, kraih, christopherraa and Grinnz May 16, 2024 13:29
@okurz
Copy link
Contributor Author

okurz commented May 16, 2024

Why info over warn?

I would say same reasoning because I think this component can not know if the condition is harmful to the caller but if you guys prefer "warn" that would also help us.

@guest20
Copy link

guest20 commented May 27, 2024

Am I misunderstanding what the log message means?

Something blocked the worker for more than the heartbeat timeout and the supervisor has decided to kill that worker, causing anybody with a pending request to that worker to never receive any more bytes in response... Is that the situation this message warns about?

@kraih
Copy link
Member

kraih commented Jun 13, 2024

Am I misunderstanding what the log message means?

Something blocked the worker for more than the heartbeat timeout and the supervisor has decided to kill that worker, causing anybody with a pending request to that worker to never receive any more bytes in response... Is that the situation this message warns about?

Yes, it's also in the FAQ. https://docs.mojolicious.org/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-50-seconds-restarting-mean

@guest20
Copy link

guest20 commented Jun 13, 2024

@kraih Cool, though I don't see how "I'm about to TERM a blocking worker" could ever be part of normal/safe operation?

Is there any way for the dying worker to log the requests it was stuck on when the TERM arrived? The OpenQA ticket looks like it would be a much shorter investigation if that info were in the logs...

Also, would it be worth suggesting moving the blocking code into a Mojo::IOLoop->subprocess or a Minion entry to allow the requests/heartbeats to continue while doing the long-running work?

@kraih
Copy link
Member

kraih commented Jun 13, 2024

Is there any way for the dying worker to log the requests it was stuck on when the TERM arrived? The OpenQA ticket looks like it would be a much shorter investigation if that info were in the logs...

You're welcome to try and come up with a proposal for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants