Fail to schedule on available workers (typically occurs when a bad job is submitted) #685
Replies: 1 comment
@mimran-stripe how does your agent behave when a job is unassigned/crashed? Two things worth looking at here:
We're seeing behavior where, when a bad job is submitted (e.g. one that crashes due to a NullPointerException), other jobs are not scheduled adequately. Our list of possible workers shows roughly 0. Our "DEFAULT_CLUSTER" returns the following response:
but we certainly have fewer than 28 instances attempting to spin up.
We usually see this behavior after we've determined that one job is crash-looping (it crashes, and Mantis attempts to reschedule it).
We've been able to resolve this in the past by manually restarting instances of mantis-server-agent. Do you have any suggestions on how to avoid this failure?