Fail to schedule on available workers (typically occurs when a bad job is submitted.) #685

mimran-stripe · 2024-07-05T22:10:01Z

We're seeing behavior where, when a bad job is submitted (e.g. one that is crashing due to a NullPointerException), other jobs are not getting adequately scheduled. Our list of available workers shows ~0; our "DEFAULT_CLUSTER" returns the following response:

{
  "numRegisteredTaskExecutors": 58,
  "numAvailableTaskExecutors": 0,
  "numOccupiedTaskExecutors": 28,
  "numAssignedTaskExecutors": 0,
  "numDisabledTaskExecutors": 30
}

but we certainly have fewer than 28 instances attempting to spin up.
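
To make the numbers above easier to read, here is a minimal Python sketch that parses a response of this shape and flags the "nothing available" state. The field names come from the response above. The function name `summarize_usage` is hypothetical, and treating the four states as disjoint is an assumption, not something confirmed by Mantis. In this sample the counts are consistent with that assumption (0 + 28 + 0 + 30 = 58), so every executor that is not occupied is reported as disabled.

```python
import json

# Response body copied from the cluster usage output above.
SAMPLE = """
{
  "numRegisteredTaskExecutors": 58,
  "numAvailableTaskExecutors": 0,
  "numOccupiedTaskExecutors": 28,
  "numAssignedTaskExecutors": 0,
  "numDisabledTaskExecutors": 30
}
"""


def summarize_usage(raw: str) -> dict:
    """Summarize a usage response (hypothetical helper, not a Mantis API)."""
    usage = json.loads(raw)
    registered = usage["numRegisteredTaskExecutors"]
    available = usage["numAvailableTaskExecutors"]
    occupied = usage["numOccupiedTaskExecutors"]
    assigned = usage["numAssignedTaskExecutors"]
    disabled = usage["numDisabledTaskExecutors"]

    # Assumption: the four states do not overlap, so they should sum to
    # the registered count. A non-zero remainder would contradict that.
    unaccounted = registered - (available + occupied + assigned + disabled)

    return {
        "unaccounted": unaccounted,
        "disabled_fraction": disabled / registered if registered else 0.0,
        # No executor can take new work even though some are registered.
        "starved": registered > 0 and available == 0,
    }


if __name__ == "__main__":
    print(summarize_usage(SAMPLE))
```

A check like this could run against periodic snapshots of the response to alert on `starved` before manual agent restarts become necessary.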

We usually see this behavior after we've determined that one job is crash-looping (it crashes, and Mantis attempts to reschedule it).

We've been able to resolve this in the past by manually restarting instances of mantis-server-agent. Do you have any suggestions on how to avoid this failure?

Andyz26 · 2024-07-10T17:44:44Z

Maintainer

@mimran-stripe How does your agent behave when a job is unassigned or crashes? Two things worth looking at here:

- Is the jobActor on the control plane submitting new workers to replace the crashed one?
- Were the crashed workers/agents recovered and re-registered with the control plane as available?
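
One rough way to approach the second question is to compare two usage snapshots taken before and after a crash and see whether executors moved back to the available state. The sketch below is only a heuristic under stated assumptions: `usage_delta` and `looks_recovered` are hypothetical helper names, the "after" snapshot is made-up illustrative data, and the recovery rule is not Mantis's own logic.

```python
from typing import Dict

# Counters compared between snapshots (field names from the response in this thread).
FIELDS = (
    "numRegisteredTaskExecutors",
    "numAvailableTaskExecutors",
    "numDisabledTaskExecutors",
)


def usage_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return after-minus-before for each tracked counter."""
    return {name: after[name] - before[name] for name in FIELDS}


def looks_recovered(before: Dict[str, int], after: Dict[str, int]) -> bool:
    """Heuristic: more executors available and no growth in disabled ones."""
    delta = usage_delta(before, after)
    return (
        delta["numAvailableTaskExecutors"] > 0
        and delta["numDisabledTaskExecutors"] <= 0
    )


if __name__ == "__main__":
    # Snapshot reported in this thread.
    before = {
        "numRegisteredTaskExecutors": 58,
        "numAvailableTaskExecutors": 0,
        "numDisabledTaskExecutors": 30,
    }
    # HYPOTHETICAL later snapshot, for illustration only.
    after = {
        "numRegisteredTaskExecutors": 58,
        "numAvailableTaskExecutors": 12,
        "numDisabledTaskExecutors": 18,
    }
    print(usage_delta(before, after))
    print(looks_recovered(before, after))
```

If the disabled count never drops between snapshots while a job is crash-looping, that would suggest the agents are not re-registering as available on their own.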
