[SPARK-50648][CORE] When a job is cancelled during a shuffle retry in the parent stage, it might leave behind zombie running tasks #49270
base: master
Conversation
…n parent stage, might leave behind zombie tasks
cc @cloud-fan
@@ -2937,7 +2937,9 @@ private[spark] class DAGScheduler(
       } else {
         // This stage is only used by the job, so finish the stage if it is running.
         val stage = stageIdToStage(stageId)
-        if (runningStages.contains(stage)) {
+        val isRunningStage = runningStages.contains(stage) ||
+          (waitingStages.contains(stage) && taskScheduler.hasRunningTasks(stageId))
what if we just kill all waiting stages? Does taskScheduler.killAllTaskAttempts handle it well?
and can we add a special flag to indicate the waiting stages that are submitted due to retry?
what if we just kill all waiting stages? Does taskScheduler.killAllTaskAttempts handle it well?
I tested it: if killAllTaskAttempts is called on the normally generated waiting stages, the stage status will be displayed as FAILED, whereas it was SKIPPED before; killAllTaskAttempts itself does not go wrong.
Always killing waiting stages seems to be the safer approach (those tasks really shouldn't run anymore), but it may generate unnecessary stageFailed events compared with before.
and can we add a special flag to indicate the waiting stages that are submitted due to retry?
Yes, we can add a flag; please see the updated code.
Actually, there is a trade-off in deciding which waiting stages to kill. The range of choices, from broadest to narrowest (sketched after this list):
- kill all waiting stages
- kill a waiting stage only if it has failed attempts (stage#failedAttemptIds > 0)
- kill a waiting stage only if it was resubmitted due to a fetch failure (stage#resubmitInFetchFailed)
- kill a waiting stage only if it still has running tasks (this might not be enough?)
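For illustration only, here is a self-contained sketch of how these four options could be expressed as predicates. The sets and the hasRunningTasks / resubmitInFetchFailed helpers are toy stand-ins for DAGScheduler state and for additions proposed in earlier revisions of this PR, not existing Spark APIs:

```scala
import scala.collection.mutable

// Toy stand-ins for DAGScheduler bookkeeping (the real code tracks Stage objects).
case class StageSketch(id: Int, failedAttemptIds: Set[Int], resubmitInFetchFailed: Boolean)

val runningStages = mutable.Set[Int]()            // stages with a submitted TaskSet
val waitingStages = mutable.Set[Int](2)           // e.g. a stage resubmitted after FetchFailed
def hasRunningTasks(stageId: Int): Boolean = stageId == 2 // pretend stage 2 has leftover tasks

val stage = StageSketch(id = 2, failedAttemptIds = Set(0), resubmitInFetchFailed = true)
val running = runningStages.contains(stage.id)

// Option 1 (broadest): kill every waiting stage of the cancelled job.
val killAllWaiting = running || waitingStages.contains(stage.id)
// Option 2: only waiting stages that have failed attempts.
val killIfFailedBefore = running ||
  (waitingStages.contains(stage.id) && stage.failedAttemptIds.nonEmpty)
// Option 3: only waiting stages resubmitted after a FetchFailed.
val killIfFetchFailRetry = running ||
  (waitingStages.contains(stage.id) && stage.resubmitInFetchFailed)
// Option 4 (narrowest): only waiting stages that still have running tasks.
val killIfTasksStillAlive = running ||
  (waitingStages.contains(stage.id) && hasRunningTasks(stage.id))

println(Seq(killAllWaiting, killIfFailedBefore, killIfFetchFailRetry, killIfTasksStillAlive))
// All four predicates are true in this toy setup; they differ on other waiting stages.
```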
This is a good catch! cc @jiangxb1987 @Ngone51
@@ -2248,7 +2250,7 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     // original result task 1.0 succeed
     runEvent(makeCompletionEvent(taskSets(1).tasks(1), Success, 42))
     sc.listenerBus.waitUntilEmpty()
-    assert(completedStage === List(0, 1, 1, 0))
+    assert(completedStage === List(0, 1, 1, 0, 1))
The last entry added is the resubmitted stage (due to FetchFailed), which is in the waiting stages. We will kill it, so one more SparkListenerStageCompleted event is added (see markStageAsFinished).
After changing the check to failedAttemptIds, it won't be cancelled, because all tasks have already finished and markStageAsFinished removes the failedAttemptIds when there is no error message.
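A minimal, self-contained sketch of the behaviour described above, assuming (as stated here) that markStageAsFinished clears a stage's failedAttemptIds when it is called without an error message; the class below is a toy stand-in, not the real Stage/DAGScheduler code:

```scala
import scala.collection.mutable

// Toy stand-in for org.apache.spark.scheduler.Stage.
class StageSketch(val id: Int) {
  val failedAttemptIds = mutable.HashSet[Int]()
}

// Condensed model: finishing a stage successfully (no error message) clears its failure
// history, so a later failedAttemptIds.nonEmpty check will no longer select it for killing.
def markStageAsFinishedSketch(stage: StageSketch, errorMessage: Option[String]): Unit = {
  if (errorMessage.isEmpty) {
    stage.failedAttemptIds.clear() // success: forget previous failed attempts
  }
  // The real method also posts SparkListenerStageCompleted and updates runningStages.
}

val resultStage = new StageSketch(id = 1)
resultStage.failedAttemptIds += 0             // a previous attempt hit FetchFailed
markStageAsFinishedSketch(resultStage, None)  // all tasks succeeded
assert(resultStage.failedAttemptIds.isEmpty)  // so it won't be picked for cancellation
```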
@yabola I'm curious about the difference here. With the current approach, doesn't the stage still have to be killed because of failedAttemptIds.nonEmpty?
Let me explain the timeline of the last event in this UT:
- The map stage is running and the result stage is waiting (result task 1.0 is running).
- In the UT, result task 1.0 succeeds (no running tasks remain in the result stage).
- In handleTaskCompletion, the result stage goes through markStageAsFinished, which cleans the result stage's failedAttemptIds.
- cancelRunningIndependentStages cancels the map stage (it is in runningStages). The result stage is waiting but has no failedAttemptIds, so it won't be killed in cancelRunningIndependentStages (and it also has no running tasks).

In this UT there is really no need to kill the last result stage.
In addition, the result stage will always kill all of its tasks when it succeeds, so we don't have to worry about this. Please see here:
taskScheduler.killAllTaskAttempts(
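For context, a hedged paraphrase of the success path referenced above: when the last partition of a result stage finishes, the scheduler marks the stage as finished and asks the task scheduler to kill any remaining task attempts for that stage. The snippet below is a simplified, self-contained model, not the actual DAGScheduler code:

```scala
// Toy model: once every partition of the result stage has succeeded, any attempts
// still running for that stage are killed, so a finished result stage leaves no zombies.
trait TaskSchedulerSketch {
  def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit
}

class LoggingTaskScheduler extends TaskSchedulerSketch {
  override def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit =
    println(s"killAllTaskAttempts(stage=$stageId): $reason")
}

def onResultTaskSuccess(finishedPartitions: Int, totalPartitions: Int, stageId: Int,
    taskScheduler: TaskSchedulerSketch): Unit = {
  if (finishedPartitions == totalPartitions) {
    // The real scheduler calls markStageAsFinished(resultStage) here, then:
    taskScheduler.killAllTaskAttempts(stageId, interruptThread = false, reason = "Stage finished")
  }
}

onResultTaskSuccess(finishedPartitions = 2, totalPartitions = 2, stageId = 1,
  new LoggingTaskScheduler)
```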
OK, sounds good to me.
@@ -2937,7 +2938,9 @@ private[spark] class DAGScheduler(
       } else {
         // This stage is only used by the job, so finish the stage if it is running.
         val stage = stageIdToStage(stageId)
-        if (runningStages.contains(stage)) {
+        val shouldKill = runningStages.contains(stage) ||
+          (waitingStages.contains(stage) && stage.resubmitInFetchFailed)
Shall we check the failedAttemptIds instead?
-          (waitingStages.contains(stage) && stage.resubmitInFetchFailed)
+          stage.failedAttemptIds.nonEmpty
Yes, we can. Please see this.
I'm not sure which way is better.
I like @Ngone51's suggestion better: simply check for stage.failedAttemptIds.nonEmpty || runningStages.contains(stage).
I can see an argument being made for failed stages as well.
With this, the PR will boil down to this change, plus tests to stress this logic of course.
Only considering failedAttemptIds may result in repeated kill calls for a stage that has already completed and failed.
It looks like there could be a case where the stage exists in failedStages but not in waitingStages. For example, in the case of fetch failures, the map stage and the reduce stage can both be added to failedStages, but the related job could be cancelled before they are resubmitted. So adding waitingStages.contains(stage) would miss the stages in failedStages. And I don't think we would have repeated calls, as we don't kill tasks for those failed stages.
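To make the outcome concrete, here is a minimal, self-contained sketch of the check this thread converges on: kill a stage on job cancellation if it is still running or if it has failed attempts recorded, which also covers stages sitting in failedStages waiting to be resubmitted. The class and sets below are toy stand-ins for DAGScheduler state, not the real types:

```scala
import scala.collection.mutable

// Toy stand-in for org.apache.spark.scheduler.Stage bookkeeping.
class StageSketch(val id: Int) { val failedAttemptIds = mutable.HashSet[Int]() }

val runningStages = mutable.Set[StageSketch]()

// A reduce stage that hit a FetchFailed: it was removed from runningStages and sits in
// failedStages (not yet resubmitted), but it still carries its failed attempt id.
val reduceStage = new StageSketch(id = 2)
reduceStage.failedAttemptIds += 0

// The condition discussed above: no explicit waitingStages check needed.
def shouldKillOnJobCancel(stage: StageSketch): Boolean =
  runningStages.contains(stage) || stage.failedAttemptIds.nonEmpty

assert(shouldKillOnJobCancel(reduceStage)) // caught even though it is not in runningStages
```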
Thanks for the confirmation, done
What changes were proposed in this pull request?
This is a problem that Spark has always had. See the next section for the scenario in which the problem occurs.
When a job is cancelled, some tasks may still be running.
The reason is that when DAGScheduler#handleTaskCompletion encounters a FetchFailed, markStageAsFinished will be called to remove the stage from DAGScheduler#runningStages (see spark/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala, line 2059 in 7cd5c4a: killAllTaskAttempts). But DAGScheduler#cancelRunningIndependentStages only looks at runningStages, so this leaves behind zombie shuffle tasks that occupy cluster resources.
Why are the changes needed?
Assume a job is stage1 -> stage2. When a FetchFailed occurs during stage2, both stage1 and stage2 will be resubmitted (stage2 may still have some tasks running even after it is resubmitted; this is as expected, because these tasks may eventually succeed and avoid a retry).
But if the SQL query is cancelled during the execution of the stage1 retry, the tasks of stage1 and the stage1 retry can all be killed, while the tasks previously running in stage2 keep running and cannot be killed. These tasks can greatly affect cluster stability and occupy resources.
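A self-contained toy model of this scenario (plain Scala, not Spark code; the sets are stand-ins for DAGScheduler bookkeeping) showing why a runningStages-only check misses stage2's leftover tasks on cancellation:

```scala
import scala.collection.mutable

// State at the moment the job is cancelled: the stage1 retry is running; stage2 hit a
// FetchFailed, so it left runningStages and awaits resubmission, yet some of its old
// tasks are still executing on the cluster.
val runningStages = mutable.Set("stage1-retry")
val failedAttemptIds = Map("stage1-retry" -> Set.empty[Int], "stage2" -> Set(0))
val stagesWithLiveTasks = Set("stage1-retry", "stage2")

// Old behaviour: cancellation only kills task attempts of stages in runningStages.
val zombiesBefore = stagesWithLiveTasks.filterNot(runningStages.contains)
println(s"zombies with the old check: $zombiesBefore") // Set(stage2)

// Behaviour this PR moves towards: also kill stages that recorded failed attempts.
def shouldKill(stage: String): Boolean =
  runningStages.contains(stage) || failedAttemptIds(stage).nonEmpty
val zombiesAfter = stagesWithLiveTasks.filterNot(shouldKill)
println(s"zombies with the new check: $zombiesAfter") // Set()
```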
Does this PR introduce any user-facing change?
No
Was this patch authored or co-authored using generative AI tooling?
No