[Core] Add a new interface for submitting actor tasks in batches (Batch Remote) #31
base: main
Conversation
Please add the advantages/disadvantages sections.
> ### General Motivation
> **Core Motivation**:
> 1. Improve the performance of batch calling ActorTask.
> 2. Implement Ray's native collective communication library through this interface.
We should split this REP into 2:
- batch remote in ray core REP
- introducing RAY_NATIVE mode in ray collective lib REP
Yes, this REP just adds the Batch Remote API.
> ## Summary
> ### General Motivation
> **Core Motivation**:
> 1. Improve the performance of batch calling ActorTask.
We should describe the bottlenecks, such as frequent context switching between Python and C++, and serializing the same object multiple times.
> ### Should this change be within `ray` or outside?
> This requires adding a new interface in Ray Core.
I believe this should be added as an experimental & internal API first:
`ray.experimental._batch_remote(actors).compute.remote(args)`
Yes. At first, it should be placed in the experimental module.
I guess this REP also benefits RLlib's sampling and weight-syncing aspects. @gjoliver CC
A high-level comment: can we do the optimization transparently, without introducing a new API? For example, we could cache the last args we saw; if the new args are the same, we reuse the previously serialized object ref.
@jjyao
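A plain-Python sketch of that caching idea (`CachingSubmitter` is a hypothetical illustration, not a Ray API): serialize the args once and reuse the cached payload for as long as the args stay the same.

```python
import pickle


class CachingSubmitter:
    """Hypothetical sketch: reuse the serialized form of repeated args."""

    def __init__(self):
        self._last_args = None
        self._last_payload = None
        self.serialize_calls = 0

    def _serialize(self, args):
        self.serialize_calls += 1
        return pickle.dumps(args)

    def submit(self, args):
        # Reuse the cached payload when the args match the previous
        # submission, instead of serializing once per actor call.
        if self._last_args is not None and args == self._last_args:
            return self._last_payload
        payload = self._serialize(args)
        self._last_args = args
        self._last_payload = payload
        return payload


submitter = CachingSubmitter()
for _ in range(100):  # e.g. broadcasting the same args to 100 actors
    submitter.submit((1, 2, 3))
print(submitter.serialize_calls)  # → 1
```

With this approach, the existing `actor.task.remote(args)` call sites would stay unchanged; the trade-off is that the cache only helps when consecutive calls really do pass identical args.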
Following up on the last offline meeting, please:
cc @jovany-wang
Signed-off-by: 稚鱼 <[email protected]>
Plan 1
```
batch_remote_handle = ray.experimental.batch_remote(actors)
batch_remote_handle.compute.remote(args)
```
What happens if one of the actors fails (i.e., is killed or terminated)?
Failure & Exception Scenarios:
1. An exception occurs during parameter validation or preprocessing, before the batch of ActorTasks is submitted.
Since these exceptions occur before any ActorTask is submitted, they can be handled by directly raising a specific error, as in the current behavior.
2. Some actors throw exceptions while the batch of ActorTasks is being submitted.
When traversing and submitting the ActorTasks in a loop, if one of the actors throws an exception during submission, the remaining ActorTasks are not submitted, and the exception is raised to the user.
Reason:
Submitting an ActorTask normally completes without throwing. If an error does occur, it is likely caused by a bug in the code and will require a fix.
The exception behavior of this plan is the same as the current foreach remote.
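The fail-fast semantics described in point 2 could be sketched in plain Python like this (`MockActor`, `submit_task`, and `ActorDiedError` are illustrative stand-ins, not Ray's actual internals):

```python
class ActorDiedError(RuntimeError):
    pass


class MockActor:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def submit_task(self, args):
        if not self.alive:
            raise ActorDiedError(f"actor {self.name} is dead")
        return f"ref-{self.name}"


def batch_submit(actors, args):
    """Submit to each actor in order; stop at the first failure and
    re-raise, matching the fail-fast semantics described above."""
    refs = []
    for actor in actors:
        # Raises on a dead actor, leaving later actors unsubmitted.
        refs.append(actor.submit_task(args))
    return refs


actors = [MockActor("a"), MockActor("b", alive=False), MockActor("c")]
try:
    batch_submit(actors, (1,))
except ActorDiedError as e:
    print(e)  # → actor b is dead; actor "c" was never submitted
```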
Is there any flame graph that backs this up? IIUC, the verification is done in Cython, and there's no such thing as "switching" (Cython is just C).
(It is not a blocker.) That doesn't prove it is context-switching cost, though; I feel like it is something else. Cython is just C code, so there should be no such thing as Python <-> C++ context switching, IIUC. Is it different from Java <-> C++?
The most frequent context switching is
I suppose the context switching happens when the for loop continues?
Core Motivation:
Current situation of batch calling actor tasks:
Using the new Batch Remote API:
The Batch Remote API can save the following performance costs (where N is the number of actors):
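As a rough, pure-Python illustration of the serialization saving described above (the N-fold cost model comes from this discussion; `pickle` merely stands in for Ray's serializer):

```python
import pickle

N = 1000  # number of actors receiving the same args
args = {"weights": list(range(10_000))}

# Current situation: each of the N actor.task.remote(args) calls
# serializes args independently -> N serializations.
per_call = [pickle.dumps(args) for _ in range(N)]

# Batch Remote idea: serialize once, submit the same payload N times.
payload = pickle.dumps(args)
batched = [payload] * N

assert per_call[0] == payload  # same bytes either way
print(f"serializations: {N} -> 1")
```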
[WIP][Core]Add batch remote api for batch submit actor task
ray-project/ray#35597