[Core] Add a new interface for submitting actor tasks in batches (Batch Remote) #31
base: main
Conversation
Please add the advantages/disadvantages sections.
> ### General Motivation
> **Core Motivation**:
> 1. Improve the performance of batch calling ActorTask.
> 2. Implement Ray's native collective communication library through this interface.
We should split this REP into 2:
- batch remote in ray core REP
- introducing RAY_NATIVE mode in ray collective lib REP
Yes, this REP just adds the Batch Remote API.
> ## Summary
> ### General Motivation
> **Core Motivation**:
> 1. Improve the performance of batch calling ActorTask.
We should describe the bottlenecks, such as frequent context switching between Python and C++, and serializing the same object multiple times.
> ### Should this change be within `ray` or outside?
> This requires adding a new interface in Ray Core.
I believe this should be added as an experimental & internal API first:
`ray.experimental._batch_remote(actors).compute.remote(args)`
Yes. At first, it should be placed in the experimental module.
I guess this REP also benefits RLlib's sampling and weight-syncing aspects. @gjoliver CC
A high-level comment: can we do the optimization transparently, without introducing a new API? For example, we could cache the last args we saw; if the new args are the same, we reuse the previously serialized object ref.
@jjyao
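A plain-Python sketch of that caching idea (`CachingSubmitter` is a hypothetical illustration, not a Ray API): serialize the args once and reuse the cached payload for as long as the args stay the same.

```python
import pickle


class CachingSubmitter:
    """Hypothetical sketch: reuse the serialized form of repeated args."""

    def __init__(self):
        self._last_args = None
        self._last_payload = None
        self.serialize_calls = 0

    def _serialize(self, args):
        self.serialize_calls += 1
        return pickle.dumps(args)

    def submit(self, args):
        # Reuse the cached payload when the args match the previous
        # submission, instead of serializing once per actor call.
        if self._last_args is not None and args == self._last_args:
            return self._last_payload
        payload = self._serialize(args)
        self._last_args = args
        self._last_payload = payload
        return payload


submitter = CachingSubmitter()
for _ in range(100):  # e.g. broadcasting the same args to 100 actors
    submitter.submit((1, 2, 3))
print(submitter.serialize_calls)  # → 1
```

With this approach, the existing `actor.task.remote(args)` call sites would stay unchanged; the trade-off is that the cache only helps when consecutive calls really do pass identical args.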
Following up on the last offline meeting, please:
cc @jovany-wang
Signed-off-by: 稚鱼 <[email protected]>
Plan 1
```
batch_remote_handle = ray.experimental.batch_remote(actors)
batch_remote_handle.compute.remote(args)
```
What happens if one of the actors fails (i.e., is killed or terminated)?
Failure & Exception Scenarios:
1. An exception occurs during parameter validation or preprocessing, before the batch of ActorTasks is submitted.
Since these exceptions occur before any ActorTask is submitted, they can be handled by directly raising a specific error, as in the current behavior.
2. Some actors throw exceptions while the batch of ActorTasks is being submitted.
When traversing and submitting the ActorTasks in a loop, if one of the actors throws an exception during submission, the remaining ActorTasks are not submitted, and the exception is raised to the user.
Reason:
Submitting an ActorTask normally completes without throwing. If an error does occur, it is likely caused by a bug in the code and will require a fix.
The exception behavior of this plan is the same as the current foreach remote.
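The fail-fast semantics described in point 2 could be sketched in plain Python like this (`MockActor`, `submit_task`, and `ActorDiedError` are illustrative stand-ins, not Ray's actual internals):

```python
class ActorDiedError(RuntimeError):
    pass


class MockActor:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def submit_task(self, args):
        if not self.alive:
            raise ActorDiedError(f"actor {self.name} is dead")
        return f"ref-{self.name}"


def batch_submit(actors, args):
    """Submit to each actor in order; stop at the first failure and
    re-raise, matching the fail-fast semantics described above."""
    refs = []
    for actor in actors:
        # Raises on a dead actor, leaving later actors unsubmitted.
        refs.append(actor.submit_task(args))
    return refs


actors = [MockActor("a"), MockActor("b", alive=False), MockActor("c")]
try:
    batch_submit(actors, (1,))
except ActorDiedError as e:
    print(e)  # → actor b is dead; actor "c" was never submitted
```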
Is there any flame graph that backs this up? IIUC, the verification is done in Cython, and there's no such thing as "switching" (Cython is just C).
(It is not a blocker.) That doesn't prove it is context-switching cost, though; I feel like it is something else. Cython is just C code, so there should be no such thing as Python <-> C++ context switching, IIUC. Is it different from Java <-> C++?
The most frequent context switching is
I suppose the context switching happens when the for loop continues?
Core Motivation:
Current situation of batch calling actor tasks:
Using the new Batch Remote API:
The Batch Remote API can save the following performance costs (where N is the number of actors):
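As a rough, pure-Python illustration of the serialization saving described above (the N-fold cost model comes from this discussion; `pickle` merely stands in for Ray's serializer):

```python
import pickle

N = 1000  # number of actors receiving the same args
args = {"weights": list(range(10_000))}

# Current situation: each of the N actor.task.remote(args) calls
# serializes args independently -> N serializations.
per_call = [pickle.dumps(args) for _ in range(N)]

# Batch Remote idea: serialize once, submit the same payload N times.
payload = pickle.dumps(args)
batched = [payload] * N

assert per_call[0] == payload  # same bytes either way
print(f"serializations: {N} -> 1")
```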
[WIP][Core]Add batch remote api for batch submit actor task
ray-project/ray#35597