Add labels mechanism and actor-affinity-feature proposal #13

Open
wants to merge 6 commits into
base: main

larrylian
Contributor

@larrylian larrylian commented Sep 2, 2022

1. Labels mechanism.

Introduce the Labels mechanism. Give Labels to Actors/Nodes.
Affinity features such as ActorAffinity/NodeAffinity can be realized through Labels.

Read reps/2022-11-23-labels-mechanish-and-affinity-schedule-feature.md first.

2. Actor-Affinity Feature

Provides a set of lightweight actor affinity and anti-affinity scheduling interfaces.
They replace the heavyweight PG (placement group) interface for co-locating or spreading actors.

  • Affinity
    • Co-locate the actors in the same node.
    • Co-locate the actors in the same batch of nodes, like nodes in the same zones
  • Anti-affinity
    • Spread the actors of a service across nodes and/or availability zones, e.g. to reduce correlated failures.
    • Give an actor "exclusive" access to a node to guarantee resource isolation
    • Spread the actors of different services that will affect each other on different nodes.

ActorAffinity scheduling

Contributor

@jjyao jjyao left a comment


This closely matches the k8s pod affinity/anti-affinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity).

We may want label based node affinity/anti-affinity as well in the future but I think it can be solved separately.

@jovany-wang
Contributor

For failure section, we should add more discussions on how we promise the scheduling result consensus.

I'd also like to know if it's possible to achieve the following typical goals:
distribute one actor to each node, with a guarantee that every node has exactly one actor even as nodes join or leave.

@larrylian
Contributor Author

For failure section, we should add more discussions on how we promise the scheduling result consensus.

I'll add some more failure scenarios

I'd also like to know if it's possible to achieve the following typical goals: distribute one actor to each node, with a guarantee that every node has exactly one actor even as nodes join or leave.

This is too complicated and needs further discussion, so it is not part of this feature.

@larrylian larrylian force-pushed the actor_affinity branch 2 times, most recently from 3616191 to aacbbb4 Compare November 16, 2022 11:39
@larrylian larrylian force-pushed the actor_affinity branch 2 times, most recently from e9a9783 to 06fc962 Compare November 28, 2022 09:36
@larrylian larrylian changed the title Add actor-affinity-feature proposal Add labels mechanish and actor-affinity-feature proposal Nov 28, 2022
@stephanie-wang
Contributor

stephanie-wang commented Jan 3, 2023

Hmm I agree with @ericl, I'm concerned the current scope of the REP is too broad. If we want to go ahead with full labels and expressions, we need deeper design review for the API.

For the use cases listed, I think we could pare down this REP to only allowing passing in other actor handles and/or custom resources to the scheduling strategy, and just use flags for affinity/anti-affinity instead of expressions. Of course we could always improve it later to add labels/expressions once we have more information from users. How does that sound?

@larrylian
Contributor Author

larrylian commented Jan 6, 2023

Give Labels to Actors/Nodes

I have concerns about adding labels to nodes, since we already have the notion of custom resources. Any reason not to use custom resources there?

  1. NodeAffinity may look similar to custom resources, but ActorAffinity is completely different. Our REP will first implement ActorAffinity and then extend to NodeAffinity.
  2. Adding labels to a node looks similar to custom resources, but in practice there is a big difference. A custom resource is a kind of resource: it has a double-typed quantity and is used for resource counting. Labels are attributes or tags: they are not doubles and are not used for counting.
  3. Adding labels to nodes is a simplification of, and a supplement to, custom resources.
    Labels are more user-friendly in many scenarios and easier for users to understand.
    For example:
    If nodes in different AZs are isolated using custom resources,
    the node in "AZ-1" is configured with {"AZ-1": 9999} and the node in "AZ-2" with {"AZ-2": 9999}.
    An actor then requests resources {"AZ-1": 1} to be scheduled there.
    The count in the custom resource is actually redundant when scheduling actors, which confuses users.
    With labels and NodeAffinity,
    you can directly set the labels {"AZ": "AZ-1"} and {"AZ": "AZ-2"} on the nodes.
    Then, when scheduling actors, the user specifies NodeAffinity with {"AZ": "AZ-1"}.
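
The contrast in point 3 can be sketched with a toy matcher. This is plain Python, not Ray's scheduler: `Node`, `fits_resources`, and `matches_labels` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Custom resources are countable quantities (floats).
    resources: dict = field(default_factory=dict)
    # Labels are plain key/value attributes; nothing is counted.
    labels: dict = field(default_factory=dict)


def fits_resources(node, demand):
    """True if the node has at least the requested amount of every resource."""
    return all(node.resources.get(k, 0.0) >= v for k, v in demand.items())


def matches_labels(node, wanted):
    """True if every wanted label is present on the node with the same value."""
    return all(node.labels.get(k) == v for k, v in wanted.items())


# Zone isolation via custom resources: the 9999 count carries no real meaning.
by_resource = [
    Node("n1", resources={"AZ-1": 9999.0}),
    Node("n2", resources={"AZ-2": 9999.0}),
]
# Zone isolation via labels: the zone is stated directly as an attribute.
by_label = [
    Node("n1", labels={"AZ": "AZ-1"}),
    Node("n2", labels={"AZ": "AZ-2"}),
]

picked_by_resource = [n.name for n in by_resource if fits_resources(n, {"AZ-1": 1.0})]
picked_by_label = [n.name for n in by_label if matches_labels(n, {"AZ": "AZ-1"})]
```

Both approaches select the same node here; the difference is that the resource version needs an arbitrary large count on the node and a token request of 1 from the actor.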

@larrylian
Contributor Author

Hmm I agree with @ericl, I'm concerned the current scope of the REP is too broad. If we want to go ahead with full labels and expressions, we need deeper design review for the API.

For the use cases listed, I think we could pare down this REP to only allowing passing in other actor handles and/or custom resources to the scheduling strategy, and just use flags for affinity/anti-affinity instead of expressions. Of course we could always improve it later to add labels/expressions once we have more information from users. How does that sound?

This REP introduces the concept of labels, which is quite different from custom_resource. Labels will greatly improve Ray's scheduling.
At a high level, we have reached an agreement with @scv119 @iycheng on the introduction of labels and affinity scheduling, and we think this is a good idea.

@larrylian larrylian requested a review from ericl January 6, 2023 07:51
Contributor

@rkooo567 rkooo567 left a comment


I believe we need a section when scheduling cannot happen. More concretely, what's the behavior when

  1. resources are enough, but affinity cannot be satisfied?
  2. affinity can be satisfied, but resources are not enough?
  3. both affinity and resources are not satisfied?

Do we throw an exception? Will it just hang? If so, how will this be exposed to users?

This also requires describing how this API will interact with the autoscaler (e.g. what if the resource request can't be satisfied now, but can be after autoscaling up?).

@ericl
Contributor

ericl commented Jan 6, 2023

Thanks for the comments. Overall, I agree with the value of adding node labels for more straightforward and powerful affinity/anti-affinity scheduling.

I do think this is a pretty big change--- perhaps the biggest change to Ray scheduling since placement groups. The scope and design of the change are my main concerns. I don't think the current REP sufficiently outlines the changes necessary to fully support this feature in Ray.

So here's my proposal, we split this REP into two separate REPs:

  1. Introducing the label mechanism for nodes. This should cover:
    • The proposed API for ray start, scheduling_strategy.
    • Ray autoscaler support for labels (including node config, changes to node selection strategy, infeasible error messages).
  2. Extension of the label mechanism for actor affinity.
    • The proposed API.
    • How to support this in the Ray autoscaler.

Thoughts? Also cc @wuisawesome

@stephanie-wang
Contributor

Thanks, @larrylian, I see the value of having labels, but I think we should try to get consensus on a more minimal short-term API change that will still be useful. Once we expand the scheduling API it becomes very difficult to go back, so we need to focus our efforts on APIs that will fit common use cases.

What do you think about the following, and we can split these into one REP each? These will likely not cover all of the use cases that you mentioned, but I believe they are uncontroversial and will already add value.

  1. Scheduling strategies with static node labels (can only be set at node start)
  2. ActorAffinity scheduling strategy using actor handles - the reasoning here is that using labels for actor affinity scheduling actually adds an additional unnecessary abstraction that users have to think about. For most users, I think the most intuitive API is to just specify actor handles to co-locate with or avoid.

Long-term, we can think about the following solutions, and for now just summarize the key use cases:

  1. Dynamic labels / labels created by actor scheduling.
  2. More advanced/flexible expressions in scheduling strategies.

@wuisawesome

So here's my proposal, we split this REP into two separate REPs:

  1. Introducing the label mechanism for nodes. This should cover:
    • The proposed API for ray start, scheduling_strategy.
    • Ray autoscaler support for labels (including node config, changes to node selection strategy, infeasible error messages).
  2. Extension of the label mechanism for actor affinity.
    • The proposed API.
    • How to support this in the Ray autoscaler.

Thoughts? Also cc @wuisawesome

I agree. The first part is very similar to the k8s API and seems clean and uncontroversial.

The second part (actor affinity) seems significantly more complicated and risky.

1. Tag the Actor with key-value labels first
2. Then affinity or anti-affinity scheduling to the actors of the specified labels.

![ActorAffinityScheduling](https://user-images.githubusercontent.com/11072802/188054945-48d980ed-2973-46e7-bf46-d908ecf93b31.jpg)


Are there any examples of existing systems that have this ActorAffinity concept? Or was there any existing inspiration for this?

Contributor Author


This ActorAffinity is similar to podAffinity/NodeAffinity of k8s.

@larrylian
Contributor Author

larrylian commented Jan 11, 2023

I believe we need a section when scheduling cannot happen. More concretely, what's the behavior when

  1. resources are enough, but affinity cannot be satisfied?
  2. affinity can be satisfied, but resources are not enough?
  3. both affinity and resources are not satisfied?

Do we throw an exception? Will it just hang? If so, how will this be exposed to users?

This also requires describing how this API will interact with the autoscaler (e.g. what if the resource request can't be satisfied now, but can be after autoscaling up?).

Thanks for the good question. I will add a section to the REP covering these abnormal scenarios.

Q1:

  1. resources are enough, but affinity cannot be satisfied? -> The actor stays pending (hangs), and a scheduling-failed event with the detailed unsatisfied reason is reported to users.
  2. affinity can be satisfied, but resources are not enough? -> The actor stays pending (hangs), and a scheduling-failed event with the detailed unsatisfied reason is reported to users.
  3. both affinity and resources are not satisfied? -> The actor stays pending (hangs), and a scheduling-failed event with the detailed unsatisfied reason is reported to users.
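
As a rough sketch of how the "detailed unsatisfied reason" for the three cases could be derived, the toy function below classifies a pending actor. It is not Ray code: `pending_reason`, the node dictionary layout, and the reason strings are all assumptions made for illustration, and affinity is simplified to required node labels.

```python
def pending_reason(nodes, cpu_demand, required_labels):
    """Return None if some node can host the actor now, else a reason string.

    nodes: list of dicts such as {"cpu": 4, "labels": {"AZ": "AZ-1"}} (hypothetical layout).
    """
    # Nodes that satisfy the affinity requirement, ignoring resources.
    matching = [
        n for n in nodes
        if all(n["labels"].get(k) == v for k, v in required_labels.items())
    ]
    if any(n["cpu"] >= cpu_demand for n in matching):
        return None  # schedulable right now
    if matching:
        # Case 2: affinity can be satisfied, but those nodes lack resources.
        return "resources insufficient on nodes that satisfy affinity"
    if any(n["cpu"] >= cpu_demand for n in nodes):
        # Case 1: resources exist somewhere, but no node satisfies affinity.
        return "affinity unsatisfied"
    # Case 3: neither affinity nor resources can be satisfied.
    return "affinity unsatisfied and resources insufficient"
```

In all three cases the actor would stay pending, as described above; only the reported reason differs.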

Q2:
I don't think this is a big problem. ActorAffinity/NodeAffinity is just a new scheduling strategy, similar to the existing SPREAD/DEFAULT/PG strategies.
It only needs to be adapted in the same way as the other scheduling strategies.

@larrylian
Contributor Author

I agree. The first part is very similar to the k8s API and seems clean and uncontroversial.

The second part (actor affinity) seems significantly more complicated and risky.

ActorAffinity is the concrete feature built on the mechanism discussed in the first part.
It is similar to PodAffinity/NodeAffinity in K8s.
It is not any more complicated. It may look complicated because in k8s it is described in YAML, whereas here it is described in Python.

@larrylian
Contributor Author

larrylian commented Jan 11, 2023

I don't think the current REP sufficiently outlines the changes necessary to fully support this feature in Ray.

The ActorAffinity/NodeAffinity described here is actually part of PG 2.0.
Currently only PG can be used to achieve affinity. However, PG combines affinity, resource reservation, and atomic scheduling, which makes it complicated both to use and to implement. It often causes problems under high load in large clusters, and its performance is not good.
So ActorAffinity is meant to replace only PG's affinity capability, which is cleaner and simpler to use.
I will add this to the REP.

Q2:

So here's my proposal, we split this REP into two separate REPs:

The discussion is actually already split into two REPs; they just both live in the same PR for now.
2022-11-23-labels-mechanish-and-affinity-schedule-feature.md -> Introducing the label mechanism for nodes, plus the high-level discussion and API form of ActorAffinity & NodeAffinity.
2022-08-31-actor-affinity-apis.md -> Extension of the label mechanism for actor affinity: API, abnormal behavior, and implementation.

Q3: autoscale

That's a good question.
I think this problem is not difficult to solve.
ActorAffinity/NodeAffinity is just a new scheduling strategy, similar to the existing SPREAD/DEFAULT/PG strategies.
It only needs to be adapted in the same way as the other scheduling strategies.

@larrylian
Contributor Author

  1. Scheduling strategies with static node labels (can only be set at node start)
  2. ActorAffinity scheduling strategy using actor handles - the reasoning here is that using labels for actor affinity scheduling actually adds an additional unnecessary abstraction that users have to think about. For most users, I think the most intuitive API is to just specify actor handles to co-locate with or avoid.

Long-term, we can think about the following solutions, and for now just summarize the key use cases:

  1. Dynamic labels / labels created by actor scheduling.
  2. More advanced/flexible expressions in scheduling strategies.

Thanks for your very valuable opinion.

Q1:ActorAffinity scheduling strategy using actor handles

This is a very good idea, so I added syntactic sugar for using actor_handle to solution 1.

actor_0 = Actor.remote()
actor_1 = Actor.options(
        scheduling_strategy=actor_affinity(affinity(actor_0, False))
    ).remote()

actor_2 = Actor.options(
        scheduling_strategy=actor_affinity(anti_affinity(actor_0, False))
    ).remote()

Q2: Scheduling strategies with static node labels (can only be set at node start)

For node labels, I intend to do the same: the first step is static labels, and dynamic node labels can be added in the long term.
But for actors, I will implement dynamic labels in the first step, because they are the basis of the ActorAffinity feature.

Q3:

  1. Dynamic labels / labels created by actor scheduling.
  2. More advanced/flexible expressions in scheduling strategies.

In my opinion, these should be achieved in the first step,
because my intention is to implement ActorAffinity first and then NodeAffinity.
The Option 1 API that I mentioned is actually relatively simple. Simply using actor_handle to express affinity/anti_affinity is unsatisfactory in complex nested-actor scenarios.

@stephanie-wang
Contributor

But for actors, I will implement dynamic labels in the first step, because they are the basis of the ActorAffinity feature.

Q3:

  1. Dynamic labels / labels created by actor scheduling.
  2. More advanced/flexible expressions in scheduling strategies.

In my opinion, these should be achieved in the first step, because my intention is to implement ActorAffinity first and then NodeAffinity. The Option 1 API that I mentioned is actually relatively simple. Simply using actor_handle to express affinity/anti_affinity is unsatisfactory in complex nested-actor scenarios.

Thanks for updating the REPs! But I don't quite understand why we need labels to implement ActorAffinity. It seems much easier to just implement actor affinity directly with handles:

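def schedule_with_affinity(actor, handles):
    # Find the nodes where the target actors currently run.
    locs = get_node_locs(handles)
    # Place the new actor on the best of those nodes.
    schedule_at_nodes(actor, best_loc(locs))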
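
A self-contained toy version of this idea is sketched below. It is only an illustration under stated assumptions: the `placement` and `free_slots` dictionaries stand in for whatever location and capacity information the scheduler actually holds, and "best location" is assumed to mean the node hosting the most target actors that still has room.

```python
from collections import Counter


def schedule_with_affinity(placement, handles, free_slots):
    """Pick a node for a new actor so that it co-locates with `handles`.

    placement: dict mapping actor name -> node name (hypothetical view of actor locations).
    handles: names of the actors to co-locate with.
    free_slots: dict mapping node name -> number of free actor slots.
    Returns the chosen node name, or None if no target node has room.
    """
    # Equivalent of get_node_locs: count target actors per node.
    locs = Counter(placement[h] for h in handles if h in placement)
    # Equivalent of best_loc: prefer the node with the most target actors that has room.
    for node, _count in locs.most_common():
        if free_slots.get(node, 0) > 0:
            return node
    return None
```

For example, with targets on `n1`, `n2`, and `n2`, the function prefers `n2` and falls back to `n1` only when `n2` is full.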

Scheduling with dynamic labels on the other hand introduces complexity around what happens if the label fails to be created due to actor failure, if the label is created by something other than the intended actor due to user error, how to make this work with autoscaling, etc. With dynamic labels, it seems like there will always be an issue of knowing how long to wait for a label to appear. I can see how the labels could help with scalability of actor affinity in the future but the added complexity doesn't seem worthwhile for a first version.

In the interest of supporting the desired feature ASAP (actor affinity scheduling), I suggest:

  1. We decouple the actor affinity REP from labels completely.
  2. Defer all discussion of dynamic labels (including ones created by actors) to a later REP.

@larrylian
Contributor Author

larrylian commented Jan 13, 2023

Thanks for updating the REPs! But I don't quite understand why we need labels to implement ActorAffinity. It seems much easier to just implement actor affinity directly with handles:

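def schedule_with_affinity(actor, handles):
    # Find the nodes where the target actors currently run.
    locs = get_node_locs(handles)
    # Place the new actor on the best of those nodes.
    schedule_at_nodes(actor, best_loc(locs))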

Scheduling with dynamic labels on the other hand introduces complexity around what happens if the label fails to be created due to actor failure, if the label is created by something other than the intended actor due to user error, how to make this work with autoscaling, etc. With dynamic labels, it seems like there will always be an issue of knowing how long to wait for a label to appear. I can see how the labels could help with scalability of actor affinity in the future but the added complexity doesn't seem worthwhile for a first version.

In the interest of supporting the desired feature ASAP (actor affinity scheduling), I suggest:

  1. We decouple the actor affinity REP from labels completely.
  2. Defer all discussion of dynamic labels (including ones created by actors) to a later REP.

Let me give a few examples to explain why I think labels are needed to implement ActorAffinity, and to show the value of labels.

To sum up, there are three problems with using only actor_handle to implement ActorAffinity.

  1. When an actor is rescheduled after failover (FO), the scheduling result may not meet expectations. This is the worst problem.
  2. In many scenarios it is impossible, or very complicated, to obtain the actor handle.
  3. In a scenario with many actors, using only actor handles is very complicated. Many jobs now consist of hundreds or thousands of actors.

1. Actor FO:
Consider spreading actors across nodes. Using only actor_handle, this is done as follows:

  1. Actor A is scheduled normally.
  2. Actor B: schedule_with_anti_affinity(A)
  3. Actor C: schedule_with_anti_affinity([A,B])
  4. If actor A fails over, A may be rescheduled onto the node that hosts B or C.
    After the actor failover, the anti-affinity no longer holds.
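
The failover scenario can be reproduced in a small simulation. This models only a naive reading of the handle-based scheme, in which an actor's constraint is the fixed list of handles given at creation time; `place`, the node names, and the `svc` label are hypothetical.

```python
def place(avoid_nodes, nodes):
    """Return the first node not in avoid_nodes (a stand-in for the real policy)."""
    for node in nodes:
        if node not in avoid_nodes:
            return node
    return None


nodes = ["n1", "n2", "n3", "n4"]

# Handle-based anti-affinity: each constraint lists only actors created earlier.
constraints = {"A": [], "B": ["A"], "C": ["A", "B"]}
placement = {}
for actor in ["A", "B", "C"]:
    avoid = {placement[other] for other in constraints[actor]}
    placement[actor] = place(avoid, nodes)

# A fails over. Assume its node n1 went down with it.
del placement["A"]
nodes_after_failure = ["n2", "n3", "n4"]

# A's own constraint list is empty, so nothing keeps it away from B or C.
avoid = {placement[other] for other in constraints["A"]}
handle_based_restart = place(avoid, nodes_after_failure)

# Label-based anti-affinity: avoid every node hosting an actor labeled "svc".
labels = {"A": "svc", "B": "svc", "C": "svc"}
avoid = {node for other, node in placement.items() if labels[other] == "svc"}
label_based_restart = place(avoid, nodes_after_failure)
```

In this toy run the handle-based restart lands on `n2`, next to B, while the label-based restart lands on the empty node `n4`.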

2. Can't obtain actor handle

Driver -> Actor A -> Actor B

When A creates B, it wants B to have affinity with A, to achieve locality. At this point, A cannot obtain its own handle.

Actors of a certain type within a multi-level actor DAG want to achieve affinity or anti-affinity.

Output actors are created by Reduce actors, but the output actors want to be co-located.
In this scenario, it is difficult for a single reduce actor to obtain all input actor handles. But giving every Output actor the same label makes affinity/anti-affinity easy to achieve.
image
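
The output-actor scenario can be sketched with a toy label-aware scheduler. Nothing here is Ray API: `LabelScheduler`, its methods, and the `role` label are hypothetical, and the default policy (least-loaded node) is an assumption chosen so that co-location is visible.

```python
class LabelScheduler:
    """Toy scheduler: actors carry labels, and affinity is expressed against labels."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.actors = {}  # actor name -> (node, labels)

    def _load(self, node):
        """Number of actors currently placed on the node."""
        return sum(1 for placed, _ in self.actors.values() if placed == node)

    def _nodes_with_label(self, key, value):
        """Nodes hosting at least one actor whose labels contain key=value."""
        return {node for node, labels in self.actors.values() if labels.get(key) == value}

    def create(self, name, labels, affinity=None):
        """Place `name`; `affinity` is an optional (key, value) label to co-locate with."""
        candidates = self.nodes
        if affinity is not None:
            matching = self._nodes_with_label(*affinity)
            # The first actor of a group has nothing to follow yet, so any node is fine.
            if matching:
                candidates = [n for n in self.nodes if n in matching]
        node = min(candidates, key=self._load)  # least-loaded candidate, ties by order
        self.actors[name] = (node, labels)
        return node


sched = LabelScheduler(["n1", "n2", "n3"])
sched.create("reduce-0", {"role": "reduce"})
sched.create("reduce-1", {"role": "reduce"})
# Each reducer creates its own output actor without knowing any other actor's handle.
out_0 = sched.create("output-0", {"role": "output"}, affinity=("role", "output"))
out_1 = sched.create("output-1", {"role": "output"}, affinity=("role", "output"))
```

Both output actors end up on the same node even though neither reducer ever held the other output actor's handle; the shared label is the only coordination point.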

3. Many actors.
Many DAGs or jobs now have hundreds of actors. You would have to pass hundreds of actor handles at once, which is inconvenient and performs poorly.

I have given several examples of using labels to achieve affinity in the REP; you can take a look.
image

cc @scv119 @ericl @jovany-wang

@larrylian larrylian requested a review from scv119 January 13, 2023 10:14
@stephanie-wang
Contributor

Thanks for providing the context, but I believe that these issues are not necessarily the common case for Ray users today so these seem like premature optimizations right now. It would be better to introduce the API first and from there gather feedback from the rest of the OSS community on usability/scalability before we go ahead with a potentially complex system. Do you see a problem with the approach of implementing a basic version using only actor handles first and later deciding whether/how to add a labels-based implementation?

In more detail:

  1. When an actor is rescheduled after failover (FO), the scheduling result may not meet expectations. This is the worst problem.

Failover semantics is indeed something that should be covered in the REP but actually I don't see why you can't accomplish the same thing using only actor handles. The GCS should know exactly which actors are placed where, so it can simply rerun the affinity/anti-affinity policy on FO.

  2. In many scenarios it is impossible, or very complicated, to obtain the actor handle.

It should be straightforward to support something like "get own handle" or "pass self to ActorAffinity strategy". Otherwise, the main problem is how difficult the API is to use, which we don't yet have evidence from the community on.

  3. In a scenario with many actors, using only actor handles is very complicated. Many jobs now consist of hundreds or thousands of actors.

Again, this API doesn't exist yet so we don't have evidence on usability. Also, there are very few OSS use cases right now that are using this many actors.

@ericl
Contributor

ericl commented Jan 13, 2023

Maybe let's think about one particular edge case: scheduling a lot of actors each with mutual anti-affinity. IIUC, the concern is this leads to O(n) sized constraints, such as:

a1 = A.options().remote()
a2 = A.options(anti_affinity=[a1]).remote()
a3 = A.options(anti_affinity=[a1, a2]).remote()
..
an = A.options(anti_affinity=[a1, a2, ..., an-1]).remote()

This is true, but I wonder why the user wouldn't use a single SPREAD or STRICT_SPREAD placement group in the first place to schedule these actors. There is a pending REP to substantially improve the placement groups API as well, to be more flexible.
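
To put a number on the O(n) concern, the short calculation below counts how many handles are passed when actor i lists all earlier actors, and compares it with a scheme where each actor names a single shared group (for example one label or one placement group). The function names are illustrative only.

```python
def handle_list_sizes(n):
    """Length of each actor's anti_affinity list when actor i lists all earlier actors."""
    return list(range(n))  # a1 lists 0 handles, a2 lists 1, ..., an lists n - 1


def total_handle_references(n):
    """Total handles passed across all n creation calls: n * (n - 1) / 2."""
    return sum(handle_list_sizes(n))


def total_group_references(n):
    """Total references when each actor instead names one shared group."""
    return n


sizes = handle_list_sizes(5)
total_1000 = total_handle_references(1000)
```

For 1000 actors, the pairwise scheme passes 499500 handle references in total, against 1000 group references.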

I think the combination of enhanced placement groups, node labels, and actor handle affinity could be pretty powerful. If we can see common situations that can't be handled by these features together, I think we should try to dive more into those scenarios.

@larrylian
Contributor Author

larrylian commented Jan 16, 2023

@stephanie-wang

Actor FO

a1 = A.options().remote()
a2 = A.options(anti_affinity=[a1]).remote()
a3 = A.options(anti_affinity=[a1, a2]).remote()
..
an = A.options(anti_affinity=[a1, a2, ..., an-1]).remote()

If a1 fails over, a1 may be rescheduled onto one of the nodes of a2~an.

You can see this in Ericl's reply: if you only use actor handles to implement anti-affinity, this problem will occur.
This problem is fatal, and it will cause enterprise users who care about stability to not consider using this feature at all.

  2. Can't obtain actor handle

You said that adding "get own handle" can solve it, but that is just one scenario I used as an example. There are many other scenarios where actor handles cannot be easily obtained, for example the multi-level DAG scenarios and cross-language calls we mentioned above. And if the number of actors is relatively large, the development cost for users to obtain every actor_handle is very high.

  3. Also, there are very few OSS use cases right now that are using this many actors.

The diagram of the "output" operator above is just an example. It could also be "map", "reduce", etc., or a topology from another scenario.

A job with hundreds of actors is actually very common now. Ray has been developed for many years, and I think we should now pay more attention to these enterprise-level large-scale scenarios.

  4. Do you see a problem with the approach of implementing a basic version using only actor handles first and later deciding whether/how to add a labels-based implementation?

This is because of the problems in many of the scenarios above, especially the actor failover scenario. As a result, using only actor_handle cannot meet users' needs.

@larrylian
Contributor Author

@ericl

  1. why the user wouldn't use a single SPREAD or STRICT_SPREAD placement group in the first place to schedule these actors.

PG is only suitable for scenarios where the number of actors is known in advance. But many jobs are now long-running and scale dynamically with time or traffic, so the number of actors to be grouped is not fixed.

  2. There is a pending REP to substantially improve the placement groups API as well, to be more flexible.

Even if PG is enhanced and the new PG resize function can solve the above problem, it is still very complicated for users.
Example:

  1. Set PG bundle 1 to resources {CPU=10} and schedule 10 actors (1 CPU per actor) to bundle 1.
  2. Suppose the business needs to grow to 20 actors.
  3. The user must first resize the bundle's resources to {CPU=20}.
  4. Then schedule the other 10 actors to bundle 1.
  5. If the job later shrinks to 5 actors, the user also needs to resize the PG resources to avoid wasting them.
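
The bookkeeping in these five steps can be traced with a toy model. This is not the placement group API (the resize function is only a pending proposal in this discussion): `Bundle` and its methods are hypothetical and exist only to show what the user would have to track.

```python
class Bundle:
    """Toy model of a reserved bundle whose capacity the user must manage."""

    def __init__(self, cpu):
        self.cpu = cpu   # reserved CPUs
        self.used = 0    # CPUs taken by scheduled actors

    def resize(self, cpu):
        if cpu < self.used:
            raise ValueError("cannot shrink below current usage")
        self.cpu = cpu

    def add_actors(self, count, cpu_per_actor=1):
        need = count * cpu_per_actor
        if self.used + need > self.cpu:
            raise RuntimeError("bundle full: resize first")
        self.used += need

    def remove_actors(self, count, cpu_per_actor=1):
        self.used = max(0, self.used - count * cpu_per_actor)


bundle = Bundle(cpu=10)
bundle.add_actors(10)           # step 1: 10 actors fill the bundle
try:
    bundle.add_actors(10)       # step 2: growing to 20 actors fails as-is
    needed_resize = False
except RuntimeError:
    needed_resize = True
bundle.resize(20)               # step 3: the user resizes first
bundle.add_actors(10)           # step 4: then schedules the other 10 actors
bundle.remove_actors(15)        # step 5: the job shrinks to 5 actors...
bundle.resize(5)                # ...and the user shrinks the reservation too
```

Every scale-up and scale-down needs an extra, correctly ordered resize call from the user, which is the overhead the comment describes.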

1 > For scenarios that only require affinity/anti-affinity, using PG requires users to manage PG resources themselves. This increases the complexity of use, which is not the original intention of Ray.

2 > The current PG is already very complicated, and implementing PG resize will make it more so. Performance, consistency, stability, etc. would then become big problems.

3 > The PG implementation has to handle affinity, resource reservation, and two-phase-commit gang scheduling at the same time, so its internals are very complicated. This leads to many problems under high load in large-scale clusters:

  1. A PG may have 100+ or 1000+ bundles. If one bundle fails to be created (or recreated), the entire PG fails to be created. Or if job initialization on one node is slow, creation of the entire PG becomes very slow. These problems may not appear with only a few bundles, but they appear often in large-scale enterprise clusters.
  2. PGs are currently created serially, so when the scheduling of one PG hangs, other PGs hang as well. We also encounter this often internally.

Because of these problems and others, our internal users often complain about PG, and more and more of them have given up using it.
With ActorAffinity implemented through labels, both usage and the internal implementation are much simpler. Our internal ActorAffinity feature has been available to users for half a year, and there have been no problems so far.

@stephanie-wang
Contributor

If a1 fails over, a1 may be rescheduled onto one of the nodes of a2~an.

You can see this in Ericl's reply: if you only use actor handles to implement anti-affinity, this problem will occur. This problem is fatal, and it will cause enterprise users who care about stability to not consider using this feature at all.

Thank you for the explanation, this makes sense.

  1. Do you see a problem with the approach of implementing a basic version using only actor handles first and later deciding whether/how to add a labels-based implementation?

This is because of the problems in many of the scenarios above, especially the actor failover scenario. As a result, using only actor_handle cannot meet users' needs.

From my perspective, it seems like the issues with the actor handle implementation mainly come from large-scale multitenant Ray clusters with long-lived services. While I agree this is an important scenario in large-scale enterprise cases that we should pay more attention to, this is not yet the common case for most users, especially the multitenancy part. For example, the only major Ray library that partly fits this use case is Ray Serve, and AFAIK Serve does not have as hard requirements for actor affinity/anti-affinity.

To be clear, I am not against implementing the labels version; I just don't think it's the right thing to start with right now. If we start with a more minimal initial version, we can collect feedback for a more holistic solution (that could include the improved placement groups API that @ericl mentioned), and there is also less chance that we will paint ourselves into a corner with an insufficient API. It's possible we will end up converging on labels in the end anyway, but it's not clear to me yet what the alternatives are and how it will fit with other ongoing scheduling enhancements.

So, considering the current OSS use cases, do you see a problem with the approach of implementing a basic version using only actor handles first?

@zhe-thoughts zhe-thoughts changed the title Add labels mechanish and actor-affinity-feature proposal Add labels mechanism and actor-affinity-feature proposal Jan 18, 2023
@larrylian
Contributor Author

larrylian commented Jan 22, 2023

@stephanie-wang

So, considering the current OSS use cases, do you see a problem with the approach of implementing a basic version using only actor handles first?

The actor handle solution can only cover simple actor affinity; it cannot cover the anti-affinity scenario. Since you care more about OSS-related scenarios, let me take the following example. When the input actors read data from OSS, they need to slice it and distribute it to the downstream operators. The input actors then need to be spread across the nodes. If this is implemented using only actor handles, the actor failover problem arises.
image

the only major Ray library that partly fits this use case is Ray Serve, and AFAIK Serve does not have as hard requirements for actor affinity/anti-affinity.

In fact, I think Ray Serve/RLlib/Datasets have many requirements for actor affinity/anti-affinity. For example, serve.deployment instances of the same type in Ray Serve want to be spread across nodes to achieve volume weight and reduce mutual influence. These features would be elegant and simple to implement with labels.

@ericl
Contributor

ericl commented Jan 23, 2023

This is an interesting discussion with regards to fault tolerance and simplicity of APIs. I think I agree that there are significant pain points actor label affinity solves. However, considering other scheduling improvements this might not be the first one we'd want to land. For example, I think it would be better to support basic labeling support first and nail down those interactions with the scheduler.

I think it would be productive to think about flexible scheduling more in general. If scheduling constraints could be implemented in a nicely "pluggable" way, then policies like these should be easy to prototype and introduce without big architectural ramifications.

So here's my proposal:

  • Let's think about a pluggable "scheduler constraint" system that supports the use cases in this REP (i.e., soft and hard constraints based on static and runtime scheduling information).
  • We should think about how this would be implemented in depth, including autoscaler integration.
  • Then we should have a bunch of possible scheduler plugins, one of which could be the actor label one.

As an example, for Ray runtime envs we have a plugin system which has been pretty useful for handling both basic and advanced use cases without needing any architectural changes.
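
One possible shape for such a pluggable constraint system is sketched below, purely as an illustration: hard constraints filter nodes, soft constraints score the survivors. The function names, the node dictionary layout, and the two example plugins are assumptions, not an existing Ray interface.

```python
def pick_node(nodes, hard, soft):
    """Choose a node that passes every hard constraint and maximizes the soft score.

    nodes: list of dicts such as {"name": "n1", "labels": {...}, "actors": [...]}.
    hard: list of callables node -> bool; a node failing any of them is rejected.
    soft: list of callables node -> float; higher totals are preferred.
    """
    feasible = [n for n in nodes if all(check(n) for check in hard)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: sum(score(n) for score in soft))


def node_label_equals(key, value):
    """Hard-constraint plugin based on static information (node labels)."""
    return lambda node: node["labels"].get(key) == value


def prefer_fewer_actors():
    """Soft-constraint plugin based on runtime information (current actor count)."""
    return lambda node: -len(node["actors"])


cluster = [
    {"name": "n1", "labels": {"AZ": "AZ-1"}, "actors": ["a", "b"]},
    {"name": "n2", "labels": {"AZ": "AZ-1"}, "actors": []},
    {"name": "n3", "labels": {"AZ": "AZ-2"}, "actors": []},
]
chosen = pick_node(cluster, [node_label_equals("AZ", "AZ-1")], [prefer_fewer_actors()])
```

Under this shape, an actor-label affinity policy would be one more plugin: a hard or soft constraint that inspects the labels of actors already placed on each node.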

@larrylian
Contributor Author

larrylian commented Feb 2, 2023

@ericl

Let's think about a pluggable "scheduler constraint" system that supports the use cases in this REP (i.e., soft and hard constraints based on static and runtime scheduling information).

I fully agree with a pluggable "scheduler constraint" system. In fact, the current Ray core scheduling strategy system is gradually being restructured into a pluggable system. The newly added ActorAffinity/NodeAffinity scheduling strategies will also be implemented to be as independent and pluggable as possible.

image

@larrylian
Contributor Author

@ericl @stephanie-wang @scv119

Now I also realize that introducing a complex set of APIs at once will indeed have a great impact, and not everyone may be able to accept it immediately. So I agree with your earlier premise that the simple and valuable parts should be implemented first, and that the complex solutions should be discussed after this piece is completed.

I will split out a new REP to discuss the first part. The first part mainly covers the following features:

  1. NodeAffinity - Add static labels to nodes (can only be set at node start) and enhance NodeAffinity based on labels.
  2. ActorAffinity - Add an ActorAffinity scheduling strategy using actor handles. Actor labels will not be exposed at first, but the internal implementation will be label-based, to make it easy to extend ActorAffinity with labels in the future.

8 participants