Add labels mechanism and actor-affinity-feature proposal #13
This closely matches the k8s pod affinity/anti-affinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity).
We may want label based node affinity/anti-affinity as well in the future but I think it can be solved separately.
For the failure section, we should discuss in more detail how we guarantee consensus on the scheduling result. I'd also like to know whether the following typical goals are achievable:
I'll add some more failure scenarios
This is too complicated and needs further discussion, so it is not implemented as part of this feature.
Hmm, I agree with @ericl; I'm concerned the current scope of the REP is too broad. If we want to go ahead with full labels and expressions, we need deeper design review for the API. For the use cases listed, I think we could pare down this REP to only allowing passing in other actor handles and/or custom resources to the scheduling strategy, and just use flags for affinity/anti-affinity instead of expressions. Of course we could always improve it later to add labels/expressions once we have more information from users. How does that sound?
This REP introduces the concept of labels, which is quite different from custom_resource. These labels will greatly improve Ray's scheduling.
I believe we need a section on what happens when scheduling cannot proceed. More concretely, what is the behavior when
- resources are enough, but affinity cannot be satisfied?
- affinity can be satisfied, but resources are not enough?
- both affinity and resources are not satisfied?
Do we throw an exception? Will it just hang? If so, how will this be exposed to users?
This also requires describing how this API will interact with the autoscaler (for example, what if the resource request can't be satisfied now, but could be after autoscaling up?).
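To make the cases in the question above concrete, here is a minimal sketch that classifies a request against a cluster snapshot. It is only an illustration: `Node`, `Outcome`, and `classify` are hypothetical names, affinity is simplified to exact label matching on a node, and nothing here reflects how Ray actually reports or handles these situations.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SCHEDULABLE = "schedulable"
    AFFINITY_UNSATISFIED = "resources fit, affinity does not"
    RESOURCES_UNSATISFIED = "affinity fits, resources do not"
    BOTH_UNSATISFIED = "neither fits"


@dataclass
class Node:
    # Hypothetical node model: free CPUs plus the labels visible on the node.
    free_cpus: float
    labels: dict = field(default_factory=dict)


def classify(nodes, cpus_needed, required_labels):
    """Report which of the cases from the question a request falls into."""

    def fits(node):
        return node.free_cpus >= cpus_needed

    def matches(node):
        return all(node.labels.get(k) == v for k, v in required_labels.items())

    if any(fits(n) and matches(n) for n in nodes):
        return Outcome.SCHEDULABLE
    if any(matches(n) for n in nodes):
        # Some node satisfies the affinity, but none of those has room.
        return Outcome.RESOURCES_UNSATISFIED
    if any(fits(n) for n in nodes):
        # Some node has room, but no node satisfies the affinity.
        return Outcome.AFFINITY_UNSATISFIED
    return Outcome.BOTH_UNSATISFIED


nodes = [Node(4, {"zone": "a"}), Node(0, {"zone": "b"})]
print(classify(nodes, 2, {"zone": "b"}))
```

Whether each non-schedulable outcome should raise, stay pending, or trigger the autoscaler is exactly the open question; the sketch only separates the cases so that a policy can be attached to each.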
Thanks for the comments. Overall, I agree with the value of adding node labels for more straightforward and powerful affinity/anti-affinity scheduling. I do think this is a pretty big change, perhaps the biggest change to Ray scheduling since placement groups. The scope and design of the change are my main concerns. I don't think the current REP sufficiently outlines the changes necessary to fully support this feature in Ray. So here's my proposal: we split this REP into two separate REPs:
Thoughts? Also cc @wuisawesome
Thanks, @larrylian, I see the value of having labels, but I think we should try to get consensus on a more minimal short-term API change that will still be useful. Once we expand the scheduling API it becomes very difficult to go back, so we need to focus our efforts on APIs that will fit common use cases. What do you think about the following, and we can split these into one REP each? These will likely not cover all of the use cases that you mentioned, but I believe they are uncontroversial and will already add value.
Long-term, we can think about the following solutions, and for now just summarize the key use cases:
I agree. The first part is very similar to the k8s API and seems clean and uncontroversial. The second part (actor affinity) seems significantly more complicated and risky.
1. Tag the Actor with key-value labels first.
2. Then use affinity or anti-affinity scheduling against the actors with the specified labels.

![ActorAffinityScheduling](https://user-images.githubusercontent.com/11072802/188054945-48d980ed-2973-46e7-bf46-d908ecf93b31.jpg)
Are there any examples of existing systems that have this ActorAffinity concept? Or was there any existing inspiration for this?
This ActorAffinity is similar to the podAffinity/NodeAffinity of k8s.
Thanks for the good question. I will add a section to the REP covering this failure scenario. Q1:
Q2:
ActorAffinity is the specific implementation feature discussed in the first part.
What is being discussed now, ActorAffinity/NodeAffinity, is actually part of PG 2.0. Q2:
I have actually now split this into two REPs for discussion; for now, both REPs are in the same PR.
That's a good question.
Thanks for your very valuable opinion.
This is a very good idea, so I added the syntactic sugar of using actor_handle to solution 1.
For node labels, I intend to do the same: the first step is to use static labels, and then extend to dynamic node labels in the long term.
In my opinion, this is the first step to achieve.
Thanks for updating the REPs! But I don't quite understand why we need labels to implement ActorAffinity. It seems much easier to just implement actor affinity directly with handles:
Scheduling with dynamic labels on the other hand introduces complexity around what happens if the label fails to be created due to actor failure, if the label is created by something other than the intended actor due to user error, how to make this work with autoscaling, etc. With dynamic labels, it seems like there will always be an issue of knowing how long to wait for a label to appear. I can see how the labels could help with scalability of actor affinity in the future but the added complexity doesn't seem worthwhile for a first version. In the interest of supporting the desired feature ASAP (actor affinity scheduling), I suggest:
Thanks for providing the context, but I believe that these issues are not necessarily the common case for Ray users today so these seem like premature optimizations right now. It would be better to introduce the API first and from there gather feedback from the rest of the OSS community on usability/scalability before we go ahead with a potentially complex system. Do you see a problem with the approach of implementing a basic version using only actor handles first and later deciding whether/how to add a labels-based implementation? In more detail:
Failover semantics is indeed something that should be covered in the REP but actually I don't see why you can't accomplish the same thing using only actor handles. The GCS should know exactly which actors are placed where, so it can simply rerun the affinity/anti-affinity policy on FO.
It should be straightforward to support something like "get own handle" or "pass
Again, this API doesn't exist yet so we don't have evidence on usability. Also, there are very few OSS use cases right now that are using this many actors.
Maybe let's think about one particular edge case: scheduling a lot of actors each with mutual anti-affinity. IIUC, the concern is this leads to O(n) sized constraints, such as:
This is true, but I wonder why the user wouldn't use a single SPREAD or STRICT_SPREAD placement group in the first place to schedule these actors. There is a pending REP to substantially improve the placement groups API as well, to be more flexible. I think the combination of enhanced placement groups, node labels, and actor handle affinity could be pretty powerful. If we can see common situations that can't be handled by these features together, I think we should try to dive more into those scenarios.
If a1 fails over, a1 may be scheduled onto the node of one of a2~an (see ericl's reply). This problem arises if you use only actor handles to implement anti-affinity.
You said that adding "get own handle" could solve it, but this is just one scenario I gave as an example. There are many other scenarios where actor handles cannot be easily obtained, for example the multi-level DAG scenarios and cross-language calls we mentioned above. And if the number of Actors is relatively large, the development cost for users to obtain each actor_handle is very high.
The "output" operator in the diagram above is just an example. It could also be "map" or "reduce", etc., or a topology from some other scenario. A job with hundreds of actors is actually very common now. Ray has been in development for many years, and I think we should now pay more attention to these enterprise-level, large-scale scenarios.
Because of the problems in many of the scenarios above, especially the Actor failover scenario, using only actor_handle cannot meet users' needs.
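The failover concern can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the functions `feasible_nodes_by_handles` and `feasible_nodes_by_labels`, the actor names, and the `group` label are all hypothetical and are not part of any Ray API. It assumes handle-based anti-affinity can only name actors that already exist when the constraint is written.

```python
def feasible_nodes_by_handles(placement, avoid):
    """Nodes that host none of the actors named in `avoid` (handle-style)."""
    return sorted(
        node for node, actors in placement.items()
        if not set(actors) & set(avoid)
    )


def feasible_nodes_by_labels(placement, actor_labels, key, value):
    """Nodes that host no actor whose labels contain key=value (label-style)."""
    return sorted(
        node for node, actors in placement.items()
        if not any(actor_labels[a].get(key) == value for a in actors)
    )


# a1 was created first, so its constraint could not name a2 or a3.
avoid = {"a1": [], "a2": ["a1"], "a3": ["a1", "a2"]}
# With labels, every actor in the group carries the same tag (hypothetical).
actor_labels = {name: {"group": "spread"} for name in avoid}

# a1's node has died; a2 and a3 are still running and node4 is empty.
placement = {"node2": ["a2"], "node3": ["a3"], "node4": []}

handle_choice = feasible_nodes_by_handles(placement, avoid["a1"])
label_choice = feasible_nodes_by_labels(placement, actor_labels, "group", "spread")
print(handle_choice)  # every node is allowed, including those hosting a2 and a3
print(label_choice)   # only the empty node is allowed
```

In this model the handle-based constraint is one-directional, so rescheduling a1 may collocate it with a2 or a3, while the label selector is re-evaluated against whatever is running at failover time. A handle-based design could close the gap by having the scheduler track the reverse constraints, which is the alternative raised earlier in the thread.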
PG is only suitable for scenarios where the number of Actors is known in advance. But many jobs today are long-running jobs that scale dynamically with time or traffic, so the number of Actors to be grouped is not fixed.
An enhanced PG with a new resize function could solve the above problems, but it would also be very complicated for users.
1. For scenarios that only require affinity/anti-affinity, using PG requires users to manage PG resources themselves. This increases the complexity of use, which is not the original intention of Ray.
2. The current PG is already very complicated, and implementing PG resize will make it more so. Performance, consistency, and stability would then all become significant problems.
3. The PG implementation currently has to handle three functions at once (affinity, resource reservation, and gang scheduling submitted in two phases), and the internal implementation is very complicated. This leads to many problems under high load in large-scale clusters:
Because of these problems and others, our internal users often complain about the PG function, and more and more of them have given up using PG.
Thank you for the explanation, this makes sense.
From my perspective, it seems like the issues with the actor handle implementation mainly come from large-scale multitenant Ray clusters with long-lived services. While I agree this is an important scenario in large-scale enterprise cases that we should pay more attention to, this is not yet the common case for most users, especially the multitenancy part. For example, the only major Ray library that partly fits this use case is Ray Serve, and AFAIK Serve does not have as hard requirements for actor affinity/anti-affinity. To be clear, I am not against implementing the labels version; I just don't think it's the right thing to start with right now. If we start with a more minimal initial version, we can collect feedback for a more holistic solution (that could include the improved placement groups API that @ericl mentioned), and there is also less chance that we will paint ourselves into a corner with an insufficient API. It's possible we will end up converging on labels in the end anyway, but it's not clear to me yet what the alternatives are and how it will fit with other ongoing scheduling enhancements. So, considering the current OSS use cases, do you see a problem with the approach of implementing a basic version using only actor handles first?
This is an interesting discussion with regard to fault tolerance and simplicity of APIs. I think I agree that there are significant pain points actor label affinity solves. However, considering other scheduling improvements, this might not be the first one we'd want to land. For example, I think it would be better to add basic labeling support first and nail down those interactions with the scheduler. I think it would be productive to think about flexible scheduling more in general. If scheduling constraints could be implemented in a nicely "pluggable" way, then policies like these should be easy to prototype and introduce without big architectural ramifications. So here's my proposal:
As an example, for Ray runtime envs we have a plugin system which has been pretty useful for handling both basic and advanced use cases without needing any architectural changes.
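One way to picture the "pluggable" constraint idea is a registry of independent node predicates that the scheduler composes. The sketch below is purely illustrative: `register_constraint`, `feasible`, and the dictionary shapes for nodes and requests are assumptions made for this example and do not correspond to Ray's scheduler or to the runtime env plugin interface.

```python
from typing import Callable, Dict, List

# A constraint is a predicate over (node, request). Hypothetical shape.
Constraint = Callable[[dict, dict], bool]
_REGISTRY: Dict[str, Constraint] = {}


def register_constraint(name: str):
    """Decorator that adds a constraint to the registry under `name`."""
    def decorator(fn: Constraint) -> Constraint:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register_constraint("resources")
def has_resources(node: dict, request: dict) -> bool:
    return node["free_cpus"] >= request.get("cpus", 0)


@register_constraint("node_labels")
def matches_node_labels(node: dict, request: dict) -> bool:
    wanted = request.get("node_labels", {})
    return all(node["labels"].get(k) == v for k, v in wanted.items())


def feasible(nodes: List[dict], request: dict, enabled: List[str]) -> List[str]:
    """Return ids of nodes that pass every enabled constraint."""
    checks = [_REGISTRY[name] for name in enabled]
    return [n["id"] for n in nodes if all(check(n, request) for check in checks)]


nodes = [
    {"id": "n1", "free_cpus": 4, "labels": {"gpu": "a100"}},
    {"id": "n2", "free_cpus": 1, "labels": {"gpu": "a100"}},
    {"id": "n3", "free_cpus": 8, "labels": {}},
]
request = {"cpus": 2, "node_labels": {"gpu": "a100"}}
print(feasible(nodes, request, ["resources", "node_labels"]))
```

Under this shape, an actor-affinity policy would be one more registered predicate, which is what would make it cheap to prototype without touching the core feasibility loop.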
I fully agree with a pluggable "scheduler constraint" system. In fact, the current Ray core scheduling strategy system is slowly being restructured into a pluggable system. The newly added ActorAffinity/NodeAffinity scheduling strategies will also be implemented to be as independent and pluggable as possible.
@ericl @stephanie-wang @scv119 I now also realize that introducing a complex set of APIs all at once will indeed have a large impact, and not everyone may be able to accept it immediately. So I agree with your earlier premise that the simple and valuable parts should be implemented first, with the more complex solutions discussed after that piece is completed. I will split off a new REP to discuss the first part. The first part mainly consists of the following features:
1. Labels mechanism.
Introduce the Labels mechanism. Give Labels to Actors/Nodes.
Affinity features such as ActorAffinity/NodeAffinity can be realized through Labels.
Read reps/2022-11-23-labels-mechanish-and-affinity-schedule-feature.md first.
2. Actor-Affinity Feature
Provides a set of lightweight actor affinity and anti-affinity scheduling interfaces.
These replace the heavy PG interface for collocating or spreading actors.
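The two-step flow summarized above (tag actors with labels, then schedule relative to them) can be sketched as follows. This is a toy model, not the proposed interface: `pick_nodes`, the `anti` flag, and the `app` label are hypothetical names chosen for illustration, and resources are ignored entirely.

```python
def pick_nodes(placement, actor_labels, key, values, anti=False):
    """Nodes that host (affinity) or do not host (anti-affinity) a matching actor."""

    def hosts_match(actors):
        return any(actor_labels[a].get(key) in values for a in actors)

    return sorted(
        node for node, actors in placement.items()
        if hosts_match(actors) != anti
    )


# Step 1: tag actors with key-value labels (hypothetical names).
actor_labels = {"cache-1": {"app": "cache"}, "web-1": {"app": "web"}}
placement = {"node1": ["cache-1"], "node2": ["web-1"], "node3": []}

# Step 2: schedule relative to the actors that carry those labels.
collocate = pick_nodes(placement, actor_labels, "app", {"cache"})
spread = pick_nodes(placement, actor_labels, "app", {"cache"}, anti=True)
print(collocate, spread)
```

Here `collocate` contains only the node already running a `cache` actor, and `spread` contains every other node, which corresponds to the collocate and spread behaviors the feature aims to offer without a placement group.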