[REP][Core]New scheduling feature: taints and tolerations #44

Open
wants to merge 1 commit into
base: main

larrylian
Contributor

@larrylian larrylian commented Aug 28, 2023

We plan to introduce a Taints & Tolerations feature to isolate scarce resources, especially GPU nodes. Preventing ordinary CPU tasks from being scheduled on GPU nodes is a key requirement that many Ray users expect.

Key concepts:

Ray scheduling framework

Taints & Tolerations concepts

  1. If you don't want normal CPU tasks/actors to be scheduled on a GPU node, you can add a taint to the GPU node (node1) using the Ray CLI. For example:
# node1
ray start --taints={"gpu_node":"true"} --num-gpus=1
  2. Normal CPU tasks/actors/placement groups will then not be scheduled on the GPU node.

The actor/pg will not be scheduled onto node1:

actor = Actor.options(num_cpus=1).remote()
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
  3. To schedule a GPU task onto the GPU node (node1), specify a toleration for the task.

The actor/pg can be scheduled onto node1:

actor = Actor.options(num_gpus=1, tolerations={"gpu_node": Exists()}).remote()
pg = ray.util.placement_group(bundles=[{"GPU": 1}], tolerations={"gpu_node": Exists()})
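To make the matching rule concrete, here is a minimal pure-Python sketch of how a scheduler might decide whether a task's tolerations admit it onto a tainted node. It does not call Ray. The `Exists` class and the `tolerates` helper are hypothetical stand-ins, and the rule itself (every taint on the node must be matched by a toleration, with `Exists` matching any value) is an assumption borrowed from how taints commonly work in other schedulers, not something this proposal confirms.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Exists:
    """Hypothetical operator: tolerates a taint key regardless of its value."""


# A toleration is either an exact value to match or the Exists operator.
Toleration = Union[str, Exists]


def tolerates(taints: Dict[str, str], tolerations: Dict[str, Toleration]) -> bool:
    """Return True if every taint on the node is matched by a toleration.

    Assumed rule: a single untolerated taint makes the node infeasible.
    """
    for key, value in taints.items():
        if key not in tolerations:
            return False
        toleration = tolerations[key]
        if isinstance(toleration, Exists):
            continue  # Exists matches any taint value for this key
        if toleration != value:
            return False
    return True


node1_taints = {"gpu_node": "true"}

# CPU actor without tolerations: rejected by the tainted node.
cpu_actor_fits = tolerates(node1_taints, {})
# GPU actor that tolerates the taint: accepted.
gpu_actor_fits = tolerates(node1_taints, {"gpu_node": Exists()})
# An untainted node accepts anything.
untainted_fits = tolerates({}, {})

print(cpu_actor_fits, gpu_actor_fits, untainted_fits)
```

Under these assumptions the three checks print `False True True`, which matches the behavior described in the steps above: the plain CPU actor is kept off node1, while the actor carrying the `gpu_node` toleration is allowed on.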

You can also use taints to achieve node isolation.

  1. If you want to isolate a node under memory pressure so that tasks are not scheduled onto it, you can use ray taint:
ray taint --node-id {node_id_1} --append {"memory-pressure":"high"}

New tasks/actors/pgs will then not be scheduled onto node1.

  2. You can restore the node once its memory pressure has dropped to a low level.
ray taint --node-id {node_id_1} --delete {"memory-pressure":"high"}

New tasks/actors/pgs can then be scheduled onto node1 again.
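The append/delete cycle above can be sketched in plain Python as a small in-memory taint table. This is only an illustration of the intended effect on new placement decisions. The `NodeTaints` class and its methods are hypothetical, they are not part of Ray or of the proposed CLI, and the sketch assumes that a task without tolerations can only land on a node with no taints.

```python
from typing import Dict, List


class NodeTaints:
    """Hypothetical in-memory view of per-node taints (not a Ray API)."""

    def __init__(self) -> None:
        self._taints: Dict[str, Dict[str, str]] = {}

    def append(self, node_id: str, taint: Dict[str, str]) -> None:
        """Add taints to a node, mirroring the proposed --append flag."""
        self._taints.setdefault(node_id, {}).update(taint)

    def delete(self, node_id: str, taint: Dict[str, str]) -> None:
        """Remove taints whose key and value both match, mirroring --delete."""
        current = self._taints.get(node_id, {})
        for key, value in taint.items():
            if current.get(key) == value:
                del current[key]

    def schedulable_nodes(self, node_ids: List[str]) -> List[str]:
        """Nodes that a task with no tolerations may be placed on (assumed rule)."""
        return [n for n in node_ids if not self._taints.get(n)]


nodes = ["node1", "node2"]
table = NodeTaints()

# Isolate node1 while it is under memory pressure.
table.append("node1", {"memory-pressure": "high"})
during_pressure = table.schedulable_nodes(nodes)

# Restore node1 once the pressure has dropped.
table.delete("node1", {"memory-pressure": "high"})
after_restore = table.schedulable_nodes(nodes)

print(during_pressure, after_restore)
```

With the taint in place only `node2` is offered for new work. After the delete, both nodes are offered again. The sketch models new placement decisions only, in line with the description above, and says nothing about tasks already running on the node.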

@larrylian larrylian force-pushed the taints_torelation branch 3 times, most recently from 31fcb64 to 9fcb558 Compare August 31, 2023 08:29