# [core] Readme doc for physical mode #49457
### Ray core cgroup documentation

#### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:

- It prevents an application from eating up unlimited resources and starving other applications running on the same node;
- It protects other processes on the node (including Ray system components) from being killed due to insufficient resources (i.e. OOM);

> **Review discussion:**
>
> - [nit] I think these two bullet points are not parallel or mutually exclusive... and it's not "benefits" but more of a neutral "behavior". Also, it's not clear what "application" means in this line: does it mean non-Ray or other-raylet processes?
> - I renamed it to "user application".
> - Well, it's the benefit / motivation compared to "no cgroup", which is the current implementation.
> - On a second read, these two bullet points express a similar meaning, so I combined them into one.

TODO(hjiang): Reserving a minimum amount of resources will be supported in the future.
#### Disclaimer and presumption
- The feature is built on top of cgroups, which are only supported on Linux;
- Only cgroup v2 is supported; Ray also requires the application to have write permission and cgroup v2 to be mounted in rw mode (see the check sketched below);
- In the initial version, Ray caps max resource usage via heuristic estimation (TODO: support user passed-in values).

> **Review discussion:**
>
> - Make this clearer on what "heuristic estimation" we use. Move this to e.g. Also add the lifetimes of a cgroup vs. task attempts. Describe how you put a (pre-existing idle) worker into a cgroup, and when you move it out and remove the cgroup.
> - Added.
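To make the cgroup v2 requirement above concrete, here is a minimal, illustrative check (not part of Ray's codebase; the file name and messages are made up) that cgroup v2 is mounted in rw mode at the standard `/sys/fs/cgroup` location and is writable by the current process:

```cpp
// cgroup_v2_check.cc -- illustrative sketch, not Ray code.
// Verifies that cgroup v2 is mounted rw at /sys/fs/cgroup and that the
// current process can write to it.
#include <unistd.h>

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  // /proc/mounts lists each mount as: <source> <target> <fstype> <options> ...
  std::ifstream mounts("/proc/mounts");
  std::string line;
  bool cgroup_v2_rw = false;
  while (std::getline(mounts, line)) {
    std::istringstream iss(line);
    std::string source, target, fstype, options;
    iss >> source >> target >> fstype >> options;
    if (fstype == "cgroup2" && target == "/sys/fs/cgroup" &&
        options.rfind("rw", 0) == 0) {  // mount options start with "rw"
      cgroup_v2_rw = true;
      break;
    }
  }
  if (!cgroup_v2_rw) {
    std::cerr << "cgroup v2 is not mounted rw at /sys/fs/cgroup\n";
    return 1;
  }
  // W_OK: does the process have write permission on the cgroup root?
  if (access("/sys/fs/cgroup", W_OK) != 0) {
    std::cerr << "no write permission on /sys/fs/cgroup\n";
    return 1;
  }
  std::cout << "cgroup v2 is mounted rw and writable\n";
  return 0;
}
```

The same checks can be done by hand with `mount | grep cgroup2` and by inspecting permissions on `/sys/fs/cgroup`.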
#### Implementation details
> **Review discussion:**
>
> - [general] I am a bit worried about cgroup leaks when (1) subprocesses holding the cgroup can't be deleted on task exit, or (2) the raylet crashes. Do you have some good ideas on this? For (1), we certainly don't want these procs; we already use the Linux subreaper to kill leaked procs via `RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper`, and maybe we can have another flag like it. For (2), I can't think of a good idea: if you SIGKILL the raylet, all cgroups are destined to leak. Maybe we can just accept it, but if so we need to document it both here and in the eventual user-facing document. We also have a question (3): do we want to restrict processes created by the raylet so they don't have permission to move their own cgroups? Well, Ray does not have a lot of security features, so we might be OK with this.
> - I see, integrating the cgroup cleanup is a good idea, but we might need a way to get the cgroup folder name from an unknown PID. A hacky alternative is to have another set of cgroups which cannot be deleted after task / actor completion.
> - I cannot think of a good way either.
> - I don't think it's something we can handle at the moment; theoretically we are not containerizing the user process, so it could access everything on the node / filesystem.
> - For the initial version, I would like to implement a simple RAII-style cgroup setup and cleanup for the raylet.
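To sketch the RAII idea from the discussion above (class, function, and path names here are hypothetical; this is not Ray's actual implementation), a cgroup handle could create its directory and memory limit in the constructor and, in the destructor, migrate any remaining processes back to the parent before removing the directory, since cgroup v2 only allows removing a cgroup once it is empty:

```cpp
// raii_cgroup.cc -- illustrative sketch only; names and paths are hypothetical.
#include <sys/types.h>

#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Owns one cgroup v2 directory: creates it (with a memory cap) on
// construction and removes it on destruction.
class ScopedCgroup {
 public:
  ScopedCgroup(fs::path parent, const std::string &name, const std::string &memory_max)
      : path_(parent / name), parent_(std::move(parent)) {
    fs::create_directories(path_);
    // The memory controller must be enabled in the parent's subtree_control
    // before memory.max can be written on the child.
    WriteFile(parent_ / "cgroup.subtree_control", "+memory");
    WriteFile(path_ / "memory.max", memory_max);  // e.g. "512M" or "max"
  }

  // Move a process into this cgroup by writing its pid to cgroup.procs.
  void AddProcess(pid_t pid) { WriteFile(path_ / "cgroup.procs", std::to_string(pid)); }

  ~ScopedCgroup() {
    // A cgroup can only be removed once it has no member processes, so
    // migrate any that are left back to the parent first.
    std::ifstream procs(path_ / "cgroup.procs");
    std::string pid;
    while (std::getline(procs, pid)) {
      WriteFile(parent_ / "cgroup.procs", pid);
    }
    std::error_code ec;
    fs::remove(path_, ec);  // best effort on teardown; ignore errors
  }

 private:
  static void WriteFile(const fs::path &file, const std::string &value) {
    std::ofstream out(file);
    out << value;
  }

  fs::path path_;
  fs::path parent_;
};

int main() {
  // Placeholder ids mirror the folder layout described below.
  ScopedCgroup task_cgroup("/sys/fs/cgroup/ray_<node_id>/application_cgroup",
                           "<task_id>_<attempt_id>_cgroup", "512M");
  task_cgroup.AddProcess(12345);  // placeholder worker pid
  return 0;  // the cgroup is removed when task_cgroup goes out of scope
}
```

This ties cleanup to object lifetime inside the raylet, but, as noted in the discussion above, it does not help when the raylet itself is SIGKILLed.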
cgroup v2 folders are created in a tree structure as follows:

```
              /sys/fs/cgroup/ray_<node_id>
                 /                       \
      .../system_cgroup          .../application_cgroup
                                    /                  \
                    .../default_cgroup/      .../<task_id>_<attempt_id>_cgroup (*N)
```

> **Review discussion:**
>
> - [nit] `task_default`, and `task_<task_id>_<attempt_id>`?
> - In my latest commit I just call it …
- The raylet is responsible for creating and cleaning up its own cgroup folder.
- Each Ray node has its own cgroup folder, whose name contains the node id to differentiate it from other raylet(s).
- `/sys/fs/cgroup/ray_<node_id>/application_cgroup` is where Ray sets the overall max resources for all application processes.
- If a task / actor executes with its max resources specified, it will be placed in a dedicated cgroup, identified by the task id and attempt id.
- Otherwise it will be placed under the default application cgroup, with its max consumption bounded by `/sys/fs/cgroup/ray_<node_id>/application_cgroup` (see the placement sketch below).

> **Review discussion:**
>
> - Write about the attempt id here.
> - Wondering what you are expecting here? Anyway, I added some wording on the attempt id.
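As a hedged illustration of the placement rules in the bullets above (function names and the path placeholders are hypothetical, not Ray's actual API), a worker pid is moved into a cgroup by writing it to that cgroup's `cgroup.procs` file, either in a dedicated `<task_id>_<attempt_id>_cgroup` or in the shared default cgroup:

```cpp
// cgroup_placement.cc -- illustrative sketch only; names are hypothetical.
#include <sys/types.h>

#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Writes `pid` into the cgroup.procs file of `cgroup_dir`, which moves the
// process into that cgroup.
void MoveProcessToCgroup(const fs::path &cgroup_dir, pid_t pid) {
  std::ofstream procs(cgroup_dir / "cgroup.procs");
  procs << pid;
}

// Decide where a worker for a task attempt should live, following the layout
// above: a dedicated cgroup if the task specifies its own max resources,
// otherwise the shared default cgroup.
fs::path PickCgroupForTaskAttempt(const fs::path &application_cgroup,
                                  const std::string &task_id,
                                  int attempt_id,
                                  std::optional<std::string> memory_max) {
  if (!memory_max.has_value()) {
    return application_cgroup / "default_cgroup";
  }
  fs::path dedicated =
      application_cgroup / (task_id + "_" + std::to_string(attempt_id) + "_cgroup");
  fs::create_directories(dedicated);
  std::ofstream limit(dedicated / "memory.max");
  limit << *memory_max;  // e.g. "512M"
  return dedicated;
}

int main() {
  const fs::path app_cgroup = "/sys/fs/cgroup/ray_<node_id>/application_cgroup";
  // A task attempt with an explicit memory cap gets its own cgroup...
  fs::path with_cap = PickCgroupForTaskAttempt(app_cgroup, "task_abc", 0, "512M");
  MoveProcessToCgroup(with_cap, /*pid=*/12345);
  // ...while one without a cap shares the default application cgroup.
  fs::path without_cap = PickCgroupForTaskAttempt(app_cgroup, "task_def", 0, std::nullopt);
  MoveProcessToCgroup(without_cap, /*pid=*/12346);
  return 0;
}
```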
> **Review discussion:**
>
> - Why heading 3 and heading 4, instead of 1 and 2?
> - I start from header 4 and go bigger (3 -> 2 -> 1). Do you have a strong preference on that?