[core] Readme doc for physical mode #49457
## Ray core cgroup documentation

### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

When physical execution mode is enabled, Ray uses cgroups to restrict resource usage, which prevents a user application from consuming unbounded resources and starving other applications running on the same node. Currently only memory capping is supported: Ray sets the cgroup's `memory.max` to limit the maximum memory usage of a task process (and, recursively, all of its subprocesses). For example:
```python
@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

obj = some_function.remote()
```

This function is limited to 500 MiB of memory; if it tries to use more, it OOMs and fails. Note that if the cap is lower than the process's current usage at the moment it is moved into the cgroup, the process is killed immediately after the move.
If a task / actor is not annotated with resource usage, Ray caps its max resource usage via heuristic estimation.

TODO(hjiang): Reserving a minimum resource amount will be supported in the future.
### Prerequisites

- The feature is built upon cgroups, which are only supported on Linux.
- Only cgroup v2 is supported; Ray also requires the application to have write permission, and cgroup v2 must be mounted in read-write mode.
- If any of these prerequisites is unsatisfied while physical mode is enabled, Ray logs an error and the program keeps working.
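These checks can be sketched in Python. This is a hedged sketch, not a Ray utility; the detection trick (only cgroup v2 exposes `cgroup.controllers` at the mount root) is an assumption about a standard unified-hierarchy mount:

```python
import os
import sys

# Illustrative sketch (not a Ray API): report which prerequisites for
# physical execution mode are unmet on this machine.
def check_cgroup_v2(mount_point="/sys/fs/cgroup"):
    """Return a list of unmet prerequisites; an empty list means all satisfied."""
    problems = []
    if not sys.platform.startswith("linux"):
        problems.append("cgroups require Linux")
        return problems
    # cgroup v2 exposes cgroup.controllers at the mount root; cgroup v1 does not.
    if not os.path.exists(os.path.join(mount_point, "cgroup.controllers")):
        problems.append(f"cgroup v2 does not appear to be mounted at {mount_point}")
    elif not os.access(mount_point, os.W_OK):
        problems.append(f"{mount_point} is not writable (needs an rw mount and permission)")
    return problems

print(check_cgroup_v2())
```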
### Disclaimer

- In the initial version, Ray caps max resource usage via heuristic estimation (TODO: support a user passed-in value).
### Implementation details

#### Cgroup hierarchy

cgroup v2 folders are created in a tree structure as follows:
```
/sys/fs/cgroup/ray_node_<node_id>
        /            \
 .../internal    .../application
                   /          \
           .../default   .../<task_id>_<attempt_id> (*N)
```

- The raylet is responsible for creating the cgroup folder at startup, and for cleaning it up at its destruction.
- Each Ray node has its own cgroup folder, named with the node id to differentiate it from other raylets.
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where Ray sets the overall max resource for all user application processes.
- If a task / actor executes with its max resource specified, it is placed in a dedicated cgroup, identified by the task id and attempt id. TODO(hjiang): document whether the attempt id is raylet-wide or per-task.
- Otherwise, it is placed under the default application cgroup, with its max consumption bounded by `/sys/fs/cgroup/ray_node_<node_id>/application`.
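The folder layout above can be illustrated with a small sketch. This is illustrative only, not Ray's implementation; it runs against a temporary directory because creating real cgroup folders requires appropriate privileges:

```python
import os
import tempfile

def create_ray_cgroup_tree(root: str, node_id: str) -> str:
    """Create the cgroup folder layout from the diagram above under `root`."""
    base = os.path.join(root, f"ray_node_{node_id}")
    os.makedirs(os.path.join(base, "internal"), exist_ok=True)
    # Per-task cgroups (<task_id>_<attempt_id>) are created later, on demand.
    os.makedirs(os.path.join(base, "application", "default"), exist_ok=True)
    return base

with tempfile.TemporaryDirectory() as root:
    base = create_ray_cgroup_tree(root, "abc123")
    print(sorted(os.listdir(base)))  # → ['application', 'internal']
```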

#### Cgroup lifecycle

A cgroup's lifecycle is bound to a task / actor attempt.
Before execution, the worker PID is placed into the cgroup;
after completion, the idle worker is returned to the worker pool for later reuse, with its PID moved back to the default cgroup, and the dedicated cgroup (if any) destructed.

TODO(hjiang): Add discussion on how to handle situations where the task finishes while some of its processes do not.
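The lifecycle above can be sketched as follows. This is a hedged, illustrative sketch, not Ray's code: on a real system the worker PID is written into `<cgroup>/cgroup.procs` and the byte limit into `<cgroup>/memory.max`, which requires privileges, so regular files under a temporary directory stand in for the cgroup filesystem here:

```python
import os
import tempfile

def start_attempt(app_dir, task_id, attempt_id, pid, memory_bytes):
    """Create the per-attempt cgroup, set its memory cap, and move the worker in."""
    cg = os.path.join(app_dir, f"{task_id}_{attempt_id}")
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "memory.max"), "w") as f:
        f.write(str(memory_bytes))
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(pid))          # move the worker into the cgroup
    return cg

def finish_attempt(app_dir, cg, pid):
    """Move the idle worker back to the default cgroup and destruct the attempt's cgroup."""
    with open(os.path.join(app_dir, "default", "cgroup.procs"), "a") as f:
        f.write(str(pid) + "\n")   # worker returns to the default cgroup
    for name in os.listdir(cg):    # a real (empty) cgroup dir only needs rmdir
        os.remove(os.path.join(cg, name))
    os.rmdir(cg)                   # destruct the per-attempt cgroup

with tempfile.TemporaryDirectory() as app_dir:
    os.makedirs(os.path.join(app_dir, "default"))
    cg = start_attempt(app_dir, "task42", "0", pid=1234,
                       memory_bytes=500 * 1024 * 1024)
    finish_attempt(app_dir, cg, pid=1234)
    print(os.path.exists(cg))  # → False
```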