## Ray core cgroup documentation

### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:
- If physical execution mode is enabled, Ray uses cgroups to restrict resource usage, so other processes running on the same machine (e.g. system processes like the raylet and GCS) won't get starved or even killed. Currently only the `memory` resource is supported: cgroup `memory.max` caps the max memory usage of a task process and, recursively, all of its subprocesses. For example,
  ```python
  import ray

  @ray.remote(memory=500 * 1024 * 1024)
  def some_function(x):
      pass

  obj = some_function.remote()
  ```
  This function is limited to 500 MiB of memory usage; if it tries to use more, it OOMs and fails.
  + Users can set the limit to any value at node start; if unset, Ray takes a heuristic estimate for all application processes (80% of the total logical resource). This is implemented by setting a max value on the `/sys/fs/cgroup/ray_node_<node_id>/application` node (see the chart and sketch below).
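A rough sketch of what that cap looks like at the cgroup v2 filesystem level; the node id, helper name, and numbers here are illustrative assumptions, not Ray's actual implementation:

```python
from pathlib import Path

# Hypothetical node id; real paths follow /sys/fs/cgroup/ray_node_<node_id>/...
APP_CGROUP = Path("/sys/fs/cgroup/ray_node_example/application")

def set_memory_max(cgroup: Path, limit_bytes: int) -> None:
    # cgroup v2 enforces a hard memory cap on a cgroup (and its descendants)
    # through the `memory.max` control file.
    (cgroup / "memory.max").write_text(str(limit_bytes))

# e.g. cap the application subtree at 80% of 16 GiB of logical memory
set_memory_max(APP_CGROUP, int(0.8 * 16 * 1024**3))
```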

TODO(hjiang): Reserving a minimum resource will be supported in the future.

### Prerequisites

- Ray runs in a Linux environment that supports cgroup v2.
- The cgroup2 filesystem is mounted at `/sys/fs/cgroup`.
- The raylet has write permission to that mounted directory.
- If any of these prerequisites is unsatisfied when physical mode is enabled, Ray logs an error and continues running (a minimal check is sketched below).
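A minimal sketch of such a check, assuming the default mount point; the function name is illustrative, not Ray's actual code:

```python
import os

def cgroup_v2_ready(mount_point: str = "/sys/fs/cgroup") -> bool:
    # cgroup v2 exposes a unified `cgroup.controllers` file at the mount
    # root; its presence distinguishes a v2 mount from cgroup v1.
    has_cgroup_v2 = os.path.isfile(os.path.join(mount_point, "cgroup.controllers"))
    # The raylet also needs write access to create per-node cgroup folders.
    return has_cgroup_v2 and os.access(mount_point, os.W_OK)
```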

### Disclaimer

- In the initial version, Ray caps max resource usage via a heuristic estimate (TODO: support a user passed-in value).

### Implementation details

#### Cgroup hierarchy

cgroup v2 folders are created in a tree structure as follows:

```
/sys/fs/cgroup/ray_node_<node_id>
           /            \
  .../internal     .../application
                    /          \
            .../default   .../<task_id>_<attempt_id> (*N)
```

- Each Ray node has its own cgroup folder, named with the node id to differentiate it from other raylets; in detail, the raylet is responsible for creating the cgroup folders `/sys/fs/cgroup/ray_node_<node_id>`, `/sys/fs/cgroup/ray_node_<node_id>/internal`, and `/sys/fs/cgroup/ray_node_<node_id>/application` at startup, and for cleaning them up upon process exit.
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where Ray sets the overall max resource for all application processes.
  + The max resource respects the user's input at node start; if unset, a heuristic value of 80% of the total logical resource is taken.
- If a task / actor executes with its max resource specified, it is placed in a dedicated cgroup, identified by the task id and attempt id; the cgroup path is `/sys/fs/cgroup/ray_node_<node_id>/application/<task_id>_<attempt_id>` (path construction is sketched below).
  + The task id is a string that uniquely identifies a task.
  + The attempt id is a monotonically increasing integer, used to differentiate executions of the same task and to indicate their order.
- Otherwise it is placed under the default application cgroup, with its max consumption bound by `/sys/fs/cgroup/ray_node_<node_id>/application`.
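As a sketch of the path construction above (the function is illustrative, not Ray's actual code):

```python
def task_cgroup_path(node_id: str, task_id: str, attempt_id: int) -> str:
    # Dedicated cgroup for a single task attempt, per the chart above.
    return f"/sys/fs/cgroup/ray_node_{node_id}/application/{task_id}_{attempt_id}"
```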

TODO(hjiang): Add more details on the attempt id. For example, whether it's raylet-wise or task-wise.

#### Cgroup lifecycle

A cgroup's lifecycle is bound to a task / actor attempt.
Before execution, the worker PID is placed into the task's cgroup;
after completion, the idle worker is put back into the worker pool for later reuse, its PID is moved back to the default cgroup, and the dedicated cgroup, if any, is destructed.
Note that if a process's memory usage already exceeds the cgroup's `memory.max` when it is moved in, the move itself succeeds and the process is killed immediately afterwards.
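A minimal sketch of the PID moves described above, using the cgroup v2 `cgroup.procs` interface; paths and names are illustrative assumptions, not Ray's actual code:

```python
import os

def move_pid_to_cgroup(pid: int, cgroup_dir: str) -> None:
    # Writing a PID to `cgroup.procs` migrates that process into the cgroup.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))

node_root = "/sys/fs/cgroup/ray_node_example"       # hypothetical node id
task_cgroup = f"{node_root}/application/task123_0"  # hypothetical <task_id>_<attempt_id>
worker_pid = 12345                                  # hypothetical worker PID

move_pid_to_cgroup(worker_pid, task_cgroup)         # before the attempt executes
# ... the task attempt runs ...
move_pid_to_cgroup(worker_pid, f"{node_root}/application/default")  # after completion
os.rmdir(task_cgroup)  # an emptied cgroup directory can then be removed
```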

TODO(hjiang): Add discussion on how to deal with situations where the task finishes while some of its processes don't.