[core] Readme doc for physical mode #49457

Merged 18 commits on Dec 31, 2024
62 changes: 62 additions & 0 deletions src/ray/common/cgroup/README.md
## Ray core cgroup documentation

### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:
- If physical execution mode is enabled, Ray uses cgroups to restrict resource usage, so other processes running on the same machine (e.g. system processes like the raylet and GCS) won't get starved or even killed. Currently we only support using `memory` as the cgroup `memory.max` limit to cap the max memory usage of a task process (and all of its subprocesses, recursively). For example,
```python
@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
pass

obj = some_function.remote()
```
This function is limited to 500 MiB of memory usage; if it tries to use more, it OOMs and fails.
Contributor: Q: what happens if you set memory=1 byte and try to move a proc into such a cgroup? Is it a "proc move failure", or does the proc get killed instantly after moving?

Contributor Author: The latter.

+ Users can set the limit to any value at node start; if not set, Ray takes a heuristic estimate for all application processes (i.e. 80% of the total logical resource). This is implemented by setting a max value on the `/sys/fs/cgroup/ray_node_<node_id>/application` node (see the chart below); a sketch follows below.

TODO(hjiang): Reserving a minimum resource amount will be supported in the future.
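
As a rough illustration of the node-level cap described above, the sketch below writes a `memory.max` value for the whole application cgroup, using either a user-provided limit or the 80% heuristic. The helper name and arguments are hypothetical; the real logic lives inside the raylet.

```python
import os

def apply_application_memory_cap(node_id: str,
                                 total_logical_memory_bytes: int,
                                 user_limit_bytes: int | None = None) -> None:
    # Illustrative sketch only, not Ray's actual implementation.
    # Honor the user-provided limit from node start if present; otherwise
    # fall back to the heuristic of 80% of the node's logical memory resource.
    limit = user_limit_bytes if user_limit_bytes is not None else int(total_logical_memory_bytes * 0.8)
    app_cgroup = f"/sys/fs/cgroup/ray_node_{node_id}/application"
    with open(os.path.join(app_cgroup, "memory.max"), "w") as f:
        f.write(str(limit))
```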

### Prerequisites

- Ray runs in a Linux environment that supports Cgroup V2.
- The cgroup2 filesystem is mounted at `/sys/fs/cgroup`.
- Raylet has write permission to that mounted directory.
- If any of these prerequisites is unsatisfied while physical mode is enabled, Ray logs an error and continues running (a sketch of such a check follows this list).
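
A minimal, illustrative check for these prerequisites (the function name is hypothetical; Ray's real check is performed by the raylet):

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"

def cgroup_v2_prerequisites_met() -> bool:
    # cgroup v2 exposes a unified `cgroup.controllers` file at the mount root;
    # its presence is a simple signal that the v2 filesystem is mounted there.
    if not os.path.exists(os.path.join(CGROUP_ROOT, "cgroup.controllers")):
        return False
    # The raylet must be able to create sub-cgroups and write limit files.
    return os.access(CGROUP_ROOT, os.W_OK)
```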

### Disclaimer

- In the initial version, Ray caps max resource usage via a heuristic estimate (TODO: support user passed-in values).
Contributor: Make it clearer which "heuristic estimation" we use, and move this to the "Physical execution mode" section to explain how to use it, e.g. with the `@ray.remote(memory=...)` example above. Also add the lifetimes of a cgroup vs. task attempts: describe how you put a (pre-existing idle) worker into a cgroup, and when you move it out and remove the cgroup.

Contributor Author: Added.


### Implementation details

#### Cgroup hierarchy

cgroup v2 folders are created in a tree structure as follows:

```
          /sys/fs/cgroup/ray_node_<node_id>
                /                  \
        .../internal         .../application
                               /          \
                       .../default   .../<task_id>_<attempt_id> (*N)
```

- Each Ray node has its own cgroup folder, which contains the node id to differentiate it from other raylet(s); in detail, the raylet is responsible for creating the cgroup folders `/sys/fs/cgroup/ray_node_<node_id>`, `/sys/fs/cgroup/ray_node_<node_id>/internal` and `/sys/fs/cgroup/ray_node_<node_id>/application` at startup, and it cleans them up upon process exit;
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where Ray sets the overall max resource for all application processes
Contributor: Say what the max resource is; it should default to 0.8 * total memory, or the user-specified memory resource amount on node start.

Contributor Author: I think we've already mentioned it several lines above? Anyway, added more wording as you wish.

+ The max resource respects the user's input at node start; otherwise a heuristic value of 80% of all logical resources is taken
Contributor: This reads as "if the user does not set it, we cap to 0.8 * logical resources == 0.64 physical resources", while I think we really mean "if the user does not set it, we cap to 1 * logical resources == 0.8 physical resources"?

Contributor Author: Sorry, I don't understand; I haven't mentioned any physical resource in the doc, why do you think so?

Contributor Author: I re-worded it a little, hopefully it reads better :)

- If a task / actor executes with its max resource specified, it will be placed in a dedicated cgroup identified by the task id and attempt id; the cgroup path is `/sys/fs/cgroup/ray_node_<node_id>/application/<task_id>_<attempt_id>` (see the sketch after this list)
  + The task id is a string which uniquely identifies a task
  + The attempt id is a monotonically increasing integer, used to distinguish different executions of the same task and to indicate their order
- Otherwise it will be placed under the default application cgroup, with its max consumption bounded by `/sys/fs/cgroup/ray_node_<node_id>/application`

TODO(hjiang): Add more details on attempt id. For example, whether it's raylet-wise or task-wise.
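
A rough sketch of how a dedicated per-attempt cgroup could be created under the hierarchy above. The paths follow the chart; the helper and its arguments are hypothetical, not the raylet's actual API:

```python
import os

def create_task_cgroup(node_id: str, task_id: str, attempt_id: int,
                       memory_limit_bytes: int | None) -> str:
    # Illustrative sketch only, not Ray's actual implementation.
    app_root = f"/sys/fs/cgroup/ray_node_{node_id}/application"
    if memory_limit_bytes is None:
        # No per-task limit: the worker stays in the shared default cgroup,
        # bounded only by the node-level application cap.
        return os.path.join(app_root, "default")
    # Dedicated cgroup keyed by task id and attempt id.
    cgroup_path = os.path.join(app_root, f"{task_id}_{attempt_id}")
    os.makedirs(cgroup_path, exist_ok=True)
    with open(os.path.join(cgroup_path, "memory.max"), "w") as f:
        f.write(str(memory_limit_bytes))
    return cgroup_path
```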

#### Cgroup lifecycle

A cgroup's lifecycle is bound to a task / actor attempt.
Before execution, the worker PID is placed into the cgroup;
after completion, the idle worker is put back into the worker pool to be reused later, with its PID moved back to the default cgroup and the dedicated cgroup (if any) destructed.

TODO(hjiang): Add discussion on how to handle the situation where the task finishes but some of its processes do not.
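
A simplified sketch of this lifecycle, assuming the per-attempt cgroup from the previous section. In cgroup v2, moving a process is done by writing its PID to the destination cgroup's `cgroup.procs`, and an empty cgroup directory can then be removed; the helper names below are hypothetical:

```python
import os

def move_pid(pid: int, cgroup_path: str) -> None:
    # Writing a PID into cgroup.procs migrates that process (cgroup v2).
    with open(os.path.join(cgroup_path, "cgroup.procs"), "w") as f:
        f.write(str(pid))

def run_attempt_in_cgroup(worker_pid: int, task_cgroup: str, default_cgroup: str) -> None:
    # Illustrative sketch only, not Ray's actual implementation.
    move_pid(worker_pid, task_cgroup)         # before execution
    try:
        pass                                  # ... the task / actor attempt executes ...
    finally:
        move_pid(worker_pid, default_cgroup)  # idle worker returns to the pool
        if task_cgroup != default_cgroup:
            os.rmdir(task_cgroup)             # dedicated cgroup removed once empty
```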