[core] Readme doc for physical mode #49457

Merged Dec 31, 2024 · 18 commits
Changes from 7 commits
58 changes: 58 additions & 0 deletions src/ray/common/cgroup/README.md
@@ -0,0 +1,58 @@
## Ray core cgroup documentation

### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:
- It prevents a user application from consuming unbounded resources and starving other applications running on the same node;
Contributor:

[nit] I think these 2 bullet points are not parallel or mutually exclusive... and it's not "benefits" but more of a neutral "behavior". Also it's not clear what "application" means in this line. Does it mean non-Ray or other-Raylet processes?

Contributor Author:

> Also it's not clear what "application" means in this line. Does it mean non-Ray or other-Raylet processes?

I renamed it to "user application".

Contributor Author:

> and it's not "benefits" but more of a neutral "behavior"

Well, it's the benefit / motivation compared to "no cgroup", which is the current implementation.

Contributor Author (@dentiny, Dec 28, 2024):

> I think these 2 bullet points are not parallel or mutually exclusive...

On a second read, these two bullet points express a similar meaning, so I combined them into one.

- If physical execution mode is enabled, Ray uses cgroup to restrict resource usage. Currently only `memory` is supported: Ray sets the cgroup's `memory.max` to cap the max memory usage of a task process (and all of its subprocesses, recursively). For example,
```python
import ray

@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

obj = some_function.remote()
```
This function is limited to 500 MiB of memory; if it tries to use more, it OOMs and fails.
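Under cgroup v2, this cap conceptually corresponds to writing the byte limit into the `memory.max` file of the cgroup that the task's worker process is placed in. The sketch below is illustrative only; the path layout follows the hierarchy described later in this document, and the helper name is made up rather than Ray's actual code:

```python
import os

def cap_task_memory(task_cgroup_dir: str, limit_bytes: int) -> None:
    """Illustrative sketch: write a hard memory limit into a cgroup v2 folder.

    `task_cgroup_dir` is assumed to look like
    /sys/fs/cgroup/ray_node_<node_id>/application/<task_id>_<attempt_id>.
    """
    # cgroup v2 enforces a hard cap via `memory.max`; processes in the cgroup
    # that exceed it are OOM-killed by the kernel.
    with open(os.path.join(task_cgroup_dir, "memory.max"), "w") as f:
        f.write(str(limit_bytes))

# e.g. cap_task_memory(task_cgroup_dir, 500 * 1024 * 1024) for the example above.
```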
Contributor:

Q: what happens if you set memory=1 byte and try to move a proc into such a cgroup? Is it a "proc move failure", or does the proc get killed instantly after moving?

Contributor Author:

The latter.

- If a task / actor is not annotated with resource usage, Ray caps its max resource usage via heuristic estimation.
Contributor:

Can we make it dead clear what the "heuristic estimation" is? That is, do we have any limitations on the default leaf cgroup?

Contributor Author:

I might have misunderstood, but I added an implementation detail on how the max resource is set.


TODO(hjiang): Reserving minimum resources will be supported in the future.

### Prerequisites

- The feature is built upon cgroup, which is only supported on Linux;
- Only cgroup v2 is supported; Ray also requires the application to have write permission and cgroup v2 to be mounted in rw mode (a minimal prerequisite check is sketched after this list);
- If any of the prerequisites is unsatisfied while physical mode is enabled, Ray logs an error and the program keeps working.
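A minimal sketch of such a prerequisite check, assuming the standard cgroup v2 mount point at `/sys/fs/cgroup` (this is not Ray's actual implementation):

```python
import os
import sys

CGROUP_V2_ROOT = "/sys/fs/cgroup"

def cgroup_v2_prerequisites_satisfied() -> bool:
    """Best-effort check of the prerequisites listed above."""
    # cgroup is a Linux-only kernel feature.
    if not sys.platform.startswith("linux"):
        return False
    # On a cgroup v2 (unified) hierarchy, the root exposes `cgroup.controllers`.
    if not os.path.exists(os.path.join(CGROUP_V2_ROOT, "cgroup.controllers")):
        return False
    # Ray needs write permission, i.e. cgroup v2 mounted in rw mode.
    return os.access(CGROUP_V2_ROOT, os.W_OK)

# Mirrors the documented behavior: if unsatisfied, log an error and keep working.
if not cgroup_v2_prerequisites_satisfied():
    print("cgroup v2 prerequisites not satisfied; physical mode falls back to no-op.")
```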

### Disclaimer

- In the initial version, Ray caps max resource usage via heuristic estimation (TODO: support user passed-in values).
Contributor:

Make this clearer on what "heuristic estimation" we use. Move this to "## Physical execution mode" on how to use it.

e.g.

If physical execution mode is enabled, Ray uses cgroup to restrict resource usage. Now we only support using `memory` as cgroup `memory.max` to cap a task process (and all its subprocesses recursively)'s max memory usage.

For example,

@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

obj = some_function.remote()

This function is limited by 500MiB memory usage, and if it tries to use more, it OOMs and fails.

Also add the lifetime of a cgroup vs. task attempts. Describe how you put a (pre-existing idle) worker into a cgroup, and when you move it out and remove the cgroup.

Contributor Author:

Added.


### Implementation details

#### Cgroup hierarchy

cgroup v2 folders are created in a tree structure as follows:

```
/sys/fs/cgroup/ray_node_<node_id>
├── internal
└── application
    ├── default/
    └── <task_id>_<attempt_id> (*N)
Contributor:

[nit] remove the / after default?

Contributor:

Q: is it clearer to use a tree-like shape?

/sys/fs/cgroup/ray_node_<node_id>
├── internal
└── application
    ├── default
    └── <task_id>_<attempt_id> (*N)

Contributor:

Also specify whether the attempt_id is raylet-wide or per-task.

Contributor Author:

> Also specify whether the attempt_id is raylet-wide or per-task.

I haven't decided yet; I left a TODO to fill in later.

```

- Raylet is responsible for creating the cgroup folder at startup, and for cleaning it up at its destruction
- Each Ray node has its own cgroup folder, whose name contains the node id to differentiate it from other raylet(s)
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where Ray sets the overall max resource for all application processes
Contributor:

Say what the max resource is. It should default to 0.8 of total memory, or a user-specified memory resource amount at node start.

Contributor Author:

I think we've already mentioned it several lines above? Anyway, add more wording as you wish.

- If a task / actor executes with its max resource specified, it will be placed in a dedicated cgroup, identified by the task id and attempt id
Contributor:

write about the attempt id here

Contributor Author (@dentiny, Dec 28, 2024):

Wondering what you are expecting here? Anyway, I added some wording on the attempt id.

- Otherwise it will be placed under the default application cgroup, with its max consumption bounded by `/sys/fs/cgroup/ray_node_<node_id>/application` (a sketch of this layout follows the list)
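The following sketch illustrates how the hierarchy above could be created and capped. The folder names follow the tree in this section, while the function names and the `cgroup.subtree_control` handling are assumptions rather than Ray's actual code:

```python
import os

def create_node_cgroups(node_id: str, application_memory_max: int) -> str:
    """Create the per-node cgroup tree described above (illustrative only)."""
    root = f"/sys/fs/cgroup/ray_node_{node_id}"
    for sub in ("internal", "application", os.path.join("application", "default")):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # In cgroup v2, children can only use the memory controller if the parent
    # enables it via `cgroup.subtree_control`.
    with open(os.path.join(root, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")
    with open(os.path.join(root, "application", "cgroup.subtree_control"), "w") as f:
        f.write("+memory")
    # Overall cap for all application processes.
    with open(os.path.join(root, "application", "memory.max"), "w") as f:
        f.write(str(application_memory_max))
    return root

def create_task_cgroup(root: str, task_id: str, attempt_id: str, memory_max: int) -> str:
    """Dedicated cgroup for one task / actor attempt with an explicit memory request."""
    task_dir = os.path.join(root, "application", f"{task_id}_{attempt_id}")
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, "memory.max"), "w") as f:
        f.write(str(memory_max))
    return task_dir
```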

#### Cgroup lifecycle

A cgroup's lifecycle is bound to a task / actor attempt.
Before execution, the worker PID is placed into the cgroup;
after completion, the idle worker is put back into the worker pool for later reuse, its PID is moved back to the default cgroup, and the dedicated cgroup, if any, is destroyed.
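A minimal sketch of that lifecycle under cgroup v2, assuming the hypothetical helpers sketched earlier (process migration is done by writing the PID into `cgroup.procs`, and an empty cgroup is removed with `rmdir`):

```python
import os

def run_attempt_in_cgroup(worker_pid: int, task_cgroup: str, default_cgroup: str) -> None:
    """Illustrative lifecycle of a dedicated cgroup for one task / actor attempt."""
    # Before execution: migrate the worker into the attempt's cgroup.
    with open(os.path.join(task_cgroup, "cgroup.procs"), "w") as f:
        f.write(str(worker_pid))
    try:
        pass  # ... the task / actor attempt executes here ...
    finally:
        # After completion: the idle worker goes back to the worker pool, so its
        # PID is moved back into the default application cgroup ...
        with open(os.path.join(default_cgroup, "cgroup.procs"), "w") as f:
            f.write(str(worker_pid))
        # ... and the now-empty dedicated cgroup is removed.
        os.rmdir(task_cgroup)
```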

TODO(hjiang): Add discussion on how to deal with the situation where the task finishes while some of its processes don't.