[core] Readme doc for physical mode #49457

Merged · 18 commits · Dec 31, 2024 · Changes from 1 commit
src/ray/common/cgroup/README.md (35 additions & 0 deletions)
### Ray core cgroup documentation
Contributor:

Why heading 3 and heading 4, instead of 1 and 2?

Contributor Author:

I start from header 4 and go bigger (3 -> 2 -> 1). Do you have a strong preference on that?


#### Physical execution mode

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:
- It prevents a user application from consuming unbounded resources and starving other applications running on the same node;
Contributor:

[nit] I think these 2 bullet points are not parallel or mutually exclusive... and it's not "benefits" but more of a neutral "behavior". Also it's not clear what "application" means in this line. Does it mean non-Ray or other-raylet processes?

Contributor Author:

> Also it's not clear what "application" means in this line. Does it mean non-Ray or other-raylet processes?

I renamed it to "user application".

Contributor Author:

> and it's not "benefits" but more of a neutral "behavior"

Well, it's the benefit / motivation compared to "no cgroup", which is the current implementation.

Contributor Author (@dentiny, Dec 28, 2024):

> I think these 2 bullet points are not parallel or mutually exclusive...

On a second read, these two bullet points express a similar meaning, so I combined them into one.

- It protects other processes on the node (including Ray system components) from being killed due to insufficient resources (i.e. OOM);

TODO(hjiang): Reserving minimum resources will be supported in the future.

#### Disclaimer and presumption

- The feature is built upon cgroup, which is only supported on Linux;
- Only cgroup v2 is supported; Ray also requires the application to have write permission and cgroup v2 to be mounted in rw mode (see the verification sketch further below);
- In the initial version, Ray caps max resource usage via heuristic estimation (TODO: support user passed-in values).
Contributor:

Make it clearer what "heuristic estimation" we use. Move this to the "Physical execution mode" section and explain how to use it.

e.g.

If physical execution mode is enabled, Ray uses cgroup to restrict resource usage. For now we only support using `memory` as cgroup `memory.max` to cap the max memory usage of a task process (and all its subprocesses, recursively).

For example,

```python
import ray

@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

obj = some_function.remote()
```

This function is limited to 500 MiB of memory usage; if it tries to use more, it OOMs and fails.

Also add the lifetimes of a cgroup vs. task attempts. Describe how you put a (pre-existing idle) worker into a cgroup, and when you move it out and remove the cgroup.

Contributor Author:

Added.
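
A minimal sketch (not part of this PR) of how the cgroup v2 presumption above could be checked; the helper name and the probe-directory approach are assumptions:

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Sketch: /sys/fs/cgroup is a cgroup v2 (unified) mount iff cgroup.controllers
// exists there; a rw mount with sufficient permission lets us create (and then
// remove) a probe child cgroup, which is what Ray needs to manage its folders.
bool CgroupV2WritableAtDefaultMount() {
  namespace fs = std::filesystem;
  const fs::path root = "/sys/fs/cgroup";
  if (!fs::exists(root / "cgroup.controllers")) {
    return false;  // Not mounted as cgroup v2.
  }
  std::error_code ec;
  const fs::path probe = root / "ray_probe_cgroup";
  if (!fs::create_directory(probe, ec) || ec) {
    return false;  // Read-only mount or no write permission.
  }
  fs::remove(probe, ec);
  return true;
}
```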


#### Implementation details
Contributor:

[general] I am a bit worried about cgroup leaks when (1) subprocesses holding the cgroup can't be deleted on task exit, or (2) raylet crashes. Do you have some good ideas on this?

For (1) we certainly don't want these procs. We already use the Linux subreaper to kill leaked procs via RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper, and maybe we can have another flag like RAY_kill_child_processes_on_worker_exit_with_cgroup so that on cgroup exit we SIGKILL all procs in the cgroup. WDYT?

For (2) I can't think of a good idea. If you SIGKILL raylet, all cgroups are destined to leak. Maybe we can just accept it, but if so we need to document it both here and in the eventual user-facing document.

We also have a question (3): do we want to restrict processes created by raylet so they don't have permission to move their own cgroups? Well, Ray does not have a lot of security features, so we might be OK with this.

Contributor Author:

> (1) subprocesses holding the cgroup can't be deleted on task exit

I see, integrating the cgroup cleanup is a good idea, but we might need a way to get the cgroup folder name from an unknown PID. A hackier alternative is to have another set of cgroups which are not deleted after task / actor completion.
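
A rough sketch of the "SIGKILL everything in the cgroup on exit" idea suggested above, assuming cgroup v2's `cgroup.kill` interface (available since Linux 5.14); the helper name is hypothetical:

```cpp
#include <fstream>
#include <string>

// Sketch only: cgroup v2 exposes a `cgroup.kill` file; writing "1" to it sends
// SIGKILL to every process in the cgroup (and its descendants), which clears
// the cgroup so the directory can then be removed with rmdir.
bool KillAllProcessesInCgroup(const std::string &cgroup_dir) {
  std::ofstream kill_file(cgroup_dir + "/cgroup.kill");
  if (!kill_file.is_open()) {
    return false;  // Kernel too old (< 5.14) or cgroup already gone.
  }
  kill_file << "1";
  return kill_file.good();
}
```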

Contributor Author:

> For (2) I can't think of a good idea. If you SIGKILL raylet, all cgroups are destined to leak. Maybe we can just accept it, but if so we need to document it both here and in the eventual user-facing document.

I cannot think of a good way either.

Contributor Author:

> We also have a question (3): do we want to restrict processes created by raylet so they don't have permission to move their own cgroups? Well, Ray does not have a lot of security features, so we might be OK with this.

I don't think it's something we can handle at the moment; theoretically we do not containerize the user process, so it could access everything on the node / filesystem.

Contributor Author:

> [general] I am a bit worried about cgroup leaks when (1) subprocesses holding the cgroup can't be deleted on task exit

In the initial version, I would like to implement a simple RAII-style cgroup setup and cleanup for raylet.
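
A minimal sketch of what such an RAII-style wrapper could look like (illustrative only; the class name and cleanup policy are assumptions, not this PR's implementation):

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Sketch: the constructor creates the node-level cgroup folder and the
// destructor removes it. In cgroup v2, rmdir only succeeds once the cgroup has
// no member processes, so any leaked subprocesses must be killed or moved out
// before the destructor can actually reclaim the folder.
class ScopedCgroup {
 public:
  explicit ScopedCgroup(const std::string &node_id)
      : path_("/sys/fs/cgroup/ray_" + node_id) {
    std::error_code ec;
    std::filesystem::create_directory(path_, ec);  // Requires a rw cgroup2 mount.
  }
  ~ScopedCgroup() {
    std::error_code ec;
    std::filesystem::remove(path_, ec);  // Best effort; a raylet crash skips this.
  }

  ScopedCgroup(const ScopedCgroup &) = delete;
  ScopedCgroup &operator=(const ScopedCgroup &) = delete;

  const std::string &path() const { return path_; }

 private:
  std::string path_;
};
```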


cgroup v2 folders are created in a tree structure, as follows:

```
            /sys/fs/cgroup/ray_<node_id>
              /                      \
   .../system_cgroup        .../application_cgroup
                              /                  \
               .../default_cgroup      .../<task_id>_<attempt_id>_cgroup (*N)
```
Contributor:

[nit] `task_default`, and `task_<task_id>_<attempt_id>`?

Contributor Author:

In my latest commit I just call it default; do we need to differentiate between tasks and actors here? From the users' perspective they're different concepts, and the cgroup is actually exposed to users.

- Raylet is responsible for creating and cleaning up its own cgroup folder
- Each Ray node has its own cgroup folder, named with the node id to differentiate it from other raylet(s)
- `/sys/fs/cgroup/ray_<node_id>/application_cgroup` is where Ray sets the overall max resource for all application processes
- If a task / actor executes with its max resource specified, it will be placed in a dedicated cgroup, identified by the task id and attempt id (see the sketch after this list)
Contributor:

Write about the attempt id here.

Contributor Author (@dentiny, Dec 28, 2024):

Wondering what you are expecting here? Anyway, I added some wording on the attempt id.

- Otherwise it will be placed under the default application cgroup, with its max consumption bounded by `/sys/fs/cgroup/ray_<node_id>/application_cgroup`
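
As referenced in the list above, a minimal sketch of how a dedicated per-task cgroup might be created and a worker process placed into it (illustrative only; the function name and error handling are assumptions, and it assumes the `memory` controller was enabled on the parent cgroups via `cgroup.subtree_control`):

```cpp
#include <sys/types.h>

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Sketch: create .../<task_id>_<attempt_id>_cgroup under the application
// cgroup, cap its memory via memory.max, and move the worker into it by
// writing the worker's PID to cgroup.procs.
bool PlaceWorkerInTaskCgroup(const std::string &node_id,
                             const std::string &task_id,
                             const std::string &attempt_id,
                             pid_t worker_pid,
                             uint64_t memory_max_bytes) {
  const std::string cgroup_dir = "/sys/fs/cgroup/ray_" + node_id +
                                 "/application_cgroup/" + task_id + "_" +
                                 attempt_id + "_cgroup";
  std::error_code ec;
  std::filesystem::create_directories(cgroup_dir, ec);
  if (ec) {
    return false;
  }

  std::ofstream memory_max(cgroup_dir + "/memory.max");
  memory_max << memory_max_bytes;
  if (!memory_max.good()) {
    return false;
  }

  std::ofstream cgroup_procs(cgroup_dir + "/cgroup.procs");
  cgroup_procs << worker_pid;
  return cgroup_procs.good();
}
```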