
[core] Readme doc for physical mode #49457

Open
wants to merge 16 commits into master from hjiang/cgroup-readme
Conversation

Contributor

@dentiny dentiny commented Dec 27, 2024

Discussed with @rynewang offline, it would be nice to have a short doc illustrating cgroup structure and lifecycle.

@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Dec 27, 2024
@dentiny dentiny requested review from jjyao and rynewang December 27, 2024 07:40
@dentiny dentiny force-pushed the hjiang/cgroup-readme branch from 1bf9563 to 34184d3 Compare December 27, 2024 07:42
Signed-off-by: dentiny <[email protected]>
@@ -0,0 +1,35 @@
### Ray core cgroup documentation
Contributor

Why heading 3 and heading 4, instead of 1 and 2?

Contributor Author

I start from heading 4 and go bigger (3 -> 2 -> 1). Do you have a strong preference on that?


- The feature is built upon cgroup, which is only supported on Linux;
- Only cgroup v2 is supported; ray also requires the application to have write permission and cgroup v2 to be mounted in rw mode;
- In the initial version, ray caps max resource usage via heuristic estimation (TODO: support user passed-in values).
Contributor

Make it clearer what "heuristic estimation" we use. Move this to `## Physical execution mode`, about how to use it.

e.g.

If physical execution mode is enabled, Ray uses cgroup to restrict resource usage. For now we only support `memory`: it's used as the cgroup's `memory.max` to cap the max memory usage of a task process (and all its subprocesses, recursively).

For example,

import ray

@ray.remote(memory=500 * 1024 * 1024)
def some_function(x):
    pass

obj = some_function.remote()

This function is limited to 500MiB of memory usage, and if it tries to use more, it OOMs and fails.

Also add lifetimes of a cgroup vs task attempts. Describe how you put a (pre-existing idle) worker into a cgroup, and when you move it out and remove a cgroup.
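A minimal sketch of the mechanism described above, assuming cgroup v2 mounted at /sys/fs/cgroup; the function names and path layout are illustrative, not Ray's actual implementation:

```python
import os

def create_task_cgroup(node_id: str, task_id: str, attempt_id: str, memory_bytes: int) -> str:
    """Create a per-task cgroup and cap its memory via cgroup v2's memory.max."""
    path = f"/sys/fs/cgroup/ray_node_{node_id}/application/{task_id}_{attempt_id}"
    os.makedirs(path, exist_ok=True)
    # memory.max is the hard limit; exceeding it triggers the kernel OOM killer.
    with open(os.path.join(path, "memory.max"), "w") as f:
        f.write(str(memory_bytes))
    return path

def move_worker_into_cgroup(path: str, pid: int) -> None:
    """Writing a PID to cgroup.procs moves that process (and its future children) in."""
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```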

Contributor Author

Added.

                 /                       \
  .../system_cgroup               .../application_cgroup
                                   /                    \
                    .../default_cgroup/      .../<task_id>_<attempt_id>_cgroup (*N)
Contributor

[nit] task_default, and task_<task_id>_<attempt_id> ?

Contributor Author

In my latest commit I just call it default; do we need to differentiate task and actor here? From the users' perspective they're different concepts, and the cgroup is actually exposed to users.

- Only cgroup v2 is supported; ray also requires the application to have write permission and cgroup v2 to be mounted in rw mode;
- In the initial version, ray caps max resource usage via heuristic estimation (TODO: support user passed-in values).

#### Implementation details
Contributor

[general] I am a bit worried about cgroup leaks when (1) subprocesses hold the cgroup so it can't be deleted on task exit, or (2) the raylet crashes. Do you have any good ideas on this?

For (1) we certainly don't want these procs. We already use the Linux subreaper to kill leaked procs via RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper, and maybe we can have another flag like RAY_kill_child_processes_on_worker_exit_with_cgroup so that on cgroup exit we SIGKILL all procs in the cgroup. WDYT?

For (2) I can't think of a good idea. If you SIGKILL the raylet, all cgroups are destined to leak. Maybe we can just accept it, but if so we need to document it both here and in the eventual user-facing document.

Also we have a Q (3): do we want to restrict processes created by the raylet so they don't have permission to move their own cgroups? Well, Ray does not have a lot of security features, so we might be OK with this.
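A minimal sketch of the SIGKILL-on-cgroup-exit idea from (1), assuming cgroup v2; the function name and the retry-free cleanup are illustrative only, not part of this PR:

```python
import os
import signal

def destroy_cgroup(path: str) -> None:
    """Kill everything left in the cgroup, then remove it."""
    kill_file = os.path.join(path, "cgroup.kill")
    if os.path.exists(kill_file):
        # cgroup v2 (kernel >= 5.14): writing "1" to cgroup.kill kills all processes in the cgroup.
        with open(kill_file, "w") as f:
            f.write("1")
    else:
        # Fallback: SIGKILL every PID still listed in cgroup.procs.
        with open(os.path.join(path, "cgroup.procs")) as f:
            for line in f:
                os.kill(int(line.strip()), signal.SIGKILL)
    # rmdir only succeeds once no live processes remain; real code would retry briefly.
    os.rmdir(path)
```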

Contributor Author

(1) subprocesses hold the cgroup so it can't be deleted on task exit

I see, integrating the cgroup cleanup is a good idea, but we might need a way to get the cgroup folder name from an unknown PID. Or a hacky way is to have another set of cgroups which cannot be deleted after task / actor completion.

Contributor Author

For (2) I can't think of a good idea. If you SIGKILL the raylet, all cgroups are destined to leak. Maybe we can just accept it, but if so we need to document it both here and in the eventual user-facing document.

I cannot think of a good way either.

Contributor Author

Also we have a Q (3): do we want to restrict processes created by the raylet so they don't have permission to move their own cgroups? Well, Ray does not have a lot of security features, so we might be OK with this.

Don't think it's something we can handle at the moment; theoretically we do not containerize the user process, so it can access everything on the node / filesystem.

Contributor Author

[general] I am a bit worried about cgroup leaks when (1) subprocesses hold the cgroup so it can't be deleted on task exit

In the initial version, I would like to implement a simple RAII-style cgroup setup and cleanup for the raylet.
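The RAII idea could look roughly like the following; this is a Python context-manager sketch of the pattern only (the raylet itself is C++), and the folder layout simply mirrors the hierarchy discussed in this PR:

```python
import os
from contextlib import contextmanager

@contextmanager
def node_cgroup_hierarchy(node_id: str):
    """Create the per-node cgroup hierarchy on setup and remove it on teardown."""
    root = f"/sys/fs/cgroup/ray_node_{node_id}"
    subdirs = [root,
               os.path.join(root, "internal"),
               os.path.join(root, "application"),
               os.path.join(root, "application", "default")]
    for d in subdirs:
        os.makedirs(d, exist_ok=True)
    try:
        yield root
    finally:
        # Cleanup runs only on a normal exit; a SIGKILL'ed raylet never reaches
        # this point, which is exactly the leak case discussed above.
        for d in reversed(subdirs):
            try:
                os.rmdir(d)
            except OSError:
                pass  # e.g. a leaked process still lives in the cgroup
```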

@dentiny dentiny requested a review from rynewang December 28, 2024 00:08
Signed-off-by: dentiny <[email protected]>
Contributor

@rynewang rynewang left a comment

Added some comments and suggestions. The goal is that another software engineer could implement this from the readme file alone; hence, tomorrow they could debug it as well.


obj = some_function.remote()
```
This function is limited to 500MiB of memory usage, and if it tries to use more, it OOMs and fails.
Contributor

Q: what happens if you set memory=1 byte and try to move a proc into such a cgroup? Is it a "proc move failure", or does the proc get killed instantly after moving?

Contributor Author

The latter.

Ray core supports a physical execution mode, which allows users to cap resource consumption for their applications.

A few benefits:
- It prevents an application from eating up unlimited resources and starving other applications running on the same node;
Contributor

[nit] I think these 2 bullet points are not parallel or mutually exclusive... and it's not "benefits" but more of a neutral "behavior". Also it's not clear what "application" means in this line. Does it mean non-Ray or other-Raylet processes?

Contributor Author

Also it's not clear what "application" means in this line. Does it mean non-Ray or other-Raylet processes?

I renamed it to "user application".

Contributor Author

and it's not "benefits" but more of a neutral "behavior"

Well, it's the benefit / motivation compared to "no cgroup", which is the current implementation.

Contributor Author

@dentiny dentiny Dec 28, 2024

I think these 2 bullet points are not parallel or mutually exclusive...

On a second read, these two bullet points express a similar meaning, so I combined them into one.

obj = some_function.remote()
```
This function is limited to 500MiB of memory usage, and if it tries to use more, it OOMs and fails.
+ If a task / actor is not annotated with resource usage, ray caps its max resource usage via heuristic estimation.
Contributor

Can we make it dead clear what the "heuristic estimation" is? That is, do we have any limitations on the default leaf cgroup?

Contributor Author

I might misunderstand, but I added an implementation detail on how the max resource is set.

              /                   \
   .../internal             .../application
                             /             \
                .../default/     .../<task_id>_<attempt_id> (*N)
Contributor

[nit] remove the / after default?

Contributor

Q: is it clearer to use a tree-like shape?

/sys/fs/cgroup/ray_node_<node_id>
├── internal
└── application
    ├── default
    └── <task_id>_<attempt_id> (*N)

Contributor

Also specify whether the attempt_id is raylet-wide or per-task.

Contributor Author

Also specify whether the attempt_id is raylet-wide or per-task.

I haven't decided yet; I'll leave a TODO to fill in later.


- The raylet is responsible for creating the cgroup folder at startup, and for cleaning up the folder at its destruction
- Each ray node has its own cgroup folder, which contains the node id to differentiate it from other raylet(s)
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets the overall max resource for all application processes
Contributor

Say what the max resource is: it should default to 0.8 of total memory, or a user-specified memory resource amount on node start.
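In code form, the suggested default could read something like this (illustrative only; the 0.8 factor and the user-override parameter come from the comment above, not from the merged doc):

```python
def application_memory_max(total_node_memory_bytes: int,
                           user_specified_bytes: int | None = None) -> int:
    """Overall cap written to /sys/fs/cgroup/ray_node_<node_id>/application."""
    if user_specified_bytes is not None:
        return user_specified_bytes               # user-specified amount on node start
    return int(0.8 * total_node_memory_bytes)     # heuristic default: 80% of total memory
```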

Contributor Author

I think we've already mentioned it several lines above? Anyway, add more wording as you wish.

- The raylet is responsible for creating the cgroup folder at startup, and for cleaning up the folder at its destruction
- Each ray node has its own cgroup folder, which contains the node id to differentiate it from other raylet(s)
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets the overall max resource for all application processes
- If a task / actor executes with its max resource specified, it will be placed in a dedicated cgroup, identified by the task id and attempt id
Contributor

write about the attempt id here

Contributor Author

@dentiny dentiny Dec 28, 2024

Wondering what you are expecting here? Anyway, I added some wording on the attempt id.

@dentiny dentiny force-pushed the hjiang/cgroup-readme branch from 7859cd1 to 6e18292 Compare December 28, 2024 06:41
@dentiny dentiny requested a review from rynewang December 28, 2024 07:04
Contributor

@rynewang rynewang left a comment

LGTM, with a question left


- Each ray node has its own cgroup folder, which contains the node id to differentiate it from other raylet(s); in detail, the raylet is responsible for creating the cgroup folders `/sys/fs/cgroup/ray_node_<node_id>`, `/sys/fs/cgroup/ray_node_<node_id>/internal` and `/sys/fs/cgroup/ray_node_<node_id>/application` at startup, and it cleans up the folders upon process exit;
- `/sys/fs/cgroup/ray_node_<node_id>/application` is where ray sets the overall max resource for all application processes
+ The max resource respects the user's input on node start; otherwise a heuristic value of 80% of all logical resources will be taken
Contributor

This reads as "if the user does not set it, we cap to 0.8 * logical resources == 0.64 * physical resources", while I think we really mean:

"if the user does not set it, we cap to 1 * logical resources == 0.8 * physical resources"?

Contributor Author

Sorry, I don't understand; I haven't mentioned any physical resources in the doc, so why do you think so?

Labels
go add ONLY when ready to merge, run all tests

3 participants