[Bug] Submit multiple RayJobs concurrently will cause ray-operator to slow down #2646

Open
1 of 2 tasks
Moonquakes opened this issue Dec 13, 2024 · 2 comments

Comments

@Moonquakes

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I deployed KubeRay v1.2.2 with three ray-operator replicas and --enable-leader-election turned on. Monitoring shows that when a large number of RayJobs are submitted at the same time, the ray-operator queue gets blocked, which also affects other Ray Clusters.
[Screenshot from monitoring; image not available]

I want to confirm whether ray-operator is stateless. If I leave --enable-leader-election unset and run more replicas, would that alleviate the problem?

Reproduction script

Submit 100 RayJobs to KubeRay at the same time with a script.
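
The script used for this report is not included in the issue. As a rough sketch of what such a reproduction could look like, the following generates 100 minimal RayJob manifests as a single Kubernetes `List` in JSON, which `kubectl apply -f -` accepts. The job names (`repro-rayjob-NNN`), the `default` namespace, the `rayproject/ray:2.9.0` image, the resource requests, and the trivial entrypoint are all assumptions chosen for illustration, not values taken from the report.

```python
import json


def make_rayjob(index, namespace="default", image="rayproject/ray:2.9.0"):
    """Build one minimal RayJob manifest as a plain dict.

    The name prefix, namespace, image, and resources are hypothetical
    placeholders; adjust them to match the cluster under test.
    """
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {
            "name": f"repro-rayjob-{index:03d}",
            "namespace": namespace,
        },
        "spec": {
            # A trivial entrypoint so each job finishes quickly.
            "entrypoint": "python -c \"import ray; ray.init(); print(ray.cluster_resources())\"",
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "ray-head",
                                    "image": image,
                                    "resources": {
                                        "requests": {"cpu": "500m", "memory": "1Gi"}
                                    },
                                }
                            ]
                        }
                    },
                }
            },
        },
    }


def make_rayjob_list(count=100):
    """Wrap `count` RayJob manifests in a single Kubernetes List object."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [make_rayjob(i) for i in range(count)],
    }


if __name__ == "__main__":
    # Emit compact JSON on stdout so it can be piped into kubectl.
    print(json.dumps(make_rayjob_list(100)))
```

Saved under a hypothetical name such as `gen_rayjobs.py`, it could be run as `python gen_rayjobs.py | kubectl apply -f -`. kubectl creates the objects one after another, so the submissions are near-simultaneous rather than strictly concurrent.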

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
Moonquakes added the bug and triage labels Dec 13, 2024
@kevin85421
Member

Can you benchmark the queuing delay with 100 RayClusters (no RayJobs) versus 100 RayJobs? If 100 RayClusters are much faster than 100 RayJobs, I think I can locate the root cause and provide a fix.
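
One way to put a number on the queuing delay for this comparison, sketched here under assumptions: if the operator's metrics endpoint exposes the standard controller-runtime work queue histogram `workqueue_queue_duration_seconds`, the mean time an item waited in each queue is its `_sum` divided by its `_count`. The helper below parses Prometheus text-format output and computes that mean per queue. The queue names `rayjob` and `raycluster` in the usage example are hypothetical; check the `name` label values in the actual scrape.

```python
import re

# Matches sample lines such as:
#   workqueue_queue_duration_seconds_sum{name="rayjob"} 12.5
# An optional trailing timestamp is tolerated.
_SAMPLE = re.compile(
    r'^workqueue_queue_duration_seconds_(sum|count)\{([^}]*)\}\s+(\S+)(?:\s+-?\d+)?$'
)
# Picks out the `name` label only, not labels that merely end in "name".
_NAME = re.compile(r'(?:^|,)\s*name="([^"]*)"')


def mean_queue_delay(metrics_text):
    """Return {queue_name: mean seconds an item waited in the work queue}.

    Queues with no observations (count == 0) are omitted.
    """
    totals = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments, buckets, and unrelated metrics
        kind, labels, value = match.groups()
        name_match = _NAME.search(labels)
        if not name_match:
            continue
        totals.setdefault(name_match.group(1), {})[kind] = float(value)
    return {
        name: t["sum"] / t["count"]
        for name, t in totals.items()
        if t.get("count") and "sum" in t
    }


# Usage with made-up numbers (queue names are hypothetical):
example = """\
workqueue_queue_duration_seconds_sum{name="rayjob"} 50
workqueue_queue_duration_seconds_count{name="rayjob"} 100
workqueue_queue_duration_seconds_sum{name="raycluster"} 5
workqueue_queue_duration_seconds_count{name="raycluster"} 100
"""
delays = mean_queue_delay(example)
```

These counters are cumulative since the operator started, so for a benchmark it is more meaningful to scrape before and after each run and compare the differences in `_sum` and `_count`.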

@Moonquakes
Author

@kevin85421 I tested 256 RayClusters (no RayJobs) versus 100 RayJobs. Indeed, latency is much lower for RayClusters than for RayJobs.
RayJob: [screenshot; image not available]
Ray Cluster: [screenshot; image not available]

I would still like to know whether adding more ray-operator replicas scales linearly. I think it is difficult to draw reliable conclusions from black-box testing alone.
