[Bug] Autoscaler crashes when resources.limits is not set #2612

Open
1 of 2 tasks
Tracked by #2600
zjj2wry opened this issue Dec 5, 2024 · 1 comment
Labels
autoscaler bug Something isn't working

Comments

zjj2wry commented Dec 5, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Ray head pod spec (excerpt):

  containers:
  - args:
    - 'ulimit -n 65536; ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --memory=4294967296  --metrics-export-port=8080  --no-monitor  --num-cpus=2 '
    command:
    - /bin/bash
    - -lc
    - --
    env:
    - name: SHELL
      value: /bin/bash
    - name: RAY_CLUSTER_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['ray.io/cluster']
    - name: RAY_CLOUD_INSTANCE_ID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RAY_NODE_TYPE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels['ray.io/group']
    - name: KUBERAY_GEN_RAY_START_CMD
      value: 'ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --memory=4294967296  --metrics-export-port=8080  --no-monitor  --num-cpus=2 '
    image: xxx/ray:2.38.0-py39-cpu
    imagePullPolicy: IfNotPresent
    name: ray-head
...
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: "2"
        memory: 4Gi

autoscaler log:

Set the `--num-cpus` rayStartParam and/or the CPU resource limit for the Ray container.
The Ray head is ready. Starting the autoscaler.
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2614, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2342, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 76, in run_kuberay_autoscaler
    Monitor(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 583, in run
    self._initialize_autoscaler()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 231, in _initialize_autoscaler
    self.autoscaler = StandardAutoscaler(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 251, in __init__
    self.reset(errors_fatal=True)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1122, in reset
    raise e
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1035, in reset
    new_config = self.config_reader()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    autoscaling_config = _derive_autoscaling_config_from_ray_cr(ray_cr)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 96, in _derive_autoscaling_config_from_ray_cr
    available_node_types = _generate_available_node_types_from_ray_cr_spec(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 195, in _generate_available_node_types_from_ray_cr_spec
    _HEAD_GROUP_NAME: _node_type_from_group_spec(headGroupSpec, is_head=True),
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 217, in _node_type_from_group_spec
    resources = _get_ray_resources_from_group_spec(group_spec, is_head)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 249, in _get_ray_resources_from_group_spec
    num_cpus = _get_num_cpus(ray_start_params, k8s_resource_limits, group_name)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 316, in _get_num_cpus
    raise ValueError(
ValueError: Autoscaler failed to detect `CPU` resources for group head-group.
Set the `--num-cpus` rayStartParam and/or the CPU resource limit for the Ray container.
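
The error message suggests the autoscaler looks for a CPU count in two places: a `num-cpus` entry in rayStartParams, then the CPU limit of the Ray container. The following is a minimal sketch of that lookup order, not Ray's actual source; the function names `detect_num_cpus` and `cpu_quantity_to_int` are hypothetical, and it assumes the RayCluster's rayStartParams did not include `num-cpus`. It illustrates why a spec with only a CPU request ends in a ValueError.

```python
import math
from typing import Dict


def cpu_quantity_to_int(quantity: str) -> int:
    """Round a Kubernetes CPU quantity such as "2", "1.5" or "500m" up to a whole CPU."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m == 1 CPU.
        return math.ceil(int(quantity[:-1]) / 1000)
    return math.ceil(float(quantity))


def detect_num_cpus(
    ray_start_params: Dict[str, str],
    k8s_resource_limits: Dict[str, str],
    group_name: str,
) -> int:
    """Hypothetical lookup: rayStartParams first, then the container CPU limit."""
    if "num-cpus" in ray_start_params:
        return int(ray_start_params["num-cpus"])
    if "cpu" in k8s_resource_limits:
        return cpu_quantity_to_int(k8s_resource_limits["cpu"])
    # CPU requests are never consulted in this sketch, so a spec that
    # only sets requests.cpu falls through to the error.
    raise ValueError(
        f"Autoscaler failed to detect `CPU` resources for group {group_name}."
    )


if __name__ == "__main__":
    # Limits from the head spec above: memory only, no cpu.
    try:
        detect_num_cpus({}, {"memory": "4Gi"}, "head-group")
    except ValueError as err:
        print(err)
```

With the head spec from this report, only `memory` is present under limits, so neither branch matches even though `requests.cpu` is "2".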

Reproduction script

Create a RayCluster (with the autoscaler enabled) whose Ray container does not set a CPU limit under resources.limits.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
zjj2wry added the bug (Something isn't working) and triage labels Dec 5, 2024
@andrewsykim
Collaborator

I believe #2365 should fix this, since KubeRay will then set --num-cpus based on requests when limits is not set. However, that change will only be available starting with KubeRay v1.3. In the meantime, you need to set num-cpus in rayStartParams to match what you specified in the CPU requests.
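
As a rough illustration of that workaround, the sketch below fills in `num-cpus` from the CPU request on a group spec loaded as a Python dict. The helper name `set_num_cpus_from_requests` is hypothetical, and the sketch assumes the Ray container is the first container in the pod template and that the group spec uses the `rayStartParams` and `template` fields of the RayCluster custom resource. It is not how #2365 implements the fix.

```python
import math
from typing import Any, Dict


def cpu_quantity_to_int(quantity: str) -> int:
    """Round a Kubernetes CPU quantity such as "2", "1.5" or "500m" up to a whole CPU."""
    if quantity.endswith("m"):
        return math.ceil(int(quantity[:-1]) / 1000)
    return math.ceil(float(quantity))


def set_num_cpus_from_requests(group_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Set rayStartParams["num-cpus"] from requests.cpu when nothing else provides it."""
    params = group_spec.setdefault("rayStartParams", {})
    # Assumption: the Ray container is the first container in the pod template.
    container = group_spec["template"]["spec"]["containers"][0]
    resources = container.get("resources", {})
    has_cpu_limit = "cpu" in resources.get("limits", {})
    cpu_request = resources.get("requests", {}).get("cpu")
    if "num-cpus" not in params and not has_cpu_limit and cpu_request is not None:
        # rayStartParams values are strings.
        params["num-cpus"] = str(cpu_quantity_to_int(str(cpu_request)))
    return group_spec


if __name__ == "__main__":
    # Resources copied from the head spec in this issue.
    head_group_spec = {
        "rayStartParams": {},
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ray-head",
                        "resources": {
                            "limits": {"memory": "4Gi"},
                            "requests": {"cpu": "2", "memory": "4Gi"},
                        },
                    }
                ]
            }
        },
    }
    print(set_num_cpus_from_requests(head_group_spec)["rayStartParams"])
```

An explicit `num-cpus` or an existing CPU limit is left untouched, so the helper only changes specs that would otherwise hit the error above.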

cc @ryanaoleary @kevin85421
