
[Bug] libcudart.so.12: cannot open shared object file: No such file or directory #2584

githust66 opened this issue Dec 26, 2024 · 8 comments

@githust66

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

There are no CUDA-related libraries in the ROCm environment, yet sglang 0.4.1 reports the error below, while 0.4.0 and earlier versions do not.

error info:
ImportError: [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
2024-12-26 10:14:09,664 xinference.api.restful_api 4247 ERROR [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1041, in launch_builtin_model
await _launch_model()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1005, in _launch_model
await _launch_one_model(rep_model_uid)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/supervisor.py", line 984, in _launch_one_model
await worker_ref.launch_builtin_model(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/utils.py", line 90, in wrapped
ret = await func(*args, **kwargs)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/worker.py", line 897, in launch_builtin_model
await model_ref.load()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/core/model.py", line 414, in load
self._model.load()
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/xinference/model/llm/sglang/core.py", line 135, in load
self._engine = sgl.Runtime(
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/api.py", line 39, in Runtime
from sglang.srt.server import Runtime
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/server.py", line 47, in
from sglang.srt.managers.data_parallel_controller import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/data_parallel_controller.py", line 25, in
from sglang.srt.managers.io_struct import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/io_struct.py", line 24, in
from sglang.srt.managers.schedule_batch import BaseFinishReason
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/managers/schedule_batch.py", line 40, in
from sglang.srt.configs.model_config import ModelConfig
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/configs/model_config.py", line 24, in
from sglang.srt.layers.quantization import QUANTIZATION_METHODS
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/quantization/init.py", line 25, in
from sglang.srt.layers.quantization.fp8 import Fp8Config
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/quantization/fp8.py", line 31, in
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import padding_size
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/init.py", line 4, in
import sglang.srt.layers.moe.fused_moe_triton.fused_moe # noqa
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 14, in
from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sgl_kernel/init.py", line 1, in
from .ops import (
File "/root/miniconda3/envs/xinf/lib/python3.10/site-packages/sgl_kernel/ops/init.py", line 1, in
from .custom_reduce_cuda import all_reduce as _all_reduce
ImportError: [address=0.0.0.0:39501, pid=13418] libcudart.so.12: cannot open shared object file: No such file or directory
2024-12-26 10:14:09,665 uvicorn.access 4247 INFO 127.0.0.1:47452 - "POST /v1/models HTTP/1.1" 500
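
For reference, the failure can be reproduced outside of xinference. A minimal check like the sketch below (plain Python, nothing xinference-specific) shows whether the prebuilt sgl_kernel package is the component that links against libcudart.so.12 on this ROCm-only machine.

    # Minimal check, independent of xinference: importing sgl_kernel by itself
    # triggers the same error on a ROCm-only machine, which points at the
    # prebuilt CUDA kernels rather than at sglang's pure-Python code.
    try:
        import sgl_kernel
    except ImportError as e:
        print(f"sgl_kernel failed to load: {e}")  # libcudart.so.12: cannot open ...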

Reproduction

Launch qwen2.5-instruct-7B through xinference with the sglang engine on a ROCm machine (see the traceback above).

Environment

(xinf) root@DESKTOP-ESRGKIB:~# python -m sglang.check_env
2024-12-26 10:26:55.241465: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: FFT
2024-12-26 10:26:56.971386: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-26 10:26:57.864187: E external/local_xla/xla/stream_executor/plugin_registry.cc:91] Invalid plugin kind specified: DNN
WARNING 12-26 10:27:06 rocm.py:31] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
ROCM available: True
GPU 0: AMD Radeon RX 7900 XT
GPU 0 Compute Capability: 11.0
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version:
PyTorch: 2.4.0+rocm6.3.0
sglang: 0.4.1
flashinfer: Module Not Found
triton: 3.0.0+rocm6.3.0_75cc27c26a
transformers: 4.46.2
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.10.10
fastapi: 0.115.4
hf_transfer: 0.1.8
huggingface_hub: 0.26.5
interegular: 0.3.3
modelscope: 1.19.2
orjson: 3.10.11
packaging: 24.1
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.12
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.6.dev44+gc2d1b075.d20241221
openai: 1.54.1
anthropic: 0.39.0
decord: 0.6.0
AMD Topology:

Hypervisor vendor: Microsoft
ulimit soft: 1024

@zhyncs
Member

zhyncs commented Dec 26, 2024

Hi @githust66, could you help verify #2590?

@githust66
Author

Hi @githust66, could you help verify #2590?

OK. Does that mean pulling the latest code and building from source?

@zhyncs
Member

zhyncs commented Dec 26, 2024

Nope. You only need to change the Python code.
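
For context, a fix of that kind typically makes the CUDA-only import optional. The sketch below is illustrative only and may not match what #2590 actually changes; the import line is the one shown in the traceback above (fused_moe.py).

    # Illustrative sketch only -- not necessarily what #2590 does.
    # Idea: make the CUDA-only sgl_kernel import optional so that a ROCm
    # install of sglang does not fail at import time.
    try:
        from sgl_kernel import moe_align_block_size as sgl_moe_align_block_size
    except ImportError:
        # Callers must handle the None case (e.g. skip the optimized kernel on ROCm).
        sgl_moe_align_block_size = None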

@githust66
Author

Nope. You only need to change the Python code.

OK, I'll give it a try.

@githust66
Author

Nope. You only need to change the Python code.

I modified the following two files, but I still get the same error; the first file cannot be found in my environment.
[screenshot attached]

@zhyncs
Member

zhyncs commented Dec 26, 2024

python3 -c "import sgl_kernel; print(sgl_kernel.__path__)"
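
(If the import succeeds, this prints where the sgl_kernel package is installed, i.e. the directory containing the __init__.py files referenced in the traceback above.)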

@githust66
Author

githust66 commented Dec 26, 2024

python3 -c "import sgl_kernel; print(sgl_kernel.__path__)"

[screenshots attached]

These two __init__.py files have been changed.

@zhyncs
Member

zhyncs commented Dec 26, 2024

Fixed with #2601.
