Unkown compute for card nvidia-a100-80gb-pcie #2822
Comments
We have a similar warning with H100, which does not match the names listed in text-generation-inference/launcher/src/main.rs (line 1671 in 82c24f7).
@marceljahnke same issue here: Unkown compute for card nvidia-a100-80gb-pcie
System Info
When trying the new TGI v3.0.0 on an AzureML Compute Standard_NC24ads_A100_v4 (https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nca100v4-series?tabs=sizebasic), TGI is not able to detect the GPU properly.
The name doesn't match any of the ones listed for A100 cards in text-generation-inference/launcher/src/main.rs (line 1655 in 82c24f7).
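For illustration, here is a minimal sketch (not the actual launcher code) of the kind of card-name to peak-FLOP lookup that appears to happen around that line; the names and values are assumptions, but they show how the PCIe spelling of the A100 can fall through to the warning branch:

```rust
// Hypothetical sketch of a card-name -> peak-FLOP lookup used for sizing
// defaults. Names and numbers are illustrative only, not the real table
// in launcher/src/main.rs.
fn card_flop(card: &str) -> Option<u64> {
    // nvidia-smi product names appear lowercased with spaces turned into
    // dashes, e.g. "NVIDIA A100 80GB PCIe" -> "nvidia-a100-80gb-pcie".
    match card {
        // Assume only the SXM spellings are covered.
        "nvidia-a100-sxm4-40gb" | "nvidia-a100-sxm4-80gb" => Some(312_000_000_000_000),
        // The PCIe variant reported by the Standard_NC24ads_A100_v4 VM falls
        // through here, which would produce the
        // "Unkown compute for card nvidia-a100-80gb-pcie" warning.
        _ => None,
    }
}

fn main() {
    assert!(card_flop("nvidia-a100-sxm4-80gb").is_some());
    assert!(card_flop("nvidia-a100-80gb-pcie").is_none()); // warning path
}
```

If that is indeed what happens, adding the PCIe spellings (for A100 and H100) to the table would address the warning, although the warning alone may not explain the later CUDA illegal-memory-access error.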
Trying to deploy Qwen2.5-Coder-7B in full precision, I got an error during the Prefill method. TGI v2.4.0 works without issues; v2.4.1 also fails (it already has some auto values, I think).
Complete stack trace without setting any values in the config/envs:
Instance status:
SystemSetup: Succeeded
UserContainerImagePull: Succeeded
ModelDownload: Succeeded
UserContainerStart: InProgress
Container events:
Kind: Pod, Name: Downloading, Type: Normal, Time: 2024-12-10T16:29:38.664423Z, Message: Start downloading models
Kind: Pod, Name: Pulling, Type: Normal, Time: 2024-12-10T16:29:39.436264Z, Message: Start pulling container image
Kind: Pod, Name: Pulled, Type: Normal, Time: 2024-12-10T16:33:13.545118Z, Message: Container image is pulled successfully
Kind: Pod, Name: Downloaded, Type: Normal, Time: 2024-12-10T16:33:13.545118Z, Message: Models are downloaded successfully
Kind: Pod, Name: Created, Type: Normal, Time: 2024-12-10T16:33:13.570232Z, Message: Created container inference-server
Kind: Pod, Name: Started, Type: Normal, Time: 2024-12-10T16:33:13.772813Z, Message: Started container inference-server
Kind: Pod, Name: ReadinessProbeFailed, Type: Warning, Time: 2024-12-10T16:34:22.215502Z, Message: Readiness probe failed: HTTP probe failed with statuscode: 503
Container logs:
2024-12-10T16:33:13.791677Z  INFO text_generation_launcher: Args {
model_id: "/var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: true,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "mir-user-pod-2233f29894a94ca084027bb52cf77929000000",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2024-12-10T16:33:15.143210Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-12-10T16:33:15.180035Z  WARN text_generation_launcher: Unkown compute for card nvidia-a100-80gb-pcie
2024-12-10T16:33:15.217613Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2024-12-10T16:33:15.217633Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-10T16:33:15.217638Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir` do not contain malicious code.
2024-12-10T16:33:15.217908Z  INFO download: text_generation_launcher: Starting check and download process for /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:17.736806Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-10T16:33:18.129901Z  INFO download: text_generation_launcher: Successfully downloaded weights for /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:18.130146Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-10T16:33:20.664399Z  INFO text_generation_launcher: Using prefix caching = True
2024-12-10T16:33:20.664453Z  INFO text_generation_launcher: Using Attention = flashinfer
2024-12-10T16:33:28.150232Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-12-10T16:33:31.029573Z  INFO text_generation_launcher: Using prefill chunking = True
2024-12-10T16:33:31.597225Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-10T16:33:31.655611Z  INFO shard-manager: text_generation_launcher: Shard ready in 13.518157709s rank=0
2024-12-10T16:33:31.743685Z  INFO text_generation_launcher: Starting Webserver
2024-12-10T16:33:31.815698Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-10T16:33:31.836887Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-10T16:33:33.621609Z  INFO text_generation_launcher: KV-cache blocks: 1110376, size: 1
2024-12-10T16:33:33.634080Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-12-10T16:33:34.610076Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 1110376
2024-12-10T16:33:34.610095Z  WARN text_generation_router_v3::backend: backends/v3/src/backend.rs:39: Model supports prefill chunking. `waiting_served_ratio` and `max_waiting_tokens` will be ignored.
2024-12-10T16:33:34.610112Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-12-10T16:33:34.610116Z  INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 32767
2024-12-10T16:33:34.610119Z  INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 32768
2024-12-10T16:33:36.853347Z  INFO text_generation_router::server: router/src/server.rs:1873: Using config Some(Qwen2)
2024-12-10T16:33:36.979425Z  WARN text_generation_router::server: router/src/server.rs:1913: no pipeline tag found for model /var/azureml-app/azureml-models/Qwen2_5-Coder-7B-BC/2/output_dir
2024-12-10T16:33:36.979442Z  WARN text_generation_router::server: router/src/server.rs:2015: Invalid hostname, defaulting to 0.0.0.0
2024-12-10T16:33:37.033972Z  INFO text_generation_router::server: router/src/server.rs:2402: Connected
2024-12-10T16:34:22.215030Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
2024-12-10 16:33:19.507 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs)
CUDA Error: an illegal memory access was encountered (700) /tmp/build-via-sdist-fmqwe4he/flashinfer-0.1.6+cu124torch2.4/include/flashinfer/attention/prefill.cuh: line 2370 at function cudaLaunchKernel((void)kernel, nblks, nthrs, args, smem_size, stream) rank=0
2024-12-10T16:34:23.167755Z ERROR text_generation_launcher: Shard 0 crashed
2024-12-10T16:34:23.167784Z  INFO text_generation_launcher: Terminating webserver
2024-12-10T16:34:23.167801Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-12-10T16:34:23.167952Z  INFO text_generation_router::server: router/src/server.rs:2494: signal received, starting graceful shutdown
2024-12-10T16:34:23.868638Z  INFO text_generation_launcher: webserver terminated
2024-12-10T16:34:23.868660Z  INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
Information
Tasks
Reproduction
none
Expected behavior
Card gets detected as an A100.
No CUDA error.