System Info
I tried to serve llama3.1-8b using TGI on an A10 (24 GB) with a 4k context length.
Command:
docker run --gpus all -it --rm -p 8000:80 ghcr.io/huggingface/text-generation-inference:3.0.0 --model-id NousResearch/Meta-Llama-3.1-8B-Instruct --max-total-tokens 4096 --dtype bfloat16
The same command works with image ghcr.io/huggingface/text-generation-inference:2.2.0, but with 3.0.0 I got the following error:
2024-12-10T21:24:12.674619Z INFO text_generation_launcher: Starting Webserver
2024-12-10T21:24:12.849356Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-10T21:25:42.531534Z ERROR warmup{max_input_length=None max_prefill_tokens=8192 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2024-12-10T21:25:42.679824Z ERROR text_generation_launcher: Webserver Crashed
2024-12-10T21:25:42.684321Z INFO text_generation_launcher: Shutting down shards
2024-12-10T21:25:42.698301Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-12-10 21:23:52.620 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs) rank=0
2024-12-10T21:25:42.700830Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0
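A shard being killed with signal 9 during warmup usually means the process was OOM-killed rather than crashing on its own; note the warmup log reports max_prefill_tokens=8192, twice the requested --max-total-tokens. As a quick check (this is a diagnostic sketch, assuming a host-side OOM kill rather than a GPU error), the kernel log should show the kill:

# Look for an OOM-killer record mentioning the shard process (assumption: host OOM).
sudo dmesg --ctime | grep -i -E "out of memory|oom-kill|killed process"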
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
docker run --gpus all -it --rm -p 8000:80 ghcr.io/huggingface/text-generation-inference:3.0.0 --model-id NousResearch/Meta-Llama-3.1-8B-Instruct --max-total-tokens 4096 --dtype bfloat16
Expected behavior
Should serve the model successfully
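If warmup memory pressure is the cause, one hedged workaround is to cap the prefill batch explicitly and give the container more shared memory. This is a sketch to try, not a confirmed fix; the chosen values (prefill capped to match --max-total-tokens, 1 GB shm) are assumptions:

docker run --gpus all -it --rm --shm-size 1g -p 8000:80 \
  ghcr.io/huggingface/text-generation-inference:3.0.0 \
  --model-id NousResearch/Meta-Llama-3.1-8B-Instruct \
  --max-total-tokens 4096 \
  --max-batch-prefill-tokens 4096 \
  --dtype bfloat16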