System Info
I tried to serve llama3.1-8b using TGI on an A10 (24 GB) with a 4k context length.
Command:
docker run --gpus all -it --rm -p 8000:80 ghcr.io/huggingface/text-generation-inference:3.0.0 --model-id NousResearch/Meta-Llama-3.1-8B-Instruct --max-total-tokens 4096 --dtype bfloat16
The same command works with image ghcr.io/huggingface/text-generation-inference:2.2.0, but with 3.0.0 I got the following error:
2024-12-10T21:24:12.674619Z INFO text_generation_launcher: Starting Webserver
2024-12-10T21:24:12.849356Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-10T21:25:42.531534Z ERROR warmup{max_input_length=None max_prefill_tokens=8192 max_total_tokens=Some(4096) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2024-12-10T21:25:42.679824Z ERROR text_generation_launcher: Webserver Crashed
2024-12-10T21:25:42.684321Z INFO text_generation_launcher: Shutting down shards
2024-12-10T21:25:42.698301Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-12-10 21:23:52.620 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs) rank=0
2024-12-10T21:25:42.700830Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0
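A shard being killed with signal 9 during warmup usually means the process was OOM-killed rather than crashing on its own; note the warmup log reports max_prefill_tokens=8192, twice the requested --max-total-tokens. As a quick check (this is a diagnostic sketch, assuming a host-side OOM kill rather than a GPU error), the kernel log should show the kill:

# Look for an OOM-killer record mentioning the shard process (assumption: host OOM).
sudo dmesg --ctime | grep -i -E "out of memory|oom-kill|killed process"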
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
docker run --gpus all -it --rm -p 8000:80 ghcr.io/huggingface/text-generation-inference:3.0.0 --model-id NousResearch/Meta-Llama-3.1-8B-Instruct --max-total-tokens 4096 --dtype bfloat16
Expected behavior
Should serve the model successfully
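If warmup memory pressure is the cause, one hedged workaround is to cap the prefill batch explicitly and give the container more shared memory. This is a sketch to try, not a confirmed fix; the chosen values (prefill capped to match --max-total-tokens, 1 GB shm) are assumptions:

docker run --gpus all -it --rm --shm-size 1g -p 8000:80 \
  ghcr.io/huggingface/text-generation-inference:3.0.0 \
  --model-id NousResearch/Meta-Llama-3.1-8B-Instruct \
  --max-total-tokens 4096 \
  --max-batch-prefill-tokens 4096 \
  --dtype bfloat16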