
[Bug] Cannot run bitsandbytes llama models #2600

Closed
merrymercy opened this issue Dec 26, 2024 · 3 comments
Labels
good first issue, help wanted

Comments

@merrymercy
Contributor

merrymercy commented Dec 26, 2024

The issue is the same as #2556, but for llama models. We should be able to fix it with a similar approach.

The following command crashes.

python3 -m sglang.bench_one_batch --model unsloth/llama-3-8b-bnb-4bit --load-format bitsandbytes
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 470, in <module>
[rank0]:     main(server_args, bench_args)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 434, in main
[rank0]:     work_func(server_args, port_args, bench_args, 0)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 369, in latency_test
[rank0]:     model_runner, tokenizer = load_model(server_args, port_args, tp_rank)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 121, in load_model
[rank0]:     model_runner = ModelRunner(
[rank0]:   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 158, in __init__
[rank0]:     self.load_model()
[rank0]:   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 258, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
[rank0]:     return loader.load_model(
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/loader.py", line 1029, in load_model
[rank0]:     self._load_weights(model_config, model)
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/loader.py", line 960, in _load_weights
[rank0]:     model.load_weights(qweight_iterator)
[rank0]:   File "/root/sglang/python/sglang/srt/models/llama.py", line 442, in load_weights
[rank0]:     param = params_dict[name]
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.qweight'
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:01<?, ?it/s]
@merrymercy merrymercy added the good first issue Good for newcomers label Dec 26, 2024
@merrymercy merrymercy added the help wanted Extra attention is needed label Dec 26, 2024
@upskyy
Contributor

upskyy commented Dec 27, 2024

@merrymercy
Looking at the code, I think this issue will also be resolved by PR #2557. The problem occurs because the weight names are changed to qweight during the bitsandbytes 4-bit load; PR #2557 corrects this not only for the gemma model but for the loading of other models as well.
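
For reference, a minimal sketch of the idea, assuming the fix amounts to remapping checkpoint names: the bitsandbytes 4-bit loader yields names ending in .qweight, while the llama model registers its parameters under .weight, so load_weights needs to translate the name before the params_dict[name] lookup. The helper name below is hypothetical, not code from #2557.

# Hypothetical sketch (not the actual #2557 diff): remap the ".qweight"
# names produced by the bitsandbytes weight iterator back to the ".weight"
# parameter names that the llama model registers, so the
# params_dict[name] lookup no longer raises KeyError.
from typing import Dict


def remap_bnb_weight_name(name: str, params_dict: Dict[str, object]) -> str:
    """Return a parameter name that exists in params_dict.

    'model.layers.0.mlp.down_proj.qweight'
        -> 'model.layers.0.mlp.down_proj.weight'
    Names that already match are returned unchanged.
    """
    if name in params_dict:
        return name
    if name.endswith(".qweight"):
        candidate = name[: -len(".qweight")] + ".weight"
        if candidate in params_dict:
            return candidate
    return name


# Minimal demonstration with a stand-in params_dict.
params_dict = {"model.layers.0.mlp.down_proj.weight": None}
print(remap_bnb_weight_name("model.layers.0.mlp.down_proj.qweight", params_dict))
# -> model.layers.0.mlp.down_proj.weight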

@bkodes

bkodes commented Dec 29, 2024

Did #2557 resolve this? @upskyy

@merrymercy
Contributor Author

Yes, it is solved! Thanks for the fix @upskyy. It turns out I was using some old code.
