
Issue with 4-bit Quantization for LLaVA-NeXT-Video-32B Model on A100-40GB GPU #1791

Rachel0901 opened this issue Dec 7, 2024 · 0 comments
Describe the issue

Hello, I am trying to run the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model on an A100-40GB GPU. However, I hit an out-of-memory (OOM) error when loading the model in its default configuration. To work around this, I attempted to enable 4-bit quantization via the bitsandbytes library by modifying my script as follows:

from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/LLaVA-NeXT-Video-32B-Qwen"
model_name = "llava_qwen"
device_map = "auto"

# Load the model with proper configuration
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    load_in_8bit=False,  # Ensure 8-bit quantization is disabled
    load_in_4bit=True    # Enable 4-bit quantization
)
model.eval()

When I run this script, I get the following error:

ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models.
Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

Could you clarify how to properly enable 4-bit quantization for the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model in Python scripts?
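
For reference, here is an untested variant I am considering. It assumes that load_pretrained_model forwards extra keyword arguments (such as quantization_config and device_map) on to the underlying from_pretrained call; if that assumption is wrong, please point me to the supported way of passing a bitsandbytes configuration:

import torch
from transformers import BitsAndBytesConfig
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/LLaVA-NeXT-Video-32B-Qwen"
model_name = "llava_qwen"

# Hypothetical: wrap the 4-bit settings in an explicit BitsAndBytesConfig
# instead of passing the load_in_4bit flag directly
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map="auto",               # let accelerate place the quantized weights
    quantization_config=bnb_config,  # assumed to be forwarded to from_pretrained
)
model.eval()  # no .to()/.cuda() calls afterwards on a bitsandbytes-quantized model

The only difference from my script above is that the 4-bit settings are wrapped in a BitsAndBytesConfig and device_map is passed explicitly, so no .to() call should be needed after loading.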
