[New Model]: QVQ-72B-Preview #11479

Open
ZB052-A opened this issue Dec 25, 2024 · 10 comments
Labels
new model Requests to new models

Comments

@ZB052-A

ZB052-A commented Dec 25, 2024

The model to consider.

https://huggingface.co/Qwen/QVQ-72B-Preview

The closest model vllm already supports.

Qwen2-VL-72B or Qwen2-VL-72B-Instruct

What's your difficulty of supporting the model you want?

At present, it seems that the model only supports single-round dialogue and images, and does not support video input. Hopefully the latest version can at least run inference on this model, thanks!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
ZB052-A added the new model label Dec 25, 2024
@DarkLight1337
Member

DarkLight1337 commented Dec 25, 2024

This model uses the same architecture as regular Qwen2-VL, so it should be supported already in vLLM. Just try it!
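
For example, single-image inference should look roughly like this (a minimal sketch; the image path, GPU count, and hand-written chat prompt are placeholders, and using the tokenizer's chat template is the more robust route):

from PIL import Image
from vllm import LLM, SamplingParams

# Placeholder image; replace with a real file.
image = Image.open("example.jpg")

# 72B weights won't fit on a single GPU; adjust tensor_parallel_size to your hardware.
llm = LLM(model="Qwen/QVQ-72B-Preview", tensor_parallel_size=4)

# Qwen2-VL-style chat prompt with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)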

@DarkLight1337
Member

Video input is already supported - how are you passing them to the model?
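
For reference, passing a video usually looks something like this (a minimal sketch; the random frames and the 2B model are placeholders):

import numpy as np
from vllm import LLM, SamplingParams

# Placeholder clip: 16 RGB frames; replace with frames decoded from a real video.
video = np.random.randint(0, 255, size=(16, 256, 256, 3), dtype=np.uint8)

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")

# Qwen2-VL-style chat prompt with a single video placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|video_pad|><|vision_end|>Describe this video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)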

@DarkLight1337
Member

DarkLight1337 commented Dec 25, 2024

I think there is currently some bug with video processing in transformers:

import numpy as np
from transformers import AutoProcessor

# The processor fails when num_frames = 3, 5, 7, ...
num_frames = 3
video = np.random.randint(0, 255, size=(num_frames, 256, 256, 3), dtype=np.uint8)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor(text="<|vision_start|><|video_pad|><|vision_end|>", videos=[video])

Let me report this issue to them.

@DarkLight1337
Member

Opened huggingface/transformers#35412

@ZB052-A
Author

ZB052-A commented Dec 25, 2024

Video input is already supported - how are you passing them to the model?

I hadn't tested video input before, but thank you for the reply!
I'll take the time to run a test with video input.

@ZB052-A
Author

ZB052-A commented Dec 25, 2024

Well, the official documentation doesn't seem to mention video input support, which is why I hadn't tested it.
My bad.

@Tian14267

Tian14267 commented Dec 27, 2024

@ZB052-A Hello, can you tell me how to load this model with vLLM? I get torch.OutOfMemoryError: CUDA out of memory when loading QVQ-72B-Preview or Qwen2-VL-72B-Instruct:
https://github.com/vllm-project/vllm/issues/11560

@DarkLight1337
Member

DarkLight1337 commented Dec 27, 2024

Are you able to run other 72B models with your setup? Maybe it's just that you don't have enough GPU memory. You can use tensor parallelism (-tp option in CLI) to split up the memory usage across GPUs.
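
A minimal sketch (the GPU count here is just an example):

from vllm import LLM

# Split the 72B weights across 4 GPUs; CLI equivalent:
#   vllm serve Qwen/QVQ-72B-Preview --tensor-parallel-size 4
llm = LLM(model="Qwen/QVQ-72B-Preview", tensor_parallel_size=4)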

@monkeywl2020

I think there is currently some bug with video processing in transformers: [...] Let me report this issue to them.

From the QVQ-72B-Preview model card:

QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark, showcasing QVQ's powerful ability in multidisciplinary understanding and reasoning. Furthermore, the significant improvements on MathVision highlight the model's progress in mathematical reasoning tasks. OlympiadBench also demonstrates the model's enhanced ability to tackle challenging problems.

But It's Not All Perfect: Acknowledging the Limitations

While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it’s important to acknowledge several limitations:

  • Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
  • Recursive Reasoning Loops: There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
  • Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
  • Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ doesn't entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn't show significant improvement over Qwen2-VL-72B in basic recognition tasks like identifying people, animals, or plants.

Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.

QVQ does not support video inputs.

@DarkLight1337
Member

QVQ does not support video inputs.

Oh, I missed that part, thanks for bringing this up! Nevertheless, this is still an issue for regular Qwen2-VL.
