[New Model]: QVQ-72B-Preview #11479

Open
ZB052-A opened this issue Dec 25, 2024 · 10 comments
Labels
new model Requests to new models

Comments

@ZB052-A

ZB052-A commented Dec 25, 2024

The model to consider.

https://huggingface.co/Qwen/QVQ-72B-Preview

The closest model vllm already supports.

Qwen2-VL-72B or Qwen2-VL-72B-Instruct

What's your difficulty of supporting the model you want?

At present, it seems that the model only supports single-round dialogue and images, and does not support video input. Hopefully the latest version can at least run inference on this model, thanks!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
ZB052-A added the new model label Dec 25, 2024
@DarkLight1337
Member

DarkLight1337 commented Dec 25, 2024

This model uses the same architecture as regular Qwen2-VL, so it should be supported already in vLLM. Just try it!
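
For example, single-image inference should look roughly like this (a minimal sketch; the image path, GPU count, and hand-written chat prompt are placeholders, and using the tokenizer's chat template is the more robust route):

from PIL import Image
from vllm import LLM, SamplingParams

# Placeholder image; replace with a real file.
image = Image.open("example.jpg")

# 72B weights won't fit on a single GPU; adjust tensor_parallel_size to your hardware.
llm = LLM(model="Qwen/QVQ-72B-Preview", tensor_parallel_size=4)

# Qwen2-VL-style chat prompt with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)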

@DarkLight1337
Member

Video input is already supported - how are you passing them to the model?
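
For reference, passing a video usually looks something like this (a minimal sketch; the random frames and the 2B model are placeholders):

import numpy as np
from vllm import LLM, SamplingParams

# Placeholder clip: 16 RGB frames; replace with frames decoded from a real video.
video = np.random.randint(0, 255, size=(16, 256, 256, 3), dtype=np.uint8)

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")

# Qwen2-VL-style chat prompt with a single video placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|video_pad|><|vision_end|>Describe this video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)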

@DarkLight1337
Member

DarkLight1337 commented Dec 25, 2024

I think there is currently some bug with video processing in transformers:

import numpy as np
from transformers import AutoProcessor

# The processor fails when num_frames = 3, 5, 7, ...
num_frames = 3
video = np.random.randint(0, 255, size=(num_frames, 256, 256, 3), dtype=np.uint8)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor(text="<|vision_start|><|video_pad|><|vision_end|>", videos=[video])

Let me report this issue to them.

@DarkLight1337
Member

Opened huggingface/transformers#35412

@ZB052-A
Author

ZB052-A commented Dec 25, 2024

Video input is already supported - how are you passing them to the model?

I hadn't tested video input before, but thank you for the reply!
I'll take the time to run a test with video input.

@ZB052-A
Author

ZB052-A commented Dec 25, 2024

Well, the official documentation doesn't seem to mention video input support, which is why I hadn't tested it.
My bad.

@Tian14267

Tian14267 commented Dec 27, 2024

@ZB052-A Hello, can you tell me how to load this model with vLLM? I get torch.OutOfMemoryError: CUDA out of memory when loading QVQ-72B-Preview or Qwen2-VL-72B-Instruct:
https://github.com/vllm-project/vllm/issues/11560

@DarkLight1337
Member

DarkLight1337 commented Dec 27, 2024

Are you able to run other 72B models with your setup? Maybe it's just that you don't have enough GPU memory. You can use tensor parallelism (-tp option in CLI) to split up the memory usage across GPUs.
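
A minimal sketch (the GPU count here is just an example):

from vllm import LLM

# Split the 72B weights across 4 GPUs; CLI equivalent:
#   vllm serve Qwen/QVQ-72B-Preview --tensor-parallel-size 4
llm = LLM(model="Qwen/QVQ-72B-Preview", tensor_parallel_size=4)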

@monkeywl2020

I think there is currently some bug with video processing in transformers: [...] Let me report this issue to them.

From the QVQ-72B-Preview model card:

QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark, showcasing QVQ's powerful ability in multidisciplinary understanding and reasoning. Furthermore, the significant improvements on MathVision highlight the model's progress in mathematical reasoning tasks. OlympiadBench also demonstrates the model's enhanced ability to tackle challenging problems.

But It's Not All Perfect: Acknowledging the Limitations

While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it’s important to acknowledge several limitations:

  • Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
  • Recursive Reasoning Loops: There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
  • Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
  • Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ doesn't entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn't show significant improvement over Qwen2-VL-72B in basic recognition tasks like identifying people, animals, or plants.

Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.

QVQ does not support video inputs.

@DarkLight1337
Member

QVQ does not support video inputs.

Oh, I missed that part, thanks for bringing this up! Nevertheless, this is still an issue for regular Qwen2-VL.
