Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
When using text-only inputs with the LLaVA 1.5 and 1.6 model families, we get an error. I think this issue has already been brought up here (bug), but the error reported there is different. This discussion is also relevant. Minimal code to reproduce:
import torch
import requests
from PIL import Image
from transformers import (
    AutoModelForVision2Seq,
    AutoProcessor,
)

MODEL_ID = "llava-hf/llava-v1.6-vicuna-7b-hf"  # or "llava-hf/llava-1.5-7b-hf"

model = AutoModelForVision2Seq.from_pretrained(MODEL_ID).to(0, torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# One image sample and one text-only sample in the same batch;
# the text-only sample has no image, hence the None placeholder.
inputs = processor(
    images=[raw_image, None],
    text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
    padding=True,
    return_tensors="pt",
).to(0, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=False))
The error we get:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.31it/s]
Traceback (most recent call last):
  File "/mnt/llmdata/home/gbonetta/progetti/kimera/test_kimera_checkpoint.py", line 25, in <module>
    inputs = processor(images=[raw_image, None], text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"], padding=True, return_tensors='pt').to(0, torch.bfloat16)
  File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/models/llava_next/processing_llava_next.py", line 133, in __call__
    images, text = _validate_images_text_input_order(images, text)
  File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/processing_utils.py", line 1205, in _validate_images_text_input_order
    raise ValueError("Invalid input type. Check that `images` and/or `text` are valid inputs.")
ValueError: Invalid input type. Check that `images` and/or `text` are valid inputs.
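Note that the failure happens before the model is ever called: judging from the traceback, _validate_images_text_input_order raises because a list containing None is neither valid images nor valid text. A plain text-only call with no images argument should not hit that check at all. A minimal sketch of that path, reusing the processor and model objects from the repro above (whether generate() then works without pixel_values in this version is part of what this issue is about):

# Sketch: the text-only sample processed on its own, with no `images`
# argument, so the None-in-list validation path is never reached.
text_inputs = processor(
    text="Do you think that 2+2 is equal to 4?",
    return_tensors="pt",
).to(0)
output = model.generate(**text_inputs, max_new_tokens=20, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))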
Expected behavior
The batch should be processed without errors: the model should use the image features for the samples that provide an image and run text-only for the rest.
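For reference, a possible workaround sketch. It rests on an unconfirmed assumption on my side: that the processor consumes the images list in order of the <image> placeholders in the batch, rather than pairing images[i] with text[i], so the None entry can simply be dropped:

# Possible workaround sketch (assumption, not confirmed): pass only the
# real image(s); the text-only prompt has no <image> placeholder, so no
# image should be consumed for it.
inputs = processor(
    images=[raw_image],
    text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
    padding=True,
    return_tensors="pt",
).to(0, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)

If that still fails, running the image sample and the text-only sample as two separate generate() calls avoids mixed batches altogether.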
System Info
- transformers version: 4.47.1
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2,3
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@zucchini-nlp, @amyeroberts