Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
When using text-only inputs with the LLaVA 1.5 and 1.6 model families, we get an error. I think this issue has already been brought up here (bug), but the error reported there is different. This discussion is also relevant. Minimal code to reproduce:
import torch
import requests
from PIL import Image
from transformers import (
    AutoModelForVision2Seq,
    AutoProcessor,
)

MODEL_ID = "llava-hf/llava-v1.6-vicuna-7b-hf"  # or "llava-hf/llava-1.5-7b-hf"

model = AutoModelForVision2Seq.from_pretrained(MODEL_ID).to(0, torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# One image sample and one text-only sample in the same batch;
# the text-only sample has no image, hence the None placeholder.
inputs = processor(
    images=[raw_image, None],
    text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
    padding=True,
    return_tensors="pt",
).to(0, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=False))
The error we get:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.31it/s]
Traceback (most recent call last):
  File "/mnt/llmdata/home/gbonetta/progetti/kimera/test_kimera_checkpoint.py", line 25, in <module>
    inputs = processor(images=[raw_image, None], text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"], padding=True, return_tensors='pt').to(0, torch.bfloat16)
  File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/models/llava_next/processing_llava_next.py", line 133, in __call__
    images, text = _validate_images_text_input_order(images, text)
  File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/processing_utils.py", line 1205, in _validate_images_text_input_order
    raise ValueError("Invalid input type. Check that `images` and/or `text` are valid inputs.")
ValueError: Invalid input type. Check that `images` and/or `text` are valid inputs.
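Note that the failure happens before the model is ever called: judging from the traceback, _validate_images_text_input_order raises because a list containing None is neither valid images nor valid text. A plain text-only call with no images argument should not hit that check at all. A minimal sketch of that path, reusing the processor and model objects from the repro above (whether generate() then works without pixel_values in this version is part of what this issue is about):

# Sketch: the text-only sample processed on its own, with no `images`
# argument, so the None-in-list validation path is never reached.
text_inputs = processor(
    text="Do you think that 2+2 is equal to 4?",
    return_tensors="pt",
).to(0)
output = model.generate(**text_inputs, max_new_tokens=20, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))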
Expected behavior
The batch should be processed without errors: the model should use the image features for the samples that provide an image and run text-only for the rest.
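For reference, a possible workaround sketch. It rests on an unconfirmed assumption on my side: that the processor consumes the images list in order of the <image> placeholders in the batch, rather than pairing images[i] with text[i], so the None entry can simply be dropped:

# Possible workaround sketch (assumption, not confirmed): pass only the
# real image(s); the text-only prompt has no <image> placeholder, so no
# image should be consumed for it.
inputs = processor(
    images=[raw_image],
    text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
    padding=True,
    return_tensors="pt",
).to(0, torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)

If that still fails, running the image sample and the text-only sample as two separate generate() calls avoids mixed batches altogether.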
System Info
- transformers version: 4.47.1
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2,3
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@zucchini-nlp, @amyeroberts