Adding notebook for Llava-OneVision on multi-image task #470
Hey @nicokossmann! Great, the training should be very similar to llava-next, yes. You can also use this library (https://github.com/zjysteven/lmms-finetune) for fine-tuning VLMs. Regarding the questions:
@zucchini-nlp Thanks for your quick response. Your feedback on the questions was extremely helpful. Regarding the second question, I based my approach on the provided notebook. We load the base model with the corresponding adapters for inference:

```python
import torch
from transformers import LlavaOnevisionForConditionalGeneration

# Load the base model with adapters on top
# (quantization_config is defined elsewhere in the notebook, e.g. a BitsAndBytesConfig)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "nicokossmann/Llava-OneVision-blink",
    torch_dtype=torch.float16,
    # torch_dtype=torch.float32,
    quantization_config=quantization_config,
)
```

However, if I use fp16, I get the error:
I also noticed that I made a spelling mistake, which means that I can no longer train the model because the input_ids have grown to a size of
Oh I see, the message is saying your inputs are in fp32, so you probably have to manually cast the input to fp16 in the data collation/preparation step. For the base image, noted, and I'll add it to my TODO list. If you want to give it a try yourself, please feel free to open a PR and tag me 😄
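For reference, a minimal sketch of what that manual cast could look like inside a collate function; `processor` and the `"text"`/`"images"` field names are placeholders for illustration, not the notebook's actual code:

```python
import torch

def collate_fn(examples):
    # processor and the "text"/"images" fields are assumed names for illustration
    batch = processor(
        text=[ex["text"] for ex in examples],
        images=[ex["images"] for ex in examples],
        padding=True,
        return_tensors="pt",
    )
    # Cast only the floating-point tensors (e.g. pixel_values) to fp16;
    # integer tensors such as input_ids / attention_mask keep their dtype.
    for key, value in batch.items():
        if torch.is_floating_point(value):
            batch[key] = value.to(torch.float16)
    return batch
```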
I believe this is a common issue with the base image in many models 😅 I am currently working with the Phi-3.5-vision-instruct model and have encountered the same issue. Despite being able to set the number of crops via a parameter, I consistently receive pixel_values of shape
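Roughly, this is how I check it (a minimal sketch; the image path and prompt are placeholders, and `num_crops` is the crop-count parameter mentioned above):

```python
from PIL import Image
from transformers import AutoProcessor

# num_crops is the crop-count parameter exposed by the Phi-3.5-vision processor
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    num_crops=4,
)

image = Image.open("example.jpg")  # placeholder image
inputs = processor("<|image_1|>\nDescribe the image.", [image], return_tensors="pt")
print(inputs["pixel_values"].shape)  # the shape I observe does not change with num_crops
```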
@nicokossmann I would say it depends on whether the model should support a base-image-only setting, because some models like llava-next are never tuned with only one image. If you want to tune Llava with more freedom over different parameters, I'd recommend using the official repo (LLaVA-VL), which allows setting any combination of params. It can later be converted to HF format for inference :) For Phi-3.5, if you believe the model should support base image only, feel free to open a discussion on the hub. Since the model is
I tried to fix the problem with the base image support, but I've gotten stuck on an error message that I can't resolve.
I have two base images. Do you have any idea what the error could be?
Hey @zucchini-nlp and @NielsRogge 👋,
I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK benchmark, based on the LLaVA-NeXT notebook.
This notebook could be helpful for other folks getting started with multi-image tasks with Llava-OneVision (a minimal sketch of the multi-image prompting pattern is below).
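For readers who just want the multi-image prompting pattern, here is a minimal sketch (the checkpoint is the public llava-hf 0.5b release; `image1`, `image2` and the question are placeholders, not taken from the notebook):

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two images interleaved with one question, as in BLINK-style multi-image tasks.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Which image matches the description?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# image1, image2: PIL images loaded elsewhere (placeholders)
inputs = processor(images=[image1, image2], text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```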
During the implementation, a few questions arose: