Issue submission checklist
I'm reporting an issue. It's not a question.
I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
There is reproducer code and related data files such as images, videos, models, etc.
This is a Windows system and I have the latest drivers. I might have made a mistake while filing the bug by choosing Ubuntu. Please read it as Windows as mentioned in the description.
@azhuvath we tested on an MTL system and see some issues with this model as well. We have captured this as a possible bug and will share more details as we have them.
Exception from src/plugins/intel_npu/src/compiler_adapter/src/ze_graph_ext_wrappers.cpp:389:
L0 pfnCreate2 result: ZE_RESULT_ERROR_UNKNOWN, code 0x7ffffffe - an action is required to complete the desired operation . Check 'min_val == max_val' failed at src/core/src/partial_shape.cpp:129:
get_shape() must be called on a static shape
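The `get_shape() must be called on a static shape` check suggests that at least one tensor still carries a dynamic dimension when the graph is created. A quick way to see which model inputs are affected is to look at their partial shapes. The helper below is only a sketch: `find_dynamic_inputs` is a hypothetical name, and it relies on `Model.inputs`, `get_partial_shape()`, `is_dynamic` and `get_any_name()` from the OpenVINO Python API.

```python
def find_dynamic_inputs(model):
    """Return {input name: partial shape as text} for inputs with dynamic dims.

    `model` is expected to be an openvino.Model, e.g. the result of
    ov.Core().read_model(...) or ov.convert_model(...).
    """
    dynamic = {}
    for port in model.inputs:
        partial_shape = port.get_partial_shape()
        # is_dynamic is True when any dimension (or the rank) is not fixed.
        if partial_shape.is_dynamic:
            dynamic[port.get_any_name()] = str(partial_shape)
    return dynamic
```

Assuming the IR saved by the reproducer, it could be called as `find_dynamic_inputs(ov.Core().read_model("clip-vit-large-patch14-fp32.xml"))`. An empty result would mean that all inputs are already static and the dynamic dimension comes from somewhere else.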
OpenVINO Version
2024.6
Operating System
Windows
Device used for inference
NPU
Framework
None
Model used
openai/clip-vit-large-patch14
Issue description
Model inference for openai/clip-vit-large-patch14 fails on the NPU of a Windows LNL (Lunar Lake) system. The following error is observed:
[ERROR] 05:26:28.301 [vpux-compiler] Got Diagnostic at loc(fused<{name = "__module.vision_model.embeddings.patch_embedding/aten::_convolution/Convolution", type = "Convolution"}>["__module.vision_model.embeddings.patch_embedding/aten::_convolution/Convolution"]) : Channels count of input tensor shape and filter shape must be the same: -9223372036854775808 != 3
loc(fused<{name = "__module.vision_model.embeddings.patch_embedding/aten::_convolution/Convolution", type = "Convolution"}>["__module.vision_model.embeddings.patch_embedding/aten::_convolution/Convolution"]): error: Channels count of input tensor shape and filter shape must be the same: -9223372036854775808 != 3
LLVM ERROR: Failed to infer result type(s).
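The channel count reported in the diagnostic is not a real size. It is exactly the smallest signed 64-bit integer, which MLIR-based compilers commonly use as the marker for a dynamic dimension. That reading is an interpretation of the log, not something the log states, and `looks_like_dynamic_dim` is a hypothetical helper used only to illustrate the check:

```python
# Smallest value representable in a signed 64-bit integer.
INT64_MIN = -(2 ** 63)

# Channel count copied from the vpux-compiler diagnostic.
reported_channels = -9223372036854775808


def looks_like_dynamic_dim(value: int) -> bool:
    """True when a reported dimension equals the assumed dynamic-dim sentinel."""
    return value == INT64_MIN


print(looks_like_dynamic_dim(reported_channels))  # True
print(looks_like_dynamic_dim(3))                  # False
```

If this interpretation holds, the convolution sees an input whose channel dimension is unknown at compile time, while the filter expects 3 channels.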
Step-by-step reproduction
Create Environment
python -m venv npu_env
./npu_env/Scripts/activate
python -m pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install pillow scikit-learn requests transformers openvino
Code to execute. To reproduce the failure, change the device from CPU to NPU (the script below already sets device = 'NPU').
import requests
import numpy as np
import openvino as ov
from scipy.special import softmax
from PIL import Image
from pathlib import Path
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
classes = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=classes, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
predicted_idx = probs.argmax().item()
print(classes[predicted_idx])
# Convert the PyTorch model to OpenVINO IR and save it without FP16 compression.
ov_model_path = "clip-vit-large-patch14-fp32.xml"
fp32_model_path = Path(ov_model_path)
model.config.torchscript = True
ov_model = ov.convert_model(model, example_input=dict(inputs))
ov.save_model(ov_model, fp32_model_path, compress_to_fp16=False)
# Compile the saved IR for the NPU (this is where the vpux-compiler runs).
device = 'NPU'
core = ov.Core()
compiled_model = core.compile_model(ov_model_path, device)
# Run inference with the same inputs and pick the most likely class.
inputs = dict(inputs)
outputs = compiled_model(inputs)[0]
probs = softmax(outputs, axis=1)
[predicted_idx] = np.argmax(probs, axis=1)
print(classes[predicted_idx])
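The errors reported in this issue point at a dimension that is not static when the model reaches the NPU compiler. One possible workaround, not confirmed anywhere in this thread, is to pin every input to the shape of the example batch before compiling. The sketch below assumes the IR input names match the keys of `inputs`, which is the usual result when `convert_model` receives a dict as `example_input`. `static_shapes` and `compile_static` are hypothetical helper names.

```python
def static_shapes(inputs):
    """Map each input name to the concrete shape of its example tensor."""
    return {
        name: [int(dim) for dim in tensor.shape]
        for name, tensor in inputs.items()
    }


def compile_static(model_path, inputs, device="NPU"):
    """Read the IR, reshape all inputs to static shapes, then compile it."""
    # Imported lazily so static_shapes() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    ov_model = core.read_model(model_path)
    # Model.reshape accepts a dict of input name -> shape.
    ov_model.reshape(static_shapes(inputs))
    return core.compile_model(ov_model, device)
```

With the reproducer above this would be called as `compiled_model = compile_static(ov_model_path, dict(inputs))`. The compiled model then only accepts that exact batch layout, so a different number of prompts or images would need another reshape and compile.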