
Issues after upgrading to 1.79 #1248

Open
Circl3s opened this issue Dec 2, 2024 · 7 comments


Circl3s commented Dec 2, 2024

I'm using a Ministral 8B Instruct Q4_K_M model fully offloaded onto an Arc A380 GPU with the Vulkan backend on Linux. Everything was working fine on KoboldCPP 1.76, except that for longer prompts I was getting a DeviceLost error, so I upgraded to 1.79.1 to see whether that bug had been fixed. Setting a lower BLAS batch size seems to have fixed it, but I ran into two other problems.

  1. There seems to have been an undocumented change to the API response format: my custom client fails to interpret the responses with an "Expected String but was Null" error (a defensive-parsing sketch follows at the end of this comment).
  2. Inference is MUCH slower (roughly 50% slower), probably related to the following message, which didn't appear in 1.76:
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
(This is not an error, it just means some tensors will use CPU instead.)

Any help with these would be greatly appreciated.
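
For reference, here is a minimal defensive-parsing sketch in Python. The endpoint, port, and field names are assumptions based on the KoboldAI-style /api/v1/generate API (not my actual client code), and the field that actually arrives as null hasn't been identified yet, so this just treats every value as optional:

```python
import requests  # third-party; pip install requests

# Assumptions (not stated in the report): the client talks to the KoboldAI-style
# /api/v1/generate endpoint on KoboldCpp's default port 5001, and the field that
# now comes back null lives somewhere inside the "results" list.
ENDPOINT = "http://localhost:5001/api/v1/generate"

def generate(prompt: str) -> str:
    """Send a prompt and return the generated text, tolerating null fields."""
    resp = requests.post(
        ENDPOINT,
        json={"prompt": prompt, "max_length": 200},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()

    # Defensive parsing: treat every field as optional instead of assuming it is
    # always a string -- this avoids the "Expected String but was Null" failure
    # when the server adds or nulls out fields between versions.
    results = data.get("results") or []
    pieces = [(item or {}).get("text") or "" for item in results]
    return "".join(pieces)

if __name__ == "__main__":
    print(generate("Hello from a defensive client!"))
```

The point is only to avoid assuming every value is a string; logging the raw response body when parsing fails would also reveal which field changed.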

@LostRuins
Owner

Which API format are you using? Could you list the field that was expected to be a string, but was null?

For inference speeds, that could be due to Vulkan kernel changes upstream and the backend refactor. It should have nothing to do with offloading; that message refers to only a single layer.

@AlexysLovesLexxie

> Which API format are you using? Could you list the field that was expected to be a string, but was null?
>
> For inference speeds, that could be due to Vulkan kernel changes upstream and the backend refactor. It should have nothing to do with offloading; that message refers to only a single layer.

Can confirm that this exact same issue (the token_embd.weight message) is happening on an Nvidia GeForce RTX 3060 12GB, so it doesn't seem to be Vulkan-exclusive. I've also noticed that the CPU is being used more than the GPU during response generation now, on all models I have tried, regardless of whether the model is partially offloaded or small enough to fit entirely into VRAM.


d0x360 commented Dec 10, 2024

Can also confirm, on an RTX 4090.


ATStUrNa commented Dec 11, 2024

I have had this issue since version koboldcpp-rocm-1.78.yr0-ROCm (7900 XTX + 2x 7600 XT).
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)

Version 1.77.yr1-ROCm was the last version without this behavior.
I also had to lower the BLAS batch size from 512 to 32 so that larger models work in the newer versions; otherwise I get an out-of-memory error.
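
For reference, the batch size can be set at launch; a sketch of the command line, assuming the usual KoboldCpp flag names (check --help on your build; the model path and layer count are placeholders):

```
python koboldcpp.py --model /path/to/model.gguf --gpulayers 99 --blasbatchsize 32
```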


thijsi123 commented Dec 12, 2024

Same here (using BLAS batch size 256) on an RTX 3090 + RTX 4070 Ti Super.
llm_load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)


henk717 commented Dec 19, 2024

token_embd.weight never offloads to the GPU; this is normal behavior. It's about the rest of the layers that do or don't offload.

@LostRuins
Owner

Please try v1.80; it should be less messy-looking now.
