
Issues after upgrading to 1.79 #1248

Open
Circl3s opened this issue Dec 2, 2024 · 7 comments


Circl3s commented Dec 2, 2024

I'm using a Ministral 8B Instruct Q4_K_M model fully offloaded onto an Arc A380 GPU with the Vulkan backend on Linux. Everything was working fine on KoboldCPP 1.76, except that for longer prompts I was getting a DeviceLost error, so I upgraded to 1.79.1 to see whether that bug had been fixed. Setting a lower BLAS batch size seems to have fixed it, but I ran into two other problems.

  1. There seems to have been an undocumented change to the API response format: my custom client fails to interpret the responses with an "Expected String but was Null" error (a defensive-parsing sketch follows at the end of this comment).
  2. Inference is MUCH slower (roughly 50% slower), probably related to the following message, which didn't appear in 1.76:
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type Vulkan_Host, using CPU instead
(This is not an error, it just means some tensors will use CPU instead.)

Any help with these would be greatly appreciated.
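
For reference, here is a minimal defensive-parsing sketch in Python. The endpoint, port, and field names are assumptions based on the KoboldAI-style /api/v1/generate API (not my actual client code), and the field that actually arrives as null hasn't been identified yet, so this just treats every value as optional:

```python
import requests  # third-party; pip install requests

# Assumptions (not stated in the report): the client talks to the KoboldAI-style
# /api/v1/generate endpoint on KoboldCpp's default port 5001, and the field that
# now comes back null lives somewhere inside the "results" list.
ENDPOINT = "http://localhost:5001/api/v1/generate"

def generate(prompt: str) -> str:
    """Send a prompt and return the generated text, tolerating null fields."""
    resp = requests.post(
        ENDPOINT,
        json={"prompt": prompt, "max_length": 200},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()

    # Defensive parsing: treat every field as optional instead of assuming it is
    # always a string -- this avoids the "Expected String but was Null" failure
    # when the server adds or nulls out fields between versions.
    results = data.get("results") or []
    pieces = [(item or {}).get("text") or "" for item in results]
    return "".join(pieces)

if __name__ == "__main__":
    print(generate("Hello from a defensive client!"))
```

The point is only to avoid assuming every value is a string; logging the raw response body when parsing fails would also reveal which field changed.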

@LostRuins
Owner

Which API format are you using? Could you list the field that was expected to be a string, but was null?

For inference speeds, that could be due to Vulkan kernel changes upstream and the backend refactor. It should have nothing to do with offloading; that message refers to only a single layer.

@AlexysLovesLexxie

> Which API format are you using? Could you list the field that was expected to be a string, but was null?
>
> For inference speeds, that could be due to Vulkan kernel changes upstream and the backend refactor. It should have nothing to do with offloading; that message refers to only a single layer.

Can confirm that this exact same issue (the token_embd.weight message) is happening on an Nvidia GeForce RTX 3060 12GB, so it doesn't seem to be Vulkan-exclusive. I've also noticed that the CPU is being used more than the GPU during response generation now, on all models I have tried, regardless of whether the model is partially offloaded or small enough to fit entirely into VRAM.


d0x360 commented Dec 10, 2024

Can also confirm, on an RTX 4090.


ATStUrNa commented Dec 11, 2024

I have had this issue since version koboldcpp-rocm-1.78.yr0-ROCm (7900 XTX + 2x 7600 XT).
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)

Version 1.77.yr1-ROCm was the last version without this behavior.
I also had to lower the BLAS batch size from 512 to 32 so that larger models work in the newer versions; otherwise I get an out-of-memory error.
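
For reference, the batch size can be set at launch; a sketch of the command line, assuming the usual KoboldCpp flag names (check --help on your build; the model path and layer count are placeholders):

```
python koboldcpp.py --model /path/to/model.gguf --gpulayers 99 --blasbatchsize 32
```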


thijsi123 commented Dec 12, 2024

Same here (using BLAS batch size 256) on an RTX 3090 + RTX 4070 Ti Super.
llm_load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)


henk717 commented Dec 19, 2024

token_embd.weight never offloads to the GPU; this is normal behavior. It's about the rest of the layers that do or don't offload.

@LostRuins
Owner

Please try v1.80; it should be less messy-looking now.
