Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm #2825

mht-sharma · 2024-12-11T11:51:42Z

What does this PR do?

Fixes the following issues:

Uses FP8 per-tensor scales on ROCm (wherever possible) to leverage scaled_mm efficiently.
Fixes random NaN issues with custom PA on ROCm.
Uses the marlin_kernels and moe_kernels repositories for ROCm.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

danieldk · 2024-12-16T09:37:20Z

Dockerfile_amd

+RUN git clone https://github.com/danieldk/marlin-kernels.git && \
+    cd marlin-kernels && \
+    git checkout ${MARLIN_KERNELS_BRANCH} && \
+    python setup.py install


Not sure how hard it'll be, but it would be nice to have precompiled wheels in the future to cut down the build times.

danieldk · 2024-12-16T09:42:11Z

server/text_generation_server/layers/fp8.py

+def per_tensor_dequantize(
+    tensor: torch.Tensor, inv_scale: Union[float, torch.Tensor]
+) -> torch.Tensor:
+    fake_qweight = tensor.to(torch.float16)


I think this should be the model dtype? (The quantized numbers could represent values that are only in bfloat16's range.)

mht-sharma · 2024-12-18

Yes, I made the changes.
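To illustrate the range concern raised above, here is a minimal sketch using only the Python standard library. It is not code from this PR: the scale value is hypothetical, 448 is assumed as the largest finite e4m3fn value, and bfloat16 is emulated by truncating a float32 to its upper 16 bits.

```python
import struct

FP8_E4M3FN_MAX = 448.0  # assumed: largest finite value in the e4m3fn format


def fits_in_float16(x: float) -> bool:
    """Return True if x can be stored as a finite IEEE half-precision value."""
    try:
        struct.pack("<e", x)  # raises OverflowError when x exceeds float16 range
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the upper 16 bits of a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (out,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return out


inv_scale = 200.0  # hypothetical per-tensor scale, chosen for illustration
dequantized = FP8_E4M3FN_MAX * inv_scale  # 89600.0, above float16's max of 65504

print(fits_in_float16(dequantized))             # False
print(to_bfloat16(dequantized) == dequantized)  # True
```

With a scale this large, the dequantized value no longer fits in float16 but is still representable in bfloat16, which is why dequantizing to the model dtype is the safer choice.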

danieldk · 2024-12-16T09:46:03Z

server/text_generation_server/layers/fp8.py

+                input_scale = (
+                    weights.get_tensor(f"{prefix}.input_scale", to_dtype=False)
+                    .reshape(-1)
+                    .max()


Is this safe? If input_scale is a vector, doesn't it need to be dequantized with the original scales first and then requantized with the new max?

mht-sharma · 2024-12-18

I think this should be fine, since this is the input_scale. Dequantization doesn't make sense in this context because the input arrives unquantized and is simply quantized using this representative scale.
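The difference between the two cases discussed above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: the scale values, the `quantize` and `requantize` helpers, and the FP8 range of 448 are all hypothetical, and rounding to the FP8 grid is omitted.

```python
FP8_MAX = 448.0  # assumed e4m3fn range, for illustration only


def quantize(x: float, scale: float) -> float:
    """Divide by the scale and clamp to the FP8 range (rounding omitted)."""
    return max(-FP8_MAX, min(FP8_MAX, x / scale))


def requantize(q: float, old_scale: float, new_scale: float) -> float:
    """Re-express an already quantized value under a different scale."""
    return quantize(q * old_scale, new_scale)


input_scales = [0.5, 2.0, 1.0]  # hypothetical per-projection input scales
merged = max(input_scales)      # single representative scale

# Activations arrive unquantized, so they are quantized directly with the
# merged scale. Taking the max avoids clipping large activations.
x = 800.0
roundtrip = quantize(x, merged) * merged  # recovers 800.0
clipped = quantize(x, min(input_scales)) * min(input_scales)  # clamped to 224.0

# Stored quantized values (e.g. weights) are different: swapping the scale
# without dequantizing first changes the value they represent.
stored_q = 100.0                 # was quantized with scale 0.5, represents 50.0
naive = stored_q * merged        # 200.0, no longer the original value
fixed = requantize(stored_q, 0.5, merged) * merged  # 50.0
```

Only data that is already stored in quantized form needs the dequantize and requantize step. An input scale is applied to fresh high-precision activations at runtime, so reducing it to its max only changes how coarsely those activations are quantized.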

mht-sharma added 5 commits December 3, 2024 15:12

(feat) convert tscales to tensorwise

e2454db

(fix) fp8 scaling for cuda

e22cb47

(kernel) add marlin-kernels

2264702

add moe-kernels

de35b20

fix moe kernel comit

988c1dc

mht-sharma requested review from danieldk and Narsil December 11, 2024 11:51

danieldk reviewed Dec 16, 2024

fix scaling

bffccdd

mht-sharma requested a review from danieldk December 18, 2024 11:11

mht-sharma added 2 commits December 18, 2024 12:05

Merge branch 'main' into rocm-fp8-tensorwise

19688a0

nm changes

f8771d0
