- Highlights
- Features
- Improvements
- Bug Fixes
- Validated Hardware
- Validated Configurations
Highlights
- Aligned with the Habana 1.19 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
- INT4 weight-only quantization on Intel® Arc™ B-Series Graphics GPU (code-named BattleMage)
Features
- Saving and loading of FP8 checkpoints on Gaudi
- Loading of vLLM/llm-compressor-compatible FP8 checkpoints on Gaudi
- Arbitrary scale method support on Gaudi
- AutoRound INT4 weight-only quantization on Gaudi
- Block-wise calibration for LLMs on Gaudi
- INT4 weight-only quantization on BattleMage
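For readers unfamiliar with the technique behind the AutoRound and weight-only items above: the core of INT4 weight-only quantization is mapping each group of weights to 4-bit signed integers plus one floating-point scale per group. A minimal pure-Python sketch of that idea (illustrative only, not the library's implementation; function names here are hypothetical):

```python
def quantize_int4(weights, group_size=4):
    """Per-group symmetric INT4 quantization sketch.

    Returns (qweights, scales): 4-bit integers in [-8, 7] and one float
    scale per group, so dequantization is simply q * scale.
    """
    qweights, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group) or 1.0
        scale = max_abs / 7.0                    # map largest magnitude to 7
        scales.append(scale)
        for w in group:
            q = round(w / scale)                 # nearest-integer rounding
            qweights.append(max(-8, min(7, q)))  # clamp to the INT4 range
    return qweights, scales

def dequantize_int4(qweights, scales, group_size=4):
    """Reconstruct approximate float weights from INT4 values and scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

# Round-trip a small weight vector; reconstruction error is bounded by
# half of the per-group scale.
w = [0.12, -0.5, 0.33, 0.07, 1.4, -0.9, 0.0, 0.25]
qw, s = quantize_int4(w)
w_hat = dequantize_int4(qw, s)
```

AutoRound-style methods additionally learn the rounding decision rather than always rounding to nearest, which is what the features above refer to.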
Improvements
- Improve FP8 performance by setting the scale as a scalar tensor on Gaudi
- Integrate AutoRound 0.4.2 with its VLM quantization improvements
- Improve safetensors loading for layer-wise quantization in the Transformers-like API
- Improve non-contiguous weight saving in the Transformers-like API
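For context on the FP8 items above: FP8 (E4M3) has a finite range with a maximum representable magnitude of 448, so quantization divides tensor values by a scale chosen to fit that range; the improvement listed keeps that scale as a single scalar rather than a larger tensor. A minimal sketch of per-tensor scale computation (hypothetical helper, not the Gaudi implementation):

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_tensor_scale(values):
    """Compute one scalar scale so values / scale fits the E4M3 range."""
    amax = max(abs(v) for v in values) or 1.0
    return amax / E4M3_MAX

# Scaling a tensor whose max magnitude (900.0) exceeds the FP8 range.
vals = [0.5, -3.2, 900.0, 10.0]
scale = fp8_per_tensor_scale(vals)
scaled = [v / scale for v in vals]  # now all magnitudes are <= 448
```

Keeping the scale as a single scalar means only one value needs to be stored and multiplied per tensor, which is why it can improve performance.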
Bug Fixes
- Fix a layer-wise quantization issue in GPTQ on client GPUs
- Fix a glm-4-9b model out-of-memory issue on BattleMage
Validated Hardware
- Intel Gaudi AI Accelerators (Gaudi 2 and 3)
- Intel Xeon Scalable Processors (4th, 5th, and 6th Gen)
- Intel Core Ultra Processors (Series 1 and 2)
- Intel Data Center GPU Max Series (1100)
- Intel Arc B-Series Graphics GPU (B580)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Windows 11
- Python 3.9, 3.10, 3.11, 3.12
- PyTorch/IPEX 2.3, 2.4, 2.5