
Intel Neural Compressor Release 3.2

Released by @thuang6 on 28 Dec 13:17
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with the Habana 1.19 release, including improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • INT4 weight-only quantization on Intel® Arc™ B-Series Graphics GPUs (code-named BattleMage)

Features

  • Save and load FP8 checkpoints on Gaudi
  • Load vLLM/llm-compressor-compatible FP8 checkpoints on Gaudi
  • Support arbitrary scale methods on Gaudi
  • AutoRound INT4 weight-only quantization on Gaudi
  • Block-wise calibration for LLMs on Gaudi
  • INT4 weight-only quantization on BattleMage
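
The release notes do not spell out the INT4 scheme, but weight-only INT4 quantization of the kind listed above typically maps each group of weights to signed 4-bit integers with a per-group scale. As a rough illustration only (plain Python, symmetric round-to-nearest, a toy group size of 4; this is not Intel Neural Compressor's implementation, and methods such as AutoRound additionally tune the rounding):

```python
# Illustrative symmetric INT4 weight-only quantization (round-to-nearest).
# Hypothetical sketch, NOT Intel Neural Compressor's implementation;
# group size and scheme are simplified for readability.

def quantize_int4(weights, group_size=4):
    """Quantize a flat list of float weights to signed INT4 per group."""
    qweights, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Symmetric scale: the largest magnitude maps to the INT4 extreme (7);
        # an all-zero group falls back to scale 1.0 to avoid division by zero.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qweights, scales

def dequantize_int4(qweights, scales, group_size=4):
    """Recover approximate float weights from INT4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

w = [0.12, -0.70, 0.35, 0.05, 1.40, -0.20, 0.90, -1.10]
qw, s = quantize_int4(w)
w_hat = dequantize_int4(qw, s)
```

The reconstruction error per weight is bounded by half the group's scale, which is why smaller group sizes (at the cost of more scale metadata) generally improve accuracy.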

Improvements

  • Improve FP8 performance by representing scales as scalar tensors on Gaudi
  • Integrate AutoRound 0.4.2 with VLM quantization improvements
  • Improve safetensors loading for layer-wise quantization in the Transformers-like API
  • Improve non-contiguous weight saving in the Transformers-like API

Bug Fixes

  • Fix a layer-wise quantization issue in GPTQ on client GPUs
  • Fix an out-of-memory issue with the glm-4-9b model on BattleMage

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable Processors (4th, 5th, and 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)
  • Intel Arc B-Series Graphics GPU (B580)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.3, 2.4, 2.5