Highlights
- Improved NCCL API integration in MSCCL++ for better performance and usability
- Enhanced execution plan-based executor in MSCCL++
- Fixed several bugs to improve stability and reliability
What's Changed
- Add support for different vector sizes in multimem instructions by @roshandathathri in #332
- NCCL API Executor Integration by @caiomcbr in #331
- Fix missing import in executor test by @yzygitzh in #334
- bfloat16 support by @chhwang in #336
- Dynamically load libibverbs by @caiomcbr in #337
- Auto-tune vector sizes for NVLS allreduce6 by @roshandathathri in #338
- Make ibverbs optional at compile time by @chhwang in #340
- ProxyChannel Support in Executor by @caiomcbr in #342
- Support executors to send packets over ProxyChannel by @caiomcbr in #344
- Fix for ROCm 6.0 by @chhwang in #347
- Fix bug in semaphore construction by @Binyang2014 in #341
- Add proxy channel related operations by @Binyang2014 in #351
- Add CI for ROCm by @Binyang2014 in #346
- Tune threads per block for mscclpp executor by @Binyang2014 in #345
- Fix NPKit exit event offset by @yzygitzh in #356
- Use IB transport flags only when an IB device exists by @chhwang in #355
- Update ROCm CI by @chhwang in #357
- Fixing RegisterMemory Allocation for ProxyChannels by @caiomcbr in #353
- Fix NCCL API bugs by @chhwang in #363
- Perf optimization & support clipping by @chhwang in #364
- Fix copyright messages by @chhwang in #367
- [Doc] mscclpp docs by @Binyang2014 in #348
- Executor AllGather In-Place Support by @caiomcbr in #365
- Fix algo repo name by @Binyang2014 in #369
- Update Docker image for CUDA 12.4 by @Binyang2014 in #370
- Fix in-place all-gather input buffer in executor_test by @yzygitzh in #372
- [docs] fix quickstart link by @jeffra in #374
- Add kernel-based verification for executor_test by @yzygitzh in #378
- Lazily create the context stream by @chhwang in #381
- Fixing Bug Const Offset in Execution Plan by @caiomcbr in #380
- Fix light load bug by @Binyang2014 in #379
- Small adjustment to AllGather test data in executor test by @caiomcbr in #384
- Fix missing packet parameter for executor by @yzygitzh in #385
- NVLS support for MSCCL++ executor by @Binyang2014 in #375
- Fix typo by @Binyang2014 in #389
- Improve CMake options by @chhwang in #376
- Fixing Message Boundary AllReduce Fallback Code by @caiomcbr in #391
- Fix mscclpp_benchmark by @Binyang2014 in #392
- Add cross threadblock barrier by @Binyang2014 in #383
- AllGather Executor Support in NCCL Interface by @caiomcbr in #393
- Providing reduce-scatter test support by @caiomcbr in #390
- Select algo according to json config by @Binyang2014 in #396
- Add connection events for NPKit by @yzygitzh in #386
- Revised ProxyChannel interfaces by @chhwang in #400
- Set up pipeline for MSCCL++ over NCCL by @Binyang2014 in #401
- Exception for max number of operations per threadblock by @caiomcbr in #405
- Reduce memory usage for scratch buffer by @Binyang2014 in #403
- [Cherry-pick] Move pipeline to official org (#406) by @Binyang2014 in #416
- [Cherry-pick] trigger ci for release branches (#426) by @Binyang2014 in #427
- [Cherry-pick] Disable CuMemMap check for ROCm (#411) by @Binyang2014 in #424
- [Cherry-pick] NVLS support for NCCL API (#410) by @Binyang2014 in #425
- [Cherry-pick] Fix nccl-test failure issue (#421) by @Binyang2014 in #429
New Contributors
Full Changelog: v0.5.2...v0.6.0