Configurable max CTAs and NVLS usage for CUDA IPC communicator #4227

lzhangzz · 2025-12-20T14:15:54Z

TM_COMM_MAX_CTAS: sets the maximum number of CTAs used in non-LL collectives.
TM_COMM_NVLS_ENABLE: controls whether NVLS may be used, so it can be turned off on systems where NVLS malfunctions.
TM_COMM_COPY_THRESHOLD: send-size threshold at which all-gather switches to the copy-engine-based implementation (10-15% faster for large send sizes because of reduced protocol cost).
The default number of CTAs for NVLS-based all-reduce collectives is increased to 16. The previous value of 4 was tuned for NV8-powered systems; the new value enables 2.5x peak throughput on NV18-powered systems.
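
For readers unfamiliar with this kind of knob, here is a minimal sketch of how the three environment variables might be read and applied. It is not the code from this PR: the helper names (`GetEnvInt`, `LoadCommConfig`, `ClampCtas`), the `CommConfig` struct, and the fallback values are all hypothetical and chosen only for illustration. Only the variable names come from the description above.

```cpp
#include <algorithm>
#include <cstdlib>

// Placeholder defaults for illustration only; the real defaults are not
// stated in the PR description.
constexpr long kDefaultMaxCtas       = 32;
constexpr long kDefaultNvlsEnable    = 1;
constexpr long kDefaultCopyThreshold = 1L << 20;

// Hypothetical helper: read an integer environment variable and fall back
// to `fallback` when it is unset, empty, or not a plain decimal integer.
inline long GetEnvInt(const char* name, long fallback)
{
    const char* s = std::getenv(name);
    if (s == nullptr || *s == '\0') {
        return fallback;
    }
    char*      end = nullptr;
    const long v   = std::strtol(s, &end, 10);
    return (end != s && *end == '\0') ? v : fallback;
}

// Hypothetical container for the three settings named in the description.
struct CommConfig {
    int  max_ctas;        // TM_COMM_MAX_CTAS
    bool nvls_enable;     // TM_COMM_NVLS_ENABLE (0 disables, non-zero enables)
    long copy_threshold;  // TM_COMM_COPY_THRESHOLD (send size)
};

inline CommConfig LoadCommConfig()
{
    CommConfig c;
    c.max_ctas       = static_cast<int>(GetEnvInt("TM_COMM_MAX_CTAS", kDefaultMaxCtas));
    c.nvls_enable    = GetEnvInt("TM_COMM_NVLS_ENABLE", kDefaultNvlsEnable) != 0;
    c.copy_threshold = GetEnvInt("TM_COMM_COPY_THRESHOLD", kDefaultCopyThreshold);
    return c;
}

// Cap a kernel's preferred CTA count at the configured maximum,
// while always launching at least one CTA.
inline int ClampCtas(int preferred, int max_ctas)
{
    return std::max(1, std::min(preferred, max_ctas));
}
```

Under these assumptions, a launch site would call `ClampCtas(preferred, LoadCommConfig().max_ctas)` before choosing its grid size, and would take the NVLS or copy-engine path only when `nvls_enable` or the send size permits it.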

lzhangzz added 4 commits December 20, 2025 10:22

configurable max ctas and nvls usage

9c2a962

more ctas for AG

6d080a5

minor

b4e563a

fix lint

82cc2c3

lvhan028 requested a review from irexyc December 23, 2025 03:02

lvhan028 added the improvement label Dec 23, 2025

irexyc approved these changes Dec 23, 2025
