Batched GEMM doesn't support different GEMM sizes.

For grouped GEMM, we have two (potential) implementations:

1. `cublasGemmGroupedBatchedEx()`. Its performance is not as good as the multi-stream implementation, and it doesn't support FP8 for now.
2. CUTLASS. We evaluated performance with GEMM sizes from popular MoE models, including Mixtral 8x7B, Mixtral 8x22B, Qwen2-57B-A14B, and DeepSeek-V2; multi-stream calls to cuBLASLt showed better performance in most cases (see the sketch below).

We will keep exploring grouped GEMM performance and will add an implementation to TE if a better one emerges.
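For illustration, here's a minimal sketch of what the multi-stream approach can look like. This is not TE's actual code: it uses plain cuBLAS (`cublasGemmEx`) rather than cuBLASLt to keep the example short, and the `GemmProblem` struct and `grouped_gemm_multistream` function are hypothetical names. The key idea is the same: each group (e.g. each MoE expert) has its own m/n/k, which batched GEMM cannot express, so the independent GEMMs are round-robined across streams to overlap on the GPU.

```cpp
// Sketch only: multi-stream grouped GEMM with plain cuBLAS (FP16 in, FP32 accumulate).
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

struct GemmProblem {      // hypothetical: one problem per group/expert
  int m, n, k;            // per-group GEMM shape (may differ across groups)
  const __half *A, *B;    // device pointers, column-major
  float *C;               // device pointer, column-major
};

void grouped_gemm_multistream(const std::vector<GemmProblem> &problems,
                              int num_streams = 4) {
  // One handle per stream, since a cuBLAS handle is bound to a single stream.
  std::vector<cudaStream_t> streams(num_streams);
  std::vector<cublasHandle_t> handles(num_streams);
  for (int i = 0; i < num_streams; ++i) {
    cudaStreamCreate(&streams[i]);
    cublasCreate(&handles[i]);
    cublasSetStream(handles[i], streams[i]);
  }

  const float alpha = 1.0f, beta = 0.0f;
  for (size_t g = 0; g < problems.size(); ++g) {
    const GemmProblem &p = problems[g];
    // cublasGemmEx launches asynchronously, so GEMMs of different sizes
    // issued on different streams can run concurrently.
    cublasGemmEx(handles[g % num_streams], CUBLAS_OP_N, CUBLAS_OP_N,
                 p.m, p.n, p.k, &alpha,
                 p.A, CUDA_R_16F, p.m,
                 p.B, CUDA_R_16F, p.k,
                 &beta,
                 p.C, CUDA_R_32F, p.m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
  }

  for (int i = 0; i < num_streams; ++i) {
    cudaStreamSynchronize(streams[i]);
    cublasDestroy(handles[i]);
    cudaStreamDestroy(streams[i]);
  }
}
```

A grouped or batched kernel fuses everything into one launch, but when the per-group shapes vary this much, a well-overlapped set of library GEMMs can win, which matches the measurements described above.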
Hi, I noticed that in our multi-stream implementation of grouped linear, we didn't use batched GEMM or grouped GEMM. Is there any particular reason for this?