
FP8 GEMM Kernels #1391

Open
xiaoxiao26 opened this issue Jan 6, 2025 · 0 comments

Comments


xiaoxiao26 commented Jan 6, 2025

After enabling Transformer Engine's FP8 features for PyTorch on H100, the forward pass of my linear layers launches GEMM kernels like sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas instead of sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas.
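For context, this is roughly the kind of setup I mean (a minimal sketch, not my actual model; the layer sizes, recipe settings, and profiling code here are just illustrative assumptions):

```python
# Minimal sketch: a Transformer Engine te.Linear run under fp8_autocast on H100,
# profiled with torch.profiler so the cuBLAS GEMM kernel names show up.
# Sizes and recipe settings are arbitrary/illustrative, not from my real model.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from torch.profiler import profile, ProfilerActivity

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8192, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)
    torch.cuda.synchronize()

# Kernel names like sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_... appear in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```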

The main difference seems to be bf16bf16_bf16f32_f32 -> e4m3bf16_e4m3f32_f32

I'm curious how to interpret this. I thought the pattern was [input_types]_[accumulator_type]_[output_type], but that would imply that one of the weights or activations is in bf16 rather than FP8, whereas my understanding is that both are cast to FP8. I would appreciate it if anyone could help correct my understanding here. Thank you!

Note that I am also using AMP autocast with bf16, so maybe that is affecting things.
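Concretely, the two contexts are nested roughly like this (again an illustrative sketch, not my exact code; the nesting order shown is an assumption):

```python
# Illustrative only: combining AMP bf16 autocast with TE's FP8 autocast.
# Non-TE ops run in bf16 under AMP, while the te.Linear GEMM runs under FP8.
import torch
import transformer_engine.pytorch as te

layer = te.Linear(4096, 4096).cuda()
x = torch.randn(8192, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    with te.fp8_autocast(enabled=True):
        y = layer(x)  # this is where the sm90_xmma_gemm_e4m3... kernel shows up
```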
