Rewriting Transformer Blocks as GEMM-Epilogue Programs

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs, by Han Guo and 6 other authors.

Abstract: Transformer training systems are built around dense linear algebra, yet a non-trivial fraction of end-to-end time is spent on the surrounding memory-bound operators. Normalization, activation, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement a significant bottleneck in an otherwise highly optimized training stack. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized so that they execute while the GEMM output tile is still on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reduction, pairwise transformation, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover almost all non-attention computations in the forward and backward passes of standard Transformer blocks. On representative Transformer workloads, both human-written and LLM-written CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
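To make the GEMM-plus-epilogue idea concrete, here is a minimal pure-Python sketch of one well-known reparameterization of this kind. It is not taken from the paper, and the paper's actual primitives and kernels may differ. It assumes an RMSNorm followed by a linear layer and a residual add. The normalization gain is folded into the weight, so the per-row factor 1/rms(x) can be applied to each output tile as a scaling epilogue, followed by the residual as an accumulation epilogue. All function and variable names (`gemm_tile`, `gemm_with_epilogue`, `tile_n`, and so on) are hypothetical.

```python
import math

def gemm_tile(x, w, col_start, col_end):
    """Stand-in for the fixed GEMM mainloop: the tile x @ w[:, col_start:col_end]."""
    k_dim = len(w)
    return [[sum(row[k] * w[k][j] for k in range(k_dim))
             for j in range(col_start, col_end)] for row in x]

def reference(x, gain, w, residual, eps=1e-6):
    """Unfused baseline: materialize RMSNorm(x), run the GEMM, then add the residual."""
    normed = []
    for row in x:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * inv_rms * g for v, g in zip(row, gain)])
    y = gemm_tile(normed, w, 0, len(w[0]))
    return [[a + b for a, b in zip(y_row, r_row)]
            for y_row, r_row in zip(y, residual)]

def gemm_with_epilogue(x, gain, w, residual, tile_n=2, eps=1e-6):
    """Same result, with the normalization moved into a per-tile epilogue."""
    n = len(w[0])
    # Fold the RMSNorm gain into the weight once: (x * g) @ W == x @ (diag(g) W).
    w_folded = [[g * w_kj for w_kj in w_row] for g, w_row in zip(gain, w)]
    # Per-row statistic (a row reduction over the raw input).
    inv_rms = [1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
               for row in x]
    out = [[0.0] * n for _ in x]
    for c0 in range(0, n, tile_n):
        c1 = min(c0 + tile_n, n)
        acc = gemm_tile(x, w_folded, c0, c1)  # mainloop on the un-normalized input
        # Epilogue on the tile while it is still "on chip":
        # row scaling by inv_rms, then accumulation of the residual.
        for i, acc_row in enumerate(acc):
            for j, v in enumerate(acc_row):
                out[i][c0 + j] = v * inv_rms[i] + residual[i][c0 + j]
    return out

# Small illustrative inputs (made up for this sketch).
x = [[1.0, 2.0, 3.0], [-1.0, 0.5, 2.0]]
gain = [0.5, 1.0, 2.0]
w = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
residual = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

fused = gemm_with_epilogue(x, gain, w, residual)
ref = reference(x, gain, w, residual)
max_err = max(abs(a - b) for f_row, r_row in zip(fused, ref)
              for a, b in zip(f_row, r_row))
```

In this sketch the normalized activation tensor is never materialized: the GEMM reads the raw input, and the normalization and residual add happen on each output tile before it is stored. On a GPU this is where the savings in global-memory traffic would come from, though the Python loops here only illustrate the algebra, not the performance.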

Submission History

From: Han Guo
[v1] Tue, 19 May 2026 02:30:43 UTC (1,121 KB)
[v2] Wed, 20 May 2026 17:38:24 UTC (493 KB)


