Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winnata, Daniel Fu, Ryan Swan, Muhammad Osama, Christopher Ray, Simran Arora
Links: arxiv | code
AI is gated by hardware. We believe that opening up the compute landscape of AI is one of the most important problems to work on right now. Toward this goal, we present HipKittens: a collection of opinionated programming primitives that simplify state-of-the-art AMD kernel development!
The name comes from HIP, AMD's counterpart to CUDA.
Building towards multi-silicon AI systems
While AI has largely relied on a single hardware vendor to reach its current stage, AMD GPUs now offer state-of-the-art peak compute and memory bandwidth. However, this performance remains out of reach for AI workloads due to the lack of mature AMD software.
| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
|---|---|---|
| BF16 Matrix/Tensor | 2.2 PFLOPS | 2.5 PFLOPS |
| MXFP8 Matrix/Tensor | 4.5 PFLOPS | 5.0 PFLOPS |
| MXFP6 Matrix/Tensor | 4.5 PFLOPS | 10.1 PFLOPS |
| MXFP4 Matrix/Tensor | 9.0 PFLOPS | 10.1 PFLOPS |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |
Table 1: Hardware overview. Peak compute and memory speeds for the latest-generation GPU platforms.
The AMD software ecosystem includes AITER, a high-performance AI kernel library; PyTorch and some compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD’s C++-based programming model for writing kernels. However, despite gigawatt-scale AMD deployments, the software remains brittle.
- Existing software offerings consistently fail to achieve peak performance. CK kernels often perform poorly (see our evaluation below). AITER and PyTorch are unstable: for example, the AITER and PyTorch SDPA Llama GQA backward kernels achieve only 30% and 24% of SoTA performance, respectively, on AMD MI355X GPUs. The compilers currently lag significantly in performance and have not yet demonstrated reusable programming primitives for AMD. Additionally, we discovered that some important hardware behavior around bank conflict avoidance is undocumented in the CDNA ISA; we discuss this in our technical deep-dive blog, and we sketch an illustrative swizzle right after this list.
We expand on some observations about current compilers:
Mojo’s MHA forward kernel suffers from bank conflicts and reaches only about 50% of peak performance on the MI355X (e.g., 430 TFLOPS at B=16, H=16, N=2048, D=128).
TileLang currently supports only CDNA3, and its MHA kernel is described as “competitive with PyTorch”, which is one of the slowest baselines (see results below). TileLang lacks several features we believe are important for AMD kernels – multiple matrix core shapes such as 32x32x16, buffer_load_dwordx4 accesses, XCD chiplet swizzling – and it also depends on Composable Kernel, which adds complexity to the library.
On AMD, Triton struggles with register lifetime tracking and with lowering memory accesses to the most performant intrinsics. For example, it may fail to reclaim registers or to emit vectorized loads. In our evaluation, we found that stock Triton kernels perform worse than AMD’s ROCm/Triton (even on a vanilla BF16 GEMM).
While the Pythonic interface and portability of these compilers are helpful, our goal is to identify the principles that lead to peak performance.
- As a result, AMD’s most performant AI kernels are hand-optimized by experts in raw assembly. This approach is difficult to scale across the breadth of AI workloads, and so many widely used AI workloads remain unsupported or under-optimized on AMD (for example, certain attention problem sizes, non-causal GQA backward passes, and memory-bound kernels).
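To make the bank-conflict point above concrete, here is a minimal, self-contained sketch in plain C++. It is our own illustration, not HipKittens’ actual swizzling scheme: the 32-bank, 4-byte-word layout and the swizzle_offset helper are assumptions for exposition. The idea is that XOR-ing the column index with low bits of the row rotates each row’s layout, so a column of a shared-memory tile spreads across banks instead of serializing on one.

```cpp
// Illustrative only: an XOR swizzle that scatters row-major tile offsets across
// LDS banks so that a column read by one wave does not hit a single bank.
#include <cstdio>
#include <cstdint>

constexpr uint32_t kBanks = 32;   // assumption: 32 banks of 4-byte words

// Hypothetical helper: swizzle the column index with low bits of the row so
// that each row's layout is rotated relative to its neighbors.
constexpr uint32_t swizzle_offset(uint32_t row, uint32_t col, uint32_t row_stride_words) {
    return row * row_stride_words + (col ^ (row % kBanks));
}

int main() {
    constexpr uint32_t stride = 32;  // one 32-word row spans all banks exactly once
    // Column 0 of a 32x32-word tile: naively, every element lands in bank 0.
    for (uint32_t row = 0; row < 8; ++row) {
        uint32_t naive    = row * stride + 0;
        uint32_t swizzled = swizzle_offset(row, 0, stride);
        std::printf("row %2u: naive bank %2u, swizzled bank %2u\n",
                    row, naive % kBanks, swizzled % kBanks);
    }
    return 0;
}
```

Without the swizzle, all eight accesses fall in bank 0; with it, they land in eight different banks.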

Image: Raw assembly. Can’t understand it? That’s the point!
With all this said, the best path for multi-silicon kernel development remains unclear!
As a result, the AI community keeps pointing to a CUDA gap in AI software: Tweet #1, Tweet #2, Tweet #3, and more.

That being said, it was also quite difficult to develop performant NVIDIA kernels a few years ago. It took two years from the release of the H100 GPU until a peak-performance open-source attention kernel appeared, written in low-level CUDA/CUTLASS. Compilers and LLM-based kernel generation on NVIDIA have so far sacrificed performance for simplicity and struggled to quickly support new hardware features.
Opinionated primitives are simplifying NVIDIA kernel development! Through community effort on these DSLs and AI helpers, writing NVIDIA kernels is starting to get easier. Last year, we shared our take on the kernel DSL – keep the familiar PyTorch experience, but create embedded C++ primitives to achieve extreme performance, simplicity, and extensibility across hardware platforms and AI workloads. In May 2024, we shared ThunderKittens (TK), and this year we’re excited to see its ideas used in a wave of frameworks like the CuTe DSL, tinygrad’s “TinyKittens” in September 2025, TileLang in April 2025, and Gluon in June 2025. It has been fun to see TK used by companies like Together AI, Jump Trading, and Cursor, and in academic research.
So we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It was not clear to us where this exploration would land: most modern kernels are designed around NVIDIA-specific hardware features, and AMD hardware is meaningfully different (no wgmma/tcgen05 pipelined async matmuls, no TMA, no mbarriers, no register reallocation, smaller shared memory, chiplets instead of a monolithic die, etc.). We weren’t sure where performance would end up, nor how different the primitives might look compared to the NVIDIA framework.
Our exploration resulted in HipKittens, a minimal, thoughtful collection of C++ embedded programming primitives for fast AMD kernels. We find:
- The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs – tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces – translate naturally to AMD.
- Backend implementations are architecture-specific. The underlying memory access patterns (for example, swizzling scheme, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
- Scheduling strategies adapt to hardware constraints. Scheduling patterns within a processor, and across processors, differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization, for instance, performs poorly on CDNA3 and CDNA4. However, reasoning about scheduling at tile granularity – rather than at the level of individual registers or memory transactions – continues to simplify development, keep code readable, and enable peak performance.
Ultimately, we see that tile-based abstractions remain common across all architectures, providing evidence that a unified, performant programming model for AI accelerators can be achieved. The key insight is to separate the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
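To make the interface/implementation split concrete, here is a small, self-contained C++ sketch. It is our own toy, not the HipKittens API: Tile, CdnaLike, HopperLike, and mma are hypothetical names. The tile type exposes the same load and compute operations everywhere, while a backend policy decides the architecture-specific element placement.

```cpp
// Illustrative only: a toy "tile" whose user-facing interface (load, mma) is
// shared, while the physical layout (the swizzle) is chosen per backend.
// Backend names below are hypothetical, not HipKittens types.
#include <array>
#include <cstdio>
#include <cstdint>

struct CdnaLike   { static uint32_t swizzle(uint32_t r, uint32_t c) { return c ^ (r % 16); } };
struct HopperLike { static uint32_t swizzle(uint32_t r, uint32_t c) { return c ^ ((r / 2) % 8); } };

template <typename Backend, int Rows = 16, int Cols = 16>
struct Tile {
    std::array<float, Rows * Cols> data{};
    // Same interface on every architecture...
    void load(const float* src, int ld) {
        for (int r = 0; r < Rows; ++r)
            for (int c = 0; c < Cols; ++c)
                // ...but the element placement is backend-specific.
                data[r * Cols + Backend::swizzle(r, c)] = src[r * ld + c];
    }
    float& at(int r, int c) { return data[r * Cols + Backend::swizzle(r, c)]; }
};

// A bulk operator written once against the tile interface works for any backend.
template <typename Backend>
void mma(Tile<Backend>& d, Tile<Backend>& a, Tile<Backend>& b) {
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                d.at(i, j) += a.at(i, k) * b.at(k, j);
}

int main() {
    float eye[16 * 16] = {};
    for (int i = 0; i < 16; ++i) eye[i * 16 + i] = 1.0f;
    Tile<CdnaLike> a, b, d;          // swap in HopperLike: the kernel code is unchanged
    a.load(eye, 16); b.load(eye, 16);
    mma(d, a, b);                    // identity * identity = identity
    std::printf("d(3,3) = %.1f\n", d.at(3, 3));
    return 0;
}
```

The point of the sketch is the separation: swapping the backend parameter changes how elements land in memory, while the kernel written against tiles stays the same.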
Getting out of the CUDA gap: Introducing HipKittens
We first shared ThunderKittens for NVIDIA, then ThunderKittens on Apple silicon, and now we’re excited to share HipKittens (HK) for AMD!

Image: The cinematic universe of kittens! Towards Multi-Silicon AI!
HK kernels are performant while remaining easy to read and modify! We no longer need raw assembly for the highest-performance AMD kernels! Life is good!

Image: HipKittens riding the ~wave~ (not warp).
Let’s take a look at the results:
- Our attention forward kernels are written in ~500 lines of code and on average outperform all AMD baselines, including AITER kernels written in hand-optimized assembly! We show results across different head dimensions and sequence lengths, for both causal and non-causal settings.




- The hot loop of our GEMM kernel spans just a few lines of code and achieves peak performance. Again, the AITER and hipBLASLt baselines are programmed in raw assembly! (We sketch what a tile-granularity hot loop looks like after this list.)


- We also find that our attention backward, rotary, and fused dropout-residual-layernorm kernels are faster than the strongest available baselines! These results fix the head dimension and vary the sequence length.
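For a flavor of what a tile-granularity hot loop looks like, here is an illustrative, serial C++ sketch. It is our own stand-in, not HipKittens’ GEMM kernel: Tile, load_tile, and mma_AB are hypothetical names, and a real GPU kernel would add shared-memory staging, pipelining, and an epilogue. For one output tile, the loop simply loads an A tile and a B tile and accumulates a tile matrix multiply along K.

```cpp
// Illustrative sketch of a tile-granularity GEMM hot loop (plain C++, serial).
// Tile, load_tile, and mma_AB are hypothetical stand-ins for DSL primitives.
#include <vector>
#include <cstdio>

constexpr int T = 16;                         // tile edge

struct Tile { float v[T][T] = {}; };

// Load a TxT tile starting at (row, col) from a row-major matrix with leading dim ld.
Tile load_tile(const std::vector<float>& m, int ld, int row, int col) {
    Tile t;
    for (int r = 0; r < T; ++r)
        for (int c = 0; c < T; ++c)
            t.v[r][c] = m[(row + r) * ld + (col + c)];
    return t;
}

// d += a * b over whole tiles (the matrix-core instruction's job on a GPU).
void mma_AB(Tile& d, const Tile& a, const Tile& b) {
    for (int i = 0; i < T; ++i)
        for (int k = 0; k < T; ++k)
            for (int j = 0; j < T; ++j)
                d.v[i][j] += a.v[i][k] * b.v[k][j];
}

int main() {
    const int M = 32, N = 32, K = 64;
    std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f);

    // The "hot loop" for one output tile: everything else in a real kernel is
    // scheduling, staging through shared memory, and the epilogue.
    Tile acc;
    const int out_row = 0, out_col = 0;
    for (int k = 0; k < K; k += T) {
        Tile a = load_tile(A, K, out_row, k);
        Tile b = load_tile(B, N, k, out_col);
        mma_AB(acc, a, b);
    }
    std::printf("acc[0][0] = %.1f (expected %.1f)\n", acc.v[0][0], float(K));
    return 0;
}
```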




Multi-Silicon AI is Coming!
Realizing the full potential of AI requires diverse, open hardware. Today, that means making AMD GPUs truly accessible.
We want more AI in the world. AI has so far relied on, and innovated on top of, a single hardware provider, but we need to be able to access and use all kinds of compute, and to use the fastest hardware we can get. We’re excited to help tackle these problems with HipKittens!
Check out part two for a technical deep dive into HK.
Links: arxiv | code
