RightNow-AI/autokernel

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to an optimized Triton kernel.

autokernel progress

Inspired by @karpathy/autoresearch, which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: the agent edits a file, runs a fixed evaluation, keeps or reverts the change, and repeats forever.

Give AutoKernel any PyTorch model. It:

  1. Profiles the model to find which GPU kernels are the bottlenecks
  2. Extracts each bottleneck as a standalone Triton kernel
  3. Optimizes each kernel autonomously (edit, benchmark, keep/revert, forever)
  4. Verifies end-to-end accuracy and reports the overall speedup

The agent reads program.md, the "research org code", which contains the full instructions for autonomous operation. It edits kernel.py (one kernel at a time), runs bench.py (a fixed benchmark with a 5-stage correctness check plus roofline analysis), and either keeps or reverts the change. The orchestrator uses Amdahl's law to decide when to move on to the next kernel.

Each experiment takes ~90 seconds. That is ~40 experiments/hour across all kernels, ~320 overnight.
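The edit-benchmark-keep/revert loop can be sketched in a few lines of Python. Names and signatures here are illustrative, not AutoKernel's actual API:

```python
def autokernel_loop(edit, benchmark, snapshot, restore, budget):
    """Hypothetical sketch of the core keep-or-revert loop:
    edit kernel.py, benchmark it, keep the change only if it is
    both correct and faster, otherwise revert."""
    best = benchmark()                  # baseline latency (microseconds)
    for _ in range(budget):
        saved = snapshot()              # remember current kernel.py
        edit()                          # agent proposes a change
        result = benchmark()            # None means incorrect or crashed
        if result is not None and result < best:
            best = result               # keep: correct and faster
        else:
            restore(saved)              # revert: wrong or slower
    return best
```

Crashes and regressions cost one experiment each but never corrupt the working kernel, which is what lets the loop run unattended overnight.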

Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
 --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py

Spin up Claude Code, Codex, or any coding agent in this directory:

Read program.md and let's kick off a new experiment. Start with setup.

The agent will:

  1. Profile your model and present an optimization plan
  2. Create a branch (e.g., autokernel/mar10-llama7b)
  3. Optimize each bottleneck kernel in priority order
  4. Verify end-to-end accuracy and report total speedup

program.md is intentionally thorough so the agent can run for 10+ hours without getting stuck. It includes a 6-level optimization playbook, a decision framework, crash handling, and Amdahl's-law logic.

                 profile.py              extract.py           bench.py (loop)         verify.py
Any PyTorch  ──>  Rank kernels  ──>  Generate baseline  ──>  Optimize each  ──>  End-to-end
   model          by GPU time       Triton kernels          kernel (agent)       verification

| tool | what it does |
| --- | --- |
| profile.py | Profiles any PyTorch model with torch.profiler; ranks kernels by GPU time and classifies each as compute- or memory-bound |
| extract.py | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| orchestrate.py | Multi-kernel scheduler: uses Amdahl's law to decide which kernel to optimize next and tracks overall progress |
| bench.py | Fixed benchmark: 5-stage correctness check (smoke test, size sweep, numerical stability, determinism, edge cases) plus performance and roofline ceiling |
| verify.py | Plugs optimized kernels back into the model, checks end-to-end correctness, and reports total speedup |

9 kernel types covering the main operations of modern deep learning:

| kernel | description | key metric |
| --- | --- | --- |
| matmul | Dense matrix multiplication, (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel, numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross-entropy loss | GB/s |
| rotary_embedding | Rotary position embedding (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |

Each has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
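For illustration, the kind of numerically stable reference that reference.py provides for softmax can be sketched in plain Python. This is a sketch, not the shipped implementation:

```python
import math

def softmax_reference(row):
    """Numerically stable softmax for one row: subtract the row
    maximum before exponentiating so exp() never overflows, even
    for inputs like [1000.0, 1000.0]."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

The max-subtraction trick is the same one the GB/s-bound Triton kernel must preserve while fusing the row reduction and the normalization into one pass over memory.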

Self-contained model definitions ship with AutoKernel (no transformers library required):

| model | file | parameters | usage |
| --- | --- | --- | --- |
| GPT-2 small | models/gpt2.py | 124M | --class-name GPT2 --input-shape 1,1024 |
| LLaMA (compact) | models/llama_7b.py | 160M | --class-name LlamaModel --input-shape 1,512 |
| LLaMA 7B | models/llama_7b.py | 7B | --class-name LlamaModel7B --input-shape 1,2048 |
| BERT base | models/bert_base.py | 110M | --class-name BertModel --input-shape 8,512 |
| custom | models/custom.py | | Template for your own model |

For HuggingFace models (uv sync --extra models):

uv run profile.py --module transformers --class-name AutoModelForCausalLM \
 --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
autokernel/
  kernel.py             the file the agent modifies (one kernel at a time)
  program.md            agent instructions -- the "research org code"

  bench.py              fixed benchmark + 5-stage correctness harness
  reference.py          PyTorch reference implementations (ground truth)
  prepare.py            one-time setup: test data, baselines

  profile.py            profile any PyTorch model, rank kernels by GPU time
  extract.py            extract bottleneck kernels into workspace/
  orchestrate.py        multi-kernel scheduler (Amdahl's law)
  verify.py             end-to-end model verification + speedup report
  analysis.py           experiment visualization (generates progress.png)

  kernels/              starter Triton kernels (9 types)
  models/               self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/            runtime artifacts (gitignored)

Why Triton? Readable, Python-like syntax that the agent can understand and modify without writing PTX or SASS. A well-tuned Triton kernel routinely reaches 80-95% of cuBLAS. The agent needs to iterate fast: Triton compiles in seconds, not minutes.

Accuracy first. The benchmark checks the kernel's output against PyTorch before measuring performance. A fast but incorrect kernel is reverted immediately. This prevents the agent from "optimizing" by generating garbage.
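A minimal sketch of such an accuracy gate, assuming a simple elementwise tolerance check (bench.py's actual harness has five stages; the function name and tolerances here are illustrative):

```python
def accept(candidate_out, reference_out, atol=1e-2, rtol=1e-2):
    """Accuracy gate: a candidate kernel's output must match the
    PyTorch reference within absolute + relative tolerance before
    its timing is even recorded. Tolerances are hypothetical
    defaults; fp16 kernels need looser bounds than fp32."""
    for c, r in zip(candidate_out, reference_out):
        if abs(c - r) > atol + rtol * abs(r):
            return False  # fast-but-wrong: revert, do not benchmark
    return True
```

Gating on correctness first means a broken kernel costs one failed experiment rather than a misleading speedup entry in the log.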

Amdahl’s law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a kernel that takes 60% of runtime (1.25x end-to-end) beats a 3x speedup on a kernel that takes 5% (1.03x end-to-end). It moves on when returns diminish.
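The arithmetic follows directly from Amdahl's law: a kernel taking fraction f of runtime, sped up by factor s, yields an end-to-end speedup of 1 / ((1 - f) + f / s). A quick check of the numbers above:

```python
def amdahl(f, s):
    """End-to-end speedup when a fraction f of total runtime
    is accelerated by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / s)

# 1.5x on a kernel taking 60% of runtime beats 3x on one taking 5%:
# amdahl(0.60, 1.5) = 1.25, amdahl(0.05, 3.0) ~ 1.03
```

This is why the scheduler will happily abandon a kernel with headroom left if a heavier kernel promises more total wall-clock savings.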

Single file to modify. The agent touches only kernel.py. The scope stays manageable, the diffs stay reviewable, the history stays clean.

TSV logging. Results go to a flat results.tsv file. Human-readable, git-friendly, trivially parsable, no infrastructure.

Every experiment is logged to results.tsv (tab-separated):

| column | description |
| --- | --- |
| experiment | Sequential experiment number (0 = baseline) |
| tag | Short identifier |
| kernel_type | Which kernel (e.g., matmul) |
| throughput_tflops | Measured throughput (higher is better) |
| latency_us | Execution time in microseconds |
| pct_peak | Percentage of the GPU's theoretical peak |
| speedup_vs_pytorch | Speedup vs PyTorch/cuBLAS |
| correctness | pass, fail, timeout, or crash |
| peak_vram_mb | Peak GPU memory usage |
| description | What was attempted |
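"Trivially parsable" means the standard-library csv module reads the log directly. A sketch, assuming the column names documented above and lowercase status values (the function itself is illustrative, not shipped tooling):

```python
import csv

def best_experiments(path="results.tsv"):
    """Group results.tsv rows by kernel_type and keep the best
    passing speedup for each. Assumes tab-separated rows with a
    header line and a lowercase 'pass' correctness value."""
    best = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["correctness"] != "pass":
                continue  # failed/timed-out/crashed runs never count
            speedup = float(row["speedup_vs_pytorch"])
            kernel = row["kernel_type"]
            if speedup > best.get(kernel, 0.0):
                best[kernel] = speedup
    return best
```

Because the file is append-only plain text, the same log diffs cleanly in git and survives agent crashes mid-run.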

This project is autoresearch for GPU kernels, directly inspired by Andrej Karpathy's autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent could run hundreds of experiments overnight, systematically exploring a search space and logging every result. AutoKernel implements the same loop, an agent that edits a file, runs a fixed evaluation, and keeps or reverts, in the domain of GPU kernel optimization with Triton.

Created by the team behind Forge.

MIT


