Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to an optimized Triton kernel.

Inspired by @karpathy/autoresearch, which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: the agent modifies a file, runs a fixed evaluation, keeps or reverts, and repeats forever.
Give AutoKernel any PyTorch model and it will:
- Profile the model to detect which GPU kernels are bottlenecks
- Extract each bottleneck as a standalone Triton kernel
- Optimize each kernel autonomously (edit, benchmark, keep/revert – forever)
- Verify end-to-end accuracy and report the overall speedup
The agent reads program.md – the "research org code" – which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (a fixed benchmark with a 5-stage correctness check + roofline analysis), and either keeps or reverts the change. The orchestrator uses Amdahl's law to decide when to move on to the next kernel.
Each experiment takes ~90 seconds. That is ~40 experiments/hour across all kernels, ~320 overnight.
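The edit–benchmark–keep/revert loop is simple enough to sketch in a few lines of Python. This is a hypothetical illustration, not the actual AutoKernel API; the mock "edits" are deterministic numbers standing in for agent proposals:

```python
def autonomous_loop(benchmark, propose_edit, apply_edit, revert_edit, budget):
    """Edit -> benchmark -> keep or revert, repeated until the budget runs out.

    benchmark() returns a latency (lower is better), or None if the
    kernel fails the correctness check -- failures are always reverted.
    """
    best = benchmark()                 # baseline measurement
    for _ in range(budget):
        edit = propose_edit()          # agent suggests a change
        apply_edit(edit)
        result = benchmark()           # fixed evaluation: correctness + speed
        if result is not None and result < best:
            best = result              # keep: correct and faster
        else:
            revert_edit(edit)          # revert: incorrect or slower
    return best

# Deterministic mocks: alternating good (-5 us) and bad (+3 us) edits.
state = {"latency_us": 100.0}
edits = iter([-5.0, +3.0, -5.0, +3.0])
benchmark = lambda: state["latency_us"]
propose = lambda: next(edits)
apply_edit = lambda e: state.__setitem__("latency_us", state["latency_us"] + e)
revert = lambda e: state.__setitem__("latency_us", state["latency_us"] - e)
best = autonomous_loop(benchmark, propose, apply_edit, revert, budget=4)
# best == 90.0: the two good edits are kept, the two bad ones reverted
```

The point of the structure is that the agent never accumulates unverified changes: every edit either survives a fresh benchmark or is rolled back before the next attempt.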
Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.
```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
    --input-shape 1,512 --dtype float16

# Extract the top bottleneck kernels
uv run extract.py --top 5

# Verify the benchmark works
uv run bench.py
```
Spin up Claude, Codex, or any coding agent in this directory:

> Read program.md and let's kick off a new experiment. Start with setup.
The agent will:
- Profile your model and present the optimization plan
- Create a branch (e.g., autokernel/mar10-llama7b)
- Optimize each bottleneck kernel in priority order
- Verify end-to-end accuracy and report the total speedup
program.md is intentionally broad so the agent can run for 10+ hours without getting stuck. It includes a 6-level optimization playbook, a decision framework, crash handling, and Amdahl's-law logic.
```
 profile.py       extract.py        bench.py (loop)    verify.py

Any PyTorch ──> Rank kernels ──> Generate baseline ──> Optimize each ──> End-to-end
   model         by GPU time     Triton kernels        kernel (agent)    verification
```
| tool | what it does |
|---|---|
| profile.py | Profiles any PyTorch model with torch.profiler; ranks kernels by GPU time and classifies each as compute- or memory-bound |
| extract.py | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| orchestrate.py | Multi-kernel scheduler: uses Amdahl's law to decide which kernel to optimize next and tracks overall progress |
| bench.py | Fixed benchmark: 5-stage correctness check (smoke, size sweep, numerical stability, determinism, edge cases) + performance + roofline ceiling |
| verify.py | Plugs the optimized kernel back into the model, checks end-to-end correctness, reports the total speedup |
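The compute-/memory-bound split that profile.py produces is a roofline-style test: compare a kernel's arithmetic intensity (FLOPs per byte moved) to the GPU's machine balance. A minimal sketch, using rough public H100 SXM specs (~990 fp16 TFLOPS, ~3350 GB/s) rather than any values from this repo:

```python
def classify(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    """Roofline test: arithmetic intensity vs. the GPU's machine balance."""
    intensity = flops / bytes_moved                       # FLOP per byte moved
    balance = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)  # ridge point, FLOP/byte
    return "compute-bound" if intensity > balance else "memory-bound"

n = 4096
# fp16 matmul: 2*n^3 FLOPs, three n*n fp16 tensors (A, B, C) moved
matmul = classify(2 * n**3, 3 * n * n * 2, 990, 3350)   # "compute-bound"
# elementwise add: n*n FLOPs, same three tensors moved
add = classify(n * n, 3 * n * n * 2, 990, 3350)         # "memory-bound"
```

A compute-bound kernel is measured against peak TFLOPS; a memory-bound one against peak bandwidth, which is why the kernel table below mixes TFLOPS and GB/s as key metrics.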
9 kernel types covering the core operations of modern deep learning:

| kernel | description | key metric |
|---|---|---|
| matmul | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross-entropy loss | GB/s |
| rotary_embedding | Rotary position embedding (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |
Each kernel has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
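As an illustration of what one of these references computes, here is the numerically stable softmax from the table above in plain Python. This is a sketch of the math only; the actual reference.py uses PyTorch:

```python
import math

def stable_softmax(row):
    """Numerically stable softmax: shift by the row max before exponentiating,
    so exp() never sees a large positive argument and cannot overflow."""
    m = max(row)                          # row max; exp(x - m) <= 1 for all x
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Naive exp(1002.0) overflows a float64; the shifted version is fine.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

The benchmark compares the Triton kernel's output against exactly this kind of reference computation before any timing is recorded.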
Self-contained model definitions ship with AutoKernel (no transformers library required):
| model | file | params | usage |
|---|---|---|---|
| GPT-2 small | models/gpt2.py | 124M | --class-name GPT2 --input-shape 1,1024 |
| LLaMA (compact) | models/llama_7b.py | 160M | --class-name LlamaModel --input-shape 1,512 |
| LLaMA 7B | models/llama_7b.py | 7B | --class-name LlamaModel7B --input-shape 1,2048 |
| BERT-base | models/bert_base.py | 110M | --class-name BertModel --input-shape 8,512 |
| custom | models/custom.py | — | Template for your own model |
For HuggingFace models (uv sync --extra models):

```bash
uv run profile.py --module transformers --class-name AutoModelForCausalLM \
    --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
```
```
autokernel/
  kernel.py        the file the agent modifies (one kernel at a time)
  program.md       agent instructions -- the "research org code"
  bench.py         fixed benchmark + 5-stage correctness harness
  reference.py     PyTorch reference implementations (ground truth)
  prepare.py       one-time setup: test data, baselines
  profile.py       profile any PyTorch model, rank kernels by GPU time
  extract.py       extract bottleneck kernels into workspace/
  orchestrate.py   multi-kernel scheduler (Amdahl's law)
  verify.py        end-to-end model verification + speedup report
  analysis.py      experiment visualization (generates progress.png)
  kernels/         starter Triton kernels (9 types)
  models/          self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/       runtime artifacts (gitignored)
```
Why Triton? Readable, Python-like syntax the agent can understand and modify inline without mastering PTX or SASS. A well-tuned Triton kernel routinely reaches 80–95% of cuBLAS. The agent needs to iterate fast – Triton compiles in seconds, not minutes.
Accuracy first. The benchmark checks the kernel's output against PyTorch before measuring performance. A fast but incorrect kernel is reverted immediately. This prevents the agent from "optimizing" by generating garbage.
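The accuracy gate amounts to an elementwise tolerance comparison that runs before any timing. A hypothetical sketch (the real bench.py stages and tolerances live in the repo; the function and names here are invented for illustration):

```python
def passes_accuracy_gate(candidate, reference, inputs, rtol=1e-3, atol=1e-5):
    """Compare candidate output to the reference before timing anything."""
    for x in inputs:
        got, want = candidate(x), reference(x)
        if len(got) != len(want):
            return False
        for g, w in zip(got, want):
            # Elementwise allclose: |g - w| <= atol + rtol * |w|
            if abs(g - w) > atol + rtol * abs(w):
                return False
    return True  # only now is the kernel worth benchmarking

ref = lambda xs: [2.0 * x for x in xs]        # ground-truth op: y = 2x
fast_ok = lambda xs: [x + x for x in xs]      # correct rewrite: kept
fast_bad = lambda xs: [0.0 for _ in xs]       # fast garbage: rejected
inputs = [[1.0, 2.0, 3.0], [-4.0, 0.5, 7.0]]
```

Because the gate runs first, a kernel that returns instantly but produces zeros never reaches the performance measurement at all.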
Amdahl's law orchestration. The orchestrator prioritizes by impact: a 1.5x speedup on a kernel taking 60% of runtime (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when returns diminish.
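Those numbers follow directly from Amdahl's law: if a kernel accounts for fraction f of total runtime and gets a local speedup s, the end-to-end speedup is 1 / ((1 - f) + f / s). Checking the two cases:

```python
def end_to_end_speedup(fraction, local_speedup):
    """Amdahl's law: overall gain from speeding up one fraction of runtime."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

big_kernel = end_to_end_speedup(0.60, 1.5)    # 1.25x end-to-end
small_kernel = end_to_end_speedup(0.05, 3.0)  # ~1.03x end-to-end
```

This is why the scheduler can justify abandoning a kernel mid-optimization: once the remaining fraction of runtime is small, even a large local win barely moves the end-to-end number.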
Single file to modify. The agent only touches kernel.py. The scope stays manageable, the diffs stay reviewable, the history stays clean.
TSV logging. Results append to a flat results.tsv file. Human-readable, git-friendly, trivially parsable, no infrastructure.
Every experiment is logged to results.tsv (tab-separated):
| column | description |
|---|---|
| experiment | Sequential experiment number (0 = baseline) |
| tag | Short identifier |
| kernel_type | Which kernel (e.g., matmul) |
| throughput_tflops | Measured throughput (higher is better) |
| latency_us | Execution time in microseconds |
| pct_peak | Percentage of GPU theoretical peak |
| speedup_vs_pytorch | Speedup vs PyTorch/cuBLAS |
| correctness | PASS, FAIL, TIMEOUT, or CRASH |
| peak_vram_mb | Peak GPU memory usage |
| description | What was attempted |
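Because the log is plain TSV, analyzing it needs nothing beyond the standard library. A sketch using a subset of the columns above (the sample values are invented):

```python
import csv
import io

# Two invented rows in the results.tsv layout described above.
sample = (
    "experiment\ttag\tkernel_type\tthroughput_tflops\tcorrectness\n"
    "0\tbaseline\tmatmul\t312.0\tPASS\n"
    "1\ttile-128\tmatmul\t401.5\tPASS\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
# Best passing experiment by measured throughput.
best = max(
    (r for r in rows if r["correctness"] == "PASS"),
    key=lambda r: float(r["throughput_tflops"]),
)
```

The same one-liner works on the real file by swapping io.StringIO(sample) for open("results.tsv"), which is the "trivially parsable" property the format is chosen for.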
This project is autoresearch for GPU kernels – directly inspired by Andrej Karpathy's autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent could run hundreds of experiments overnight, systematically exploring a search space and logging every result. AutoKernel implements the same loop – the agent edits a file, runs a fixed evaluation, keeps or reverts – in the domain of GPU kernel optimization with Triton.
Created by the team behind Forge.
MIT