RightNow-AI/autokernel

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to an optimized Triton kernel.

autokernel progress

Inspired by @karpathy/autoresearch, which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: the agent edits a file, runs a fixed evaluation, keeps or reverts the change, and repeats forever.

Give AutoKernel any PyTorch model. It:

  1. Profiles the model to find which GPU kernels are the bottlenecks
  2. Extracts each bottleneck as a standalone Triton kernel
  3. Optimizes each kernel autonomously (edit, benchmark, keep/revert, forever)
  4. Verifies end-to-end accuracy and reports the overall speedup

The agent reads program.md, the "research org code", which contains the full instructions for autonomous operation. It edits kernel.py (one kernel at a time), runs bench.py (a fixed benchmark with a 5-stage correctness check plus roofline analysis), and either keeps or reverts the change. The orchestrator uses Amdahl's law to decide when to move on to the next kernel.

Each experiment takes ~90 seconds. That is ~40 experiments/hour across all kernels, ~320 overnight.
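The edit-benchmark-keep/revert loop can be sketched in a few lines of Python. Names and signatures here are illustrative, not AutoKernel's actual API:

```python
def autokernel_loop(edit, benchmark, snapshot, restore, budget):
    """Hypothetical sketch of the core keep-or-revert loop:
    edit kernel.py, benchmark it, keep the change only if it is
    both correct and faster, otherwise revert."""
    best = benchmark()                  # baseline latency (microseconds)
    for _ in range(budget):
        saved = snapshot()              # remember current kernel.py
        edit()                          # agent proposes a change
        result = benchmark()            # None means incorrect or crashed
        if result is not None and result < best:
            best = result               # keep: correct and faster
        else:
            restore(saved)              # revert: wrong or slower
    return best
```

Crashes and regressions cost one experiment each but never corrupt the working kernel, which is what lets the loop run unattended overnight.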

Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
 --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py

Spin up Claude Code, Codex, or any coding agent in this directory:

Read program.md and let's kick off a new experiment. Start with setup.

The agent will:

  1. Profile your model and present an optimization plan
  2. Create a branch (e.g., autokernel/mar10-llama7b)
  3. Optimize each bottleneck kernel in priority order
  4. Verify end-to-end accuracy and report total speedup

program.md is intentionally thorough so the agent can run for 10+ hours without getting stuck. It includes a 6-level optimization playbook, a decision framework, crash handling, and Amdahl's-law logic.

                 profile.py              extract.py           bench.py (loop)         verify.py
Any PyTorch  ──>  Rank kernels  ──>  Generate baseline  ──>  Optimize each  ──>  End-to-end
   model          by GPU time       Triton kernels          kernel (agent)       verification

| tool | what it does |
| --- | --- |
| profile.py | Profiles any PyTorch model with torch.profiler; ranks kernels by GPU time and classifies each as compute- or memory-bound |
| extract.py | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| orchestrate.py | Multi-kernel scheduler: uses Amdahl's law to decide which kernel to optimize next and tracks overall progress |
| bench.py | Fixed benchmark: 5-stage correctness check (smoke test, size sweep, numerical stability, determinism, edge cases) plus performance and roofline ceiling |
| verify.py | Plugs optimized kernels back into the model, checks end-to-end correctness, and reports total speedup |

9 kernel types covering the main operations of modern deep learning:

| kernel | description | key metric |
| --- | --- | --- |
| matmul | Dense matrix multiplication, (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel, numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross-entropy loss | GB/s |
| rotary_embedding | Rotary position embedding (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |

Each has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
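For illustration, the kind of numerically stable reference that reference.py provides for softmax can be sketched in plain Python. This is a sketch, not the shipped implementation:

```python
import math

def softmax_reference(row):
    """Numerically stable softmax for one row: subtract the row
    maximum before exponentiating so exp() never overflows, even
    for inputs like [1000.0, 1000.0]."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

The max-subtraction trick is the same one the GB/s-bound Triton kernel must preserve while fusing the row reduction and the normalization into one pass over memory.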

Self-contained model definitions ship with AutoKernel (no transformers library required):

| model | file | parameters | usage |
| --- | --- | --- | --- |
| GPT-2 small | models/gpt2.py | 124M | --class-name GPT2 --input-shape 1,1024 |
| LLaMA (compact) | models/llama_7b.py | 160M | --class-name LlamaModel --input-shape 1,512 |
| LLaMA 7B | models/llama_7b.py | 7B | --class-name LlamaModel7B --input-shape 1,2048 |
| BERT base | models/bert_base.py | 110M | --class-name BertModel --input-shape 8,512 |
| custom | models/custom.py | | Template for your own model |

For HuggingFace models (uv sync --extra models):

uv run profile.py --module transformers --class-name AutoModelForCausalLM \
 --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
autokernel/
  kernel.py             the file the agent modifies (one kernel at a time)
  program.md            agent instructions -- the "research org code"

  bench.py              fixed benchmark + 5-stage correctness harness
  reference.py          PyTorch reference implementations (ground truth)
  prepare.py            one-time setup: test data, baselines

  profile.py            profile any PyTorch model, rank kernels by GPU time
  extract.py            extract bottleneck kernels into workspace/
  orchestrate.py        multi-kernel scheduler (Amdahl's law)
  verify.py             end-to-end model verification + speedup report
  analysis.py           experiment visualization (generates progress.png)

  kernels/              starter Triton kernels (9 types)
  models/               self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/            runtime artifacts (gitignored)

Why Triton? Readable, Python-like syntax that the agent can understand and modify without writing PTX or SASS. A well-tuned Triton kernel routinely reaches 80-95% of cuBLAS. The agent needs to iterate fast: Triton compiles in seconds, not minutes.

Accuracy first. The benchmark checks the kernel's output against PyTorch before measuring performance. A fast but incorrect kernel is reverted immediately. This prevents the agent from "optimizing" by generating garbage.
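A minimal sketch of such an accuracy gate, assuming a simple elementwise tolerance check (bench.py's actual harness has five stages; the function name and tolerances here are illustrative):

```python
def accept(candidate_out, reference_out, atol=1e-2, rtol=1e-2):
    """Accuracy gate: a candidate kernel's output must match the
    PyTorch reference within absolute + relative tolerance before
    its timing is even recorded. Tolerances are hypothetical
    defaults; fp16 kernels need looser bounds than fp32."""
    for c, r in zip(candidate_out, reference_out):
        if abs(c - r) > atol + rtol * abs(r):
            return False  # fast-but-wrong: revert, do not benchmark
    return True
```

Gating on correctness first means a broken kernel costs one failed experiment rather than a misleading speedup entry in the log.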

Amdahl’s law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a kernel that takes 60% of runtime (1.25x end-to-end) beats a 3x speedup on a kernel that takes 5% (1.03x end-to-end). It moves on when returns diminish.
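The arithmetic follows directly from Amdahl's law: a kernel taking fraction f of runtime, sped up by factor s, yields an end-to-end speedup of 1 / ((1 - f) + f / s). A quick check of the numbers above:

```python
def amdahl(f, s):
    """End-to-end speedup when a fraction f of total runtime
    is accelerated by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / s)

# 1.5x on a kernel taking 60% of runtime beats 3x on one taking 5%:
# amdahl(0.60, 1.5) = 1.25, amdahl(0.05, 3.0) ~ 1.03
```

This is why the scheduler will happily abandon a kernel with headroom left if a heavier kernel promises more total wall-clock savings.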

Single file to modify. The agent touches only kernel.py. The scope stays manageable, the diffs stay reviewable, the history stays clean.

TSV logging. Results go to a flat results.tsv file. Human-readable, git-friendly, trivially parsable, no infrastructure.

Every experiment is logged to results.tsv (tab-separated):

| column | description |
| --- | --- |
| experiment | Sequential experiment number (0 = baseline) |
| tag | Short identifier |
| kernel_type | Which kernel (e.g., matmul) |
| throughput_tflops | Measured throughput (higher is better) |
| latency_us | Execution time in microseconds |
| pct_peak | Percentage of the GPU's theoretical peak |
| speedup_vs_pytorch | Speedup vs PyTorch/cuBLAS |
| correctness | pass, fail, timeout, or crash |
| peak_vram_mb | Peak GPU memory usage |
| description | What was attempted |
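"Trivially parsable" means the standard-library csv module reads the log directly. A sketch, assuming the column names documented above and lowercase status values (the function itself is illustrative, not shipped tooling):

```python
import csv

def best_experiments(path="results.tsv"):
    """Group results.tsv rows by kernel_type and keep the best
    passing speedup for each. Assumes tab-separated rows with a
    header line and a lowercase 'pass' correctness value."""
    best = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["correctness"] != "pass":
                continue  # failed/timed-out/crashed runs never count
            speedup = float(row["speedup_vs_pytorch"])
            kernel = row["kernel_type"]
            if speedup > best.get(kernel, 0.0):
                best[kernel] = speedup
    return best
```

Because the file is append-only plain text, the same log diffs cleanly in git and survives agent crashes mid-run.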

This project is autoresearch for GPU kernels, directly inspired by Andrej Karpathy's autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent could run hundreds of experiments overnight, systematically exploring a search space and logging every result. AutoKernel implements the same loop, an agent that edits a file, runs a fixed evaluation, and keeps or reverts, in the domain of GPU kernel optimization with Triton.

Created by the team behind Forge.

MIT


