How do we teach machines to discover algorithms? Traditional approaches rely on hand-crafted heuristics, exhaustive search, or gradient-based optimization. But what if we could harness the creative potential of large language models (LLMs) within an evolutionary framework?
OpenEvolve is an open-source evolutionary coding agent that integrates large language models into a quality-diversity search framework for algorithm discovery. Candidate programs are generated through LLM-guided editing (diff-based by default), evaluated with user-defined metrics, and organized using MAP-Elites, while an island model with migration supports parallel, diverse exploration. The evaluation pipeline supports cascade staging and an artifact side channel that feeds execution traces and errors back into subsequent prompts; optional LLM-based feedback can be incorporated into the scoring.
OpenEvolve has been applied across many domains, including system optimization, scientific discovery, geospatial algorithms, scaling-law discovery, GPU kernel optimization, prompt optimization, and more.
Architectural overview
Figure 1: The OpenEvolve architecture, showing the five interconnected components of the evolution loop
Evolution loop
- Prompt Sampler: constructs context-rich prompts by selecting a parent program from the current island and curating an inspiration set (top performers, lineage ancestors, diverse elites from feature bins, and random samples). Prompts include the parent code, evaluation metrics, MAP-Elites feature coordinates, evolution history, and (optionally) execution artifacts. Template selection supports diff-based editing by default or full rewrites, with controlled stochasticity.
- LLM Ensemble: generates candidate code using a weighted ensemble of OpenAI-compatible models (deterministic under a fixed seed). In standard mode, a model is sampled by weight; in model-per-island mode, each island uses a fixed model. Responses drive either diff-based editing (search/replace blocks) or full rewrites (JSON/code-block extraction), with generation parameters drawn from the configuration.
- Evaluator: executes the user-supplied evaluate(program_path) with timeout and retry; alternatively, cascade evaluation (evaluate_stage1/2/3) filters out weak candidates early. It can incorporate LLM-based feedback into metrics and captures artifacts (e.g., stderr, tracebacks) for inclusion in later prompts. Parallel evaluations run through an internal task pool.
- Program Database: implements MAP-Elites per island, binning programs along configurable feature dimensions (defaults include complexity and diversity; custom dimensions are derived from evaluator metrics). A new candidate replaces a cell's occupant when its fitness improves on it (combined_score if provided, otherwise a safe average over numerical metrics, excluding feature dimensions). The database enforces population limits, tracks global bests, logs prompts, supports migration, and maintains checkpoints.
- Controller: orchestrates the loop, including seeding, logging, and prompt/evaluator initialization, with process-based parallel execution. It schedules iterations across islands, manages checkpointing and resuming, enforces early-stopping and target-score criteria, stores artifacts, and writes the best discovered program and its metadata to the output directory.
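A toy version of this loop, with a random mutator standing in for the LLM so the example stays self-contained (purely illustrative; the real OpenEvolve API and data structures differ), might look like:

```python
# Toy evolution loop: sample a parent, mutate it, evaluate, insert into the
# population. A random perturbation stands in for LLM-guided editing.
import random

random.seed(0)

def evaluate(program):
    # Toy fitness: negative distance of the coefficient from the target 10.
    return -abs(program["coeff"] - 10)

def mutate(parent):
    # Stand-in for LLM-guided editing: perturb the parent slightly.
    return {"coeff": parent["coeff"] + random.uniform(-1, 1)}

# One island: a small population acting as the program database.
population = [{"coeff": random.uniform(0, 5)} for _ in range(4)]

for _ in range(200):
    # Parent selection is fitness-biased (binary tournament).
    parent = max(random.sample(population, 2), key=evaluate)
    child = mutate(parent)
    # Insert the child if it improves on the current worst program.
    worst = min(range(len(population)), key=lambda i: evaluate(population[i]))
    if evaluate(child) > evaluate(population[worst]):
        population[worst] = child

best = max(population, key=evaluate)
```

In OpenEvolve the mutation step is an LLM editing real source code and the database is a MAP-Elites grid per island, but the control flow follows this shape.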
Major algorithmic innovations
Island-based evolution with lazy migration
OpenEvolve maintains multiple isolated populations (islands) that evolve independently, reducing premature convergence and enabling parallel exploration. Migration is lazy and event-driven: an island migrates when its per-island generation counter reaches a configured interval, rather than on a wall-clock schedule. Migration follows a ring topology by default (random migration is optional), copying a fraction of the top programs to the neighboring island while skipping programs whose code already exists there.
# Configuration example
database:
  num_islands: 5
  migration_interval: 20  # generations, not iterations
  migration_rate: 0.1     # 10% of top programs migrate
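The migration step itself can be sketched as follows (assumed logic for illustration; the field names `score` and `code` are hypothetical, not the library's internals):

```python
# Ring-topology migration: each island sends copies of its top programs to
# the next island, skipping exact duplicates at the destination.
def migrate(islands, migration_rate=0.1):
    n = len(islands)
    for i, island in enumerate(islands):
        k = max(1, int(len(island) * migration_rate))
        # Select the top-k programs by fitness (higher score is better).
        migrants = sorted(island, key=lambda p: p["score"], reverse=True)[:k]
        dest = islands[(i + 1) % n]  # ring topology: island i -> island i+1
        for m in migrants:
            # Avoid duplicating code already present in the destination.
            if all(m["code"] != p["code"] for p in dest):
                dest.append(dict(m))

islands = [
    [{"code": "a", "score": 3}, {"code": "b", "score": 1}],
    [{"code": "c", "score": 2}],
]
migrate(islands)
```

Because migrants are copied rather than moved, a strong program keeps competing on its home island while seeding its neighbor.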
MAP-Elites for diversity preservation
Each island maintains a MAP-Elites grid over configurable feature dimensions (the defaults include complexity and diversity; additional dimensions can be supplied by the evaluator). A candidate occupies a cell, or replaces its occupant, when it improves fitness (combined_score if provided, otherwise a safe average over numerical metrics, excluding feature dimensions). This enforces one elite per cell and preserves quality-diversity. The system also avoids exact duplicates (for example, during migration) and measures diversity structurally (for example, via edit distance) rather than relying on code embeddings.
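The per-cell replacement rule can be sketched minimally, assuming two features normalized to [0, 1) and binned into a 10x10 grid (the feature names and dict layout here are our own illustration):

```python
# Toy MAP-Elites grid: one elite per (complexity, diversity) cell.
def feature_coords(program, bins=10):
    # Discretize each normalized feature value into a bin index.
    return (
        min(int(program["complexity"] * bins), bins - 1),
        min(int(program["diversity"] * bins), bins - 1),
    )

def add_to_grid(grid, program):
    """Insert a program, replacing the cell occupant only on fitness gain."""
    cell = feature_coords(program)
    incumbent = grid.get(cell)
    if incumbent is None or program["fitness"] > incumbent["fitness"]:
        grid[cell] = program
        return True
    return False

grid = {}
add_to_grid(grid, {"complexity": 0.25, "diversity": 0.50, "fitness": 1.0})
# Lands in the same cell and wins on fitness, so it replaces the incumbent.
improved = add_to_grid(grid, {"complexity": 0.27, "diversity": 0.51, "fitness": 2.0})
```

Holding one elite per cell is what keeps mediocre-but-different programs alive: they compete only within their own feature bin, not against the global best.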
Cascade evaluation
Evaluation proceeds in stages with configurable thresholds. If cascade functions are provided, Stage 1 performs a quick validity check (for example, import/execute), Stage 2 runs lightweight tests, and Stage 3 runs comprehensive benchmarks. Candidates must meet each stage's threshold to proceed. Timeouts and exceptions are captured as artifacts and can be fed back into subsequent prompts. When cascade functions are not defined, evaluation falls back to a single-step evaluate(program_path) with timeout and retry.
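The staged control flow can be sketched as follows (an assumed simplification; the real evaluator invokes user-defined evaluate_stage1/2/3 functions, and the stage functions below are toy stand-ins):

```python
# Cascade evaluation: run stages in order, stopping early when a stage
# score misses its threshold so weak candidates never reach the expensive
# final benchmark.
def cascade_evaluate(program, stages, thresholds):
    metrics = {}
    for i, (stage, threshold) in enumerate(zip(stages, thresholds), start=1):
        score = stage(program)
        metrics[f"stage{i}_score"] = score
        if score < threshold:
            metrics["failed_stage"] = i
            break  # filtered out; later stages are skipped
    return metrics

stages = [
    lambda p: 1.0 if "def" in p else 0.0,  # stage 1: does it look runnable?
    lambda p: 0.4,                         # stage 2: lightweight tests (toy)
    lambda p: 0.9,                         # stage 3: full benchmark (toy)
]
result = cascade_evaluate("def solve(x): return x", stages, [0.5, 0.5, 0.0])
# This candidate passes stage 1 but fails stage 2, so stage 3 never runs.
```

The payoff is cost-shaped search: most LLM-generated candidates die in a cheap stage, so the comprehensive benchmark only runs on plausible programs.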
Dual selection strategy
Parent selection is biased toward high-fitness programs, while the inspiration material shown to the LLM is drawn from complementary sources (top programs, lineage ancestors, diverse elites from feature bins, and random samples). This separation encourages improvements guided by the current best while maintaining exploration pressure through diverse examples, implemented through prompt construction rather than direct recombination.
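A rough sketch of the two selection paths, with hypothetical helper names (not OpenEvolve's API):

```python
# Dual selection: the parent is chosen by fitness (exploitation), while the
# inspiration set mixes elites and random programs (exploration). Inspirations
# only enter the prompt as context; they are never recombined directly.
import random

random.seed(1)

def select_parent(population):
    # Tournament selection biased toward high fitness.
    return max(random.sample(population, 3), key=lambda p: p["fitness"])

def select_inspirations(population, parent, n_top=2, n_random=2):
    ranked = sorted(population, key=lambda p: p["fitness"], reverse=True)
    candidates = ranked[:n_top] + random.sample(population, n_random)
    # Deduplicate and drop the parent, preserving order.
    seen, out = set(), []
    for p in candidates:
        if p["id"] not in seen and p["id"] != parent["id"]:
            seen.add(p["id"])
            out.append(p)
    return out

population = [{"id": i, "fitness": i * 0.1} for i in range(10)]
parent = select_parent(population)
inspirations = select_inspirations(population, parent)
```

Keeping the two draws independent is the point: the parent anchors the edit to something already good, while the inspiration set keeps unusual solutions visible to the LLM.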
Sample Use Cases
Example 1: Algorithmic Search
On the AlgoTune benchmark, OpenEvolve discovered algorithms that achieved dramatic speedups through automatic optimization:
Figure 2: Algorithmic search results showing dramatic speedups on the AlgoTune benchmark
Major breakthroughs include JAX JIT compilation (321x), FFT-based convolution (256x), and automated discovery of optimized graph algorithms (95.78x). The system evolved from simple iterative implementations to sophisticated numerical-computing patterns without human intervention. For a more detailed analysis, see Open Evolutionary Agents.
Example 2: Circle Packing
OpenEvolve matched the state-of-the-art result (a 2.634 sum of radii for n=26), evolving from simple geometric constructions to exploring scipy.optimize with SLSQP – a completely different algorithmic approach from the initial solution.
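To give a flavor of that final approach, here is a toy SLSQP formulation for the much simpler n=2 case (our own illustration of the general technique, not the evolved program):

```python
# Pack 2 circles in the unit square, maximizing the sum of radii with SLSQP.
# Variables: v = [x1, y1, r1, x2, y2, r2].
import math
from scipy.optimize import minimize

def neg_sum_radii(v):
    return -(v[2] + v[5])  # minimize the negative => maximize r1 + r2

def constraints():
    cons = []
    for i in (0, 3):  # containment: each circle stays inside the square
        cons += [
            {"type": "ineq", "fun": lambda v, i=i: v[i] - v[i + 2]},
            {"type": "ineq", "fun": lambda v, i=i: 1 - v[i] - v[i + 2]},
            {"type": "ineq", "fun": lambda v, i=i: v[i + 1] - v[i + 2]},
            {"type": "ineq", "fun": lambda v, i=i: 1 - v[i + 1] - v[i + 2]},
        ]
    # Non-overlap: center distance must be at least r1 + r2.
    cons.append({"type": "ineq",
                 "fun": lambda v: math.hypot(v[0] - v[3], v[1] - v[4])
                                  - v[2] - v[5]})
    return cons

x0 = [0.3, 0.3, 0.1, 0.7, 0.7, 0.1]  # feasible starting guess
res = minimize(neg_sum_radii, x0, method="SLSQP", constraints=constraints())
best_sum = -res.fun  # optimum for n=2 is (2 - sqrt(2)) ~ 0.586
```

Framing packing as constrained optimization is what made the jump possible: the evolved program stopped placing circles geometrically and let the solver push radii against the constraints.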
Example 3: GPU kernel optimization
Evolving Metal GPU kernels for transformer attention on Apple Silicon:
Figure 3: GPU kernel performance improvements for transformer attention on Apple Silicon
OpenEvolve discovered several non-obvious optimizations:
- 8-element SIMD vectorization matching Apple Silicon's hardware vector width
- Two-pass online softmax reducing memory bandwidth
- GQA-specific memory layout exploiting the grouped-query head structure
These optimizations maintain 100% numerical accuracy while achieving measurable performance improvements across various inference scenarios. For more details, see GPU Kernel Discovery.
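The two-pass online softmax trick generalizes beyond Metal; a plain-Python sketch of the idea (our own illustration, not the evolved kernel) looks like:

```python
# Two-pass softmax: pass 1 fuses the max and the normalizer into one sweep
# using online rescaling, so the naive three passes over memory become two.
import math

def two_pass_softmax(xs):
    # Pass 1: running max m and running sum s of exp(x - m).
    m, s = float("-inf"), 0.0
    for x in xs:
        if x > m:
            s = s * math.exp(m - x) + 1.0  # rescale the sum to the new max
            m = x
        else:
            s += math.exp(x - m)
    # Pass 2: normalize.
    return [math.exp(x - m) / s for x in xs]

probs = two_pass_softmax([1.0, 2.0, 3.0])
```

On a GPU the saving is one fewer full read of the attention scores, which is exactly where a memory-bandwidth-bound kernel gains.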
Example 4: LLM Prompt Optimization
Beyond code, OpenEvolve can evolve prompts themselves:
Figure 4: Prompt optimization results on the GEPA benchmark
On the GEPA benchmark, the evolved prompts achieved +10.69% accuracy on HotpotQA (multi-hop reasoning) and +6.42% overall accuracy across multiple benchmarks. This demonstrates OpenEvolve's versatility – the same evolutionary framework optimizes both code and natural language.
Figure 5: Performance improvements across generations, showing the cumulative benefits of extended evolution
Getting started
OpenEvolve provides both a library and a command-line interface:
from openevolve import run_evolution

result = run_evolution(
    initial_program="def solve(x): return x * 2",
    evaluator=lambda path: {"score": benchmark(path)},
    iterations=100,
)
For complex configurations, use YAML files specifying LLM models, development strategies, and evaluation parameters. OpenEvolve supports checkpoint/resume for long-running experiments and parallel evaluation across multiple cores. OpenEvolve is open-source and available on GitHub.