AlphaEvolve: DeepMind's Evolutionary Coding Agent Explained
White Paper

Jake McCluskey

Source: AlphaEvolve, DeepMind (Google DeepMind, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

AlphaEvolve is DeepMind's evolutionary coding agent. It uses an ensemble of Gemini Flash and Gemini Pro to propose candidate programs, an automated evaluator to score them, and an evolutionary database to pick survivors for the next generation. The headline result: a new algorithm for 4x4 complex matrix multiplication using 48 scalar multiplications, down from the 49 needed when Strassen's 1969 algorithm is applied recursively, and the first improvement for this setting in 56 years. And it's not just a paper trick. It runs in real Google production: 0.7% of worldwide compute recovered through Borg data-center scheduling, a 23% speedup on a Gemini training kernel, up to a 32.5% speedup on a FlashAttention GPU kernel, and contributions to an upcoming TPU design.

The pattern it demonstrates: an LLM plus a verifier in an evolutionary loop can discover algorithms that humans had not found.

1. What it is

AlphaEvolve is a pipeline, not a single model. Four cooperating components run in a loop.

1.1. The prompt sampler

This piece assembles the next prompt by pulling high-scoring programs from the evolutionary database. It's basically an in-context "curriculum" that biases Gemini toward productive directions. Think of it as a contextual exploration policy encoded as prompt construction.
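To make the idea concrete, here is a minimal sketch of a prompt sampler. It is an illustration under stated assumptions, not AlphaEvolve's implementation: the `ScoredProgram` type, the `build_prompt` function, the prompt wording, and the worst-to-best ordering are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ScoredProgram:
    # Hypothetical record type: one stored program and its evaluator score.
    source: str
    score: float


def build_prompt(task: str, database: list[ScoredProgram], k: int = 3) -> str:
    """Assemble the next prompt from the k highest-scoring stored programs."""
    top = sorted(database, key=lambda p: p.score, reverse=True)[:k]
    # Assumption: list the selected programs worst to best, so the strongest
    # one sits closest to the instruction at the end of the prompt.
    examples = "\n\n".join(
        f"# score={p.score:.3f}\n{p.source}" for p in reversed(top)
    )
    return (
        f"Task: {task}\n\n"
        f"Previous attempts, worst to best:\n{examples}\n\n"
        "Write an improved program. Output only code."
    )


db = [
    ScoredProgram("def solve(x): return x", 0.2),
    ScoredProgram("def solve(x): return x * 2", 0.9),
    ScoredProgram("def solve(x): return x + x", 0.7),
    ScoredProgram("def solve(x): return 0", 0.0),
]
prompt = build_prompt("double the input", db, k=2)
print(prompt)
```

Only the two best programs reach the model here. The low scorers stay in the database but never enter the context, which is the "curriculum" effect described above.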

1.2. The LLM ensemble

Two Gemini variants in a speed-vs-depth split:

Model          Role
Gemini Flash   Breadth. Maximize variety of ideas explored.
Gemini Pro     Depth. High-quality, carefully reasoned variants.

Both produce candidate programs, actual code in Python or Verilog or whatever DSL the task uses.
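One simple way to get a breadth-versus-depth split is to route each request at random, sending most calls to the fast model and a minority to the deep one. The sketch below assumes exactly that. The source does not say how AlphaEvolve divides work between Flash and Pro, so the `deep_fraction` value, the `make_ensemble` helper, and the stand-in model functions are hypothetical.

```python
import random
from typing import Callable

Proposer = Callable[[str], str]


def make_ensemble(
    fast: Proposer, deep: Proposer, deep_fraction: float, rng: random.Random
) -> Proposer:
    """Return a proposer that sends roughly deep_fraction of calls to `deep`."""

    def propose(prompt: str) -> str:
        model = deep if rng.random() < deep_fraction else fast
        return model(prompt)

    return propose


# Stand-ins for real model calls; they tag the output so routing is visible.
def fast_model(prompt: str) -> str:
    return "fast:" + prompt


def deep_model(prompt: str) -> str:
    return "deep:" + prompt


ensemble = make_ensemble(
    fast_model, deep_model, deep_fraction=0.2, rng=random.Random(0)
)
outputs = [ensemble("p") for _ in range(1000)]
deep_share = sum(o.startswith("deep:") for o in outputs) / len(outputs)
print(f"share of calls routed to the deep model: {deep_share:.2f}")
```

Passing the random generator in explicitly keeps the routing reproducible, which helps when you compare runs.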

1.3. The automated evaluator

Every candidate is compiled, run, and scored against a metric. The metric is the entire game. If your metric is "number of scalar multiplications to compute a matrix product," the evaluator counts multiplications. If your metric is "Verilog circuit's gate count subject to functional correctness," it synthesizes and verifies.

Critical insight: the evaluator must be automated, cheap, and sound. You can't outsource it to a human because the loop evaluates far more programs than any human could review. Soundness matters too, since an incorrect program that sneaks past a weak test becomes an evolutionary winner and poisons its descendants.
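The sketch below illustrates both points on a toy problem: multiplying two complex numbers (a + bi)(c + di) with as few real multiplications as possible. It is a hypothetical evaluator, not the one AlphaEvolve uses. The multiplication count is static: it counts `*` operators in the source, so a multiplication inside a loop would be counted once. The names `count_multiplications` and `score` are assumptions.

```python
import ast
import random


def count_multiplications(source: str) -> int:
    """Count `*` operators in the source (static count, not per execution)."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.BinOp, ast.AugAssign)) and isinstance(
            node.op, ast.Mult
        ):
            count += 1
    return count


def score(source: str, test_cases) -> float:
    """Return 0.0 for any failure; otherwise reward fewer multiplications."""
    try:
        namespace = {}
        exec(source, namespace)  # untrusted code: sandbox this in real use
        solve = namespace["solve"]
        if not all(solve(*args) == expected for args, expected in test_cases):
            return 0.0
        return 1.0 / (1 + count_multiplications(source))
    except Exception:
        return 0.0


naive = "def solve(a, b, c, d):\n    return (a * c - b * d, a * d + b * c)\n"
gauss = (
    "def solve(a, b, c, d):\n"
    "    k1 = c * (a + b)\n"
    "    k2 = a * (d - c)\n"
    "    k3 = b * (c + d)\n"
    "    return (k1 - k3, k1 + k2)\n"
)
wrong = "def solve(a, b, c, d):\n    return (a * c, b * d)\n"

# A weak suite: one case that the incorrect program happens to pass.
weak_cases = [((1, 0, 1, 0), (1, 0))]

# A stronger suite: one fixed case plus random integer cases.
rng = random.Random(0)
strong_cases = [((1, 2, 3, 4), (-5, 10))]
for _ in range(50):
    a, b, c, d = (rng.randint(-9, 9) for _ in range(4))
    strong_cases.append(((a, b, c, d), (a * c - b * d, a * d + b * c)))

for name, src in [("naive", naive), ("gauss", gauss), ("wrong", wrong)]:
    print(name, score(src, weak_cases), score(src, strong_cases))
```

Under the weak suite the incorrect two-multiplication program outscores both correct ones, so it would be selected as a parent. Under the stronger suite it scores zero and the three-multiplication version wins.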

1.4. The evolutionary database

Stores scored programs and implements selection. New prompts get built from winners, losers get culled, and diversity is maintained so the search doesn't collapse onto a local optimum.
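A toy version of such a database might look like the sketch below. It is not AlphaEvolve's design: the `ProgramDatabase` class, the fixed capacity, and the "elites plus one wildcard" selection rule are assumptions chosen to show storage, culling, and a small amount of diversity in a few lines.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Entry:
    source: str
    score: float


@dataclass
class ProgramDatabase:
    capacity: int = 100
    entries: list[Entry] = field(default_factory=list)

    def add(self, source: str, score: float) -> None:
        """Store a scored program, then cull the lowest scorers over capacity."""
        self.entries.append(Entry(source, score))
        if len(self.entries) > self.capacity:
            self.entries.sort(key=lambda e: e.score, reverse=True)
            del self.entries[self.capacity:]

    def sample_parents(self, k: int, rng: random.Random) -> list[Entry]:
        """Return the top k-1 programs plus one random non-elite survivor."""
        ranked = sorted(self.entries, key=lambda e: e.score, reverse=True)
        elites = ranked[: max(k - 1, 0)]
        rest = ranked[len(elites):]
        # The wildcard keeps a weaker lineage in play so the search does not
        # collapse onto the current best program.
        wildcard = [rng.choice(rest)] if rest and k > 0 else []
        return elites + wildcard


db = ProgramDatabase(capacity=5)
for i in range(10):
    db.add(f"program_{i}", score=i / 10)

parents = db.sample_parents(k=3, rng=random.Random(0))
print([(p.source, p.score) for p in parents])
```

After ten inserts only the five best programs survive, and each parent set mixes the two top scorers with one randomly chosen survivor.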

1.5. The loop

    ┌───────────────────────────────────────────────┐
    │                                               │
    │   Prompt Sampler ──▶ Gemini Flash + Pro       │
    │        ▲                 │                    │
    │        │                 ▼                    │
    │        │         Candidate Programs           │
    │        │                 │                    │
    │        │                 ▼                    │
    │        │         Automated Evaluator          │
    │        │                 │                    │
    │        │                 ▼                    │
    │        └─── Evolutionary Database ◀───┐       │
    │                      │                │       │
    │                      └── select best ─┘       │
    │                                               │
    └───────────────────────────────────────────────┘

Repeat for thousands of generations. The LLMs propose; the evaluator disposes.

2. Why it matters

2.1. Strassen to 48 multiplications: the 56-year result

For decades, the best known way to multiply two 4x4 complex matrices was to apply Strassen's 1969 algorithm recursively, which takes 49 scalar multiplications. AlphaEvolve found a new algorithm that uses 48 scalar multiplications for this setting, an improvement over Strassen.
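For readers who want to see where the baseline comes from, the sketch below implements Strassen's 2x2 scheme, which uses 7 multiplications instead of the schoolbook 8. Treating a 4x4 matrix as a 2x2 grid of 2x2 blocks and applying the scheme at both levels costs 7 x 7 = 49 scalar multiplications, against 64 for the schoolbook method. This shows Strassen's construction only, not AlphaEvolve's 48-multiplication algorithm. The function names are illustrative.

```python
def naive_2x2(A, B):
    """Schoolbook 2x2 product. Returns (C, number of multiplications)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p = [
        a11 * b11, a12 * b21, a11 * b12, a12 * b22,
        a21 * b11, a22 * b21, a21 * b12, a22 * b22,
    ]
    C = ((p[0] + p[1], p[2] + p[3]), (p[4] + p[5], p[6] + p[7]))
    return C, len(p)


def strassen_2x2(A, B):
    """Strassen's 2x2 product. Returns (C, number of multiplications)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m = [
        (a11 + a22) * (b11 + b22),
        (a21 + a22) * b11,
        a11 * (b12 - b22),
        a22 * (b21 - b11),
        (a11 + a12) * b22,
        (a21 - a11) * (b11 + b12),
        (a12 - a22) * (b21 + b22),
    ]
    m1, m2, m3, m4, m5, m6, m7 = m
    C = ((m1 + m4 - m5 + m7, m3 + m5), (m2 + m4, m1 - m2 + m3 + m6))
    return C, len(m)


# Complex entries with small integer parts, so float arithmetic stays exact.
A = ((1 + 2j, 3 - 1j), (0 + 1j, 2 + 0j))
B = ((2 - 1j, 1 + 1j), (4 + 0j, -1 + 3j))

C_naive, n_naive = naive_2x2(A, B)
C_strassen, n_strassen = strassen_2x2(A, B)

print("results agree:", C_naive == C_strassen)
print("2x2 multiplications:", n_naive, "vs", n_strassen)
print("4x4 via two levels of Strassen:", n_strassen ** 2)
```

The scheme never relies on the entries commuting, which is why it still works when each entry is itself a 2x2 block.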

For non-ML readers: matrix multiplication sits at the bottom of basically all modern computing (graphics, simulation, neural networks, scientific computing). Any improvement compounds across every GPU kernel ever run.

2.2. AlphaEvolve eats AlphaTensor

DeepMind's 2022 AlphaTensor was a specialist matrix-multiplication discoverer. AlphaEvolve is a generalist: the same framework applies across many domains, and it found a better result on the 4x4 complex case, which AlphaTensor had not improved.

The lesson: specialist to generalist is the dominant trajectory in algorithm discovery. If you're building a specialist agent today, assume a generalist will be competitive with it within 18 months.

2.3. Production, not just benchmarks

Here's what makes AlphaEvolve different from every "LLM discovered a new algorithm" paper before it: it's in production at Google.

  • Borg scheduler heuristic: in production for over a year, recovering 0.7% of Google's worldwide compute on average. That's a genuinely large number in dollar terms.
  • Gemini training kernel: 23% speedup on a critical matmul kernel, contributing to a 1% reduction in Gemini's total training time.
  • FlashAttention GPU kernel: up to a 32.5% speedup on the Transformer attention kernel.
  • Verilog for a next-gen TPU: AlphaEvolve simplified an arithmetic circuit with verified functional correctness. The change was integrated into an upcoming TPU design.

2.4. 75% and 20% on open problems

On a suite of 50+ open problems in math (analysis, geometry, combinatorics, number theory):

  • About 75% of the time, AlphaEvolve rediscovered the known state of the art.
  • About 20% of the time, AlphaEvolve improved on the known state of the art.
  • Setup for each problem took "hours," not weeks.

That's the shortest turnaround from problem setup to result that we've seen from an LLM system on math research problems.

2.5. The agentic pattern is dead simple

Distill AlphaEvolve down and you've got three primitives:

  1. LLM proposes variants
  2. Cheap automated verifier scores them
  3. Evolutionary selection concentrates compute on promising branches

That's the minimum viable discovery agent. It ports to any domain where (a) you can express the output as code and (b) you can automatically verify it.

3. How to do it

3.1. Minimal AlphaEvolve-style pipeline (Claude version)

import asyncio
import re
from heapq import nlargest

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# ---------- 1. The evaluator ----------
def evaluate(program: str, test_cases: list[tuple]) -> float:
    """Exec the program, call its `solve` on each test case, return the pass rate."""
    namespace = {}
    try:
        # WARNING: this executes untrusted model output. Sandbox it in real use.
        exec(program, namespace)
        fn = namespace["solve"]
        passed = sum(1 for inp, expected in test_cases if fn(*inp) == expected)
        return passed / len(test_cases)
    except Exception:
        # Syntax errors, a missing `solve`, and runtime failures all score zero.
        return 0.0

# ---------- 2. The proposer (LLM) ----------
FENCE = "`" * 3  # markdown code-fence marker

def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = re.search(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1) if match else text

async def propose(parents: list[dict], task: str) -> str:
    context = "\n\n".join(
        f"# Score: {p['score']:.3f}\n{p['program']}" for p in parents
    ) or "(none yet)"
    msg = await client.messages.create(
        model="claude-opus-4-7",  # "Pro" role; use a model ID your account can access
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\n"
                       f"Here are previous attempts and their scores:\n{context}\n\n"
                       f"Write a NEW Python program defining `solve(...)` that scores higher. "
                       f"Only output the code.",
        }],
    )
    # Models often wrap code in a markdown fence, which would break exec().
    return extract_code(msg.content[0].text)

# ---------- 3. The evolutionary loop ----------
async def evolve(task, test_cases, generations=50, population=20, topk=5):
    population_db = []
    best = None
    for gen in range(generations):
        # Pick parents: plain top-k by score (empty in the first generation).
        # There is no diversity mechanism yet; see the upgrades listed below.
        parents = nlargest(topk, population_db, key=lambda p: p["score"])
        proposals = await asyncio.gather(*[
            propose(parents, task) for _ in range(population)
        ])
        for prog in proposals:
            score = evaluate(prog, test_cases)
            population_db.append({"program": prog, "score": score})
        best = max(population_db, key=lambda p: p["score"])
        print(f"Gen {gen}: best = {best['score']:.3f}")
    return best

# ---------- 4. Example usage ----------
# Note: this toy evaluator checks outputs only. It does not enforce the
# multiplication budget stated in the task; a real evaluator must.
if __name__ == "__main__":
    asyncio.run(evolve(
        task="Implement `solve(a, b)` that returns a*b "
             "but uses at most 2 multiplications.",
        test_cases=[((3, 7), 21), ((-4, 2), -8), ((0, 99), 0)],
    ))

That's the whole pattern. You can and should upgrade it:

  • Replace evaluate with your domain verifier (compile-time checker, symbolic prover, empirical benchmark).
  • Replace Opus-only with Opus + Haiku ensemble (Haiku as the "Flash" variant).
  • Add diversity preservation: pick parents from clusters, not just top-K.
  • Add program-level crossover: prompt for "combine parts of A and B."

3.2. Domains where this pattern works today

  • Algorithm design: matrix multiplication, scheduling heuristics, sorting variants.
  • Kernel optimization: CUDA, Triton, Metal, TPU kernels with benchmarked throughput as the score.
  • Theorem proving / formal verification: Lean/Coq proof search with typecheck as the verifier.
  • Program synthesis: any domain with unit tests. Your test suite is the evaluator.
  • Chip/RTL design: Verilog plus functional-verification suite.
  • Scientific hypothesis generation: when "does this hypothesis match experimental data" is cheap to check.

3.3. Domains where it struggles

  • Open-ended creative tasks: the score function is subjective; LLM-as-judge is too noisy for evolutionary pressure.
  • Long-horizon tasks: many generations times an expensive evaluator equals prohibitive cost.
  • Domains where you can't afford full verification: an incorrect winner corrupts all descendants.

3.4. Production-scale considerations

Running AlphaEvolve-style pipelines at scale needs:

  • A distributed evaluator: evaluator throughput is the bottleneck; parallelize aggressively.
  • A caching layer: many proposals will be near-duplicates; cache eval results.
  • Cost caps: bound total compute per generation; LLM calls compound.
  • Diversity tracking: population entropy metrics to catch premature convergence.
  • Program-level deduplication: normalize AST before comparing.
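The caching and deduplication points can be combined in a short sketch: key the cache on a hash of the parsed AST, so programs that differ only in whitespace or comments share one evaluation. The `program_key` function and `CachedEvaluator` class are hypothetical names. This normalization is shallow, so renamed variables or reordered statements still count as different programs.

```python
import ast
import hashlib


def program_key(source: str) -> str:
    """Hash the parsed AST, so formatting and comments do not affect the key."""
    tree = ast.parse(source)
    # ast.dump omits line and column attributes by default.
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


class CachedEvaluator:
    """Wrap an evaluator so each distinct AST is scored only once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, source: str) -> float:
        try:
            key = program_key(source)
        except (SyntaxError, ValueError):
            return 0.0  # unparsable programs score zero without being run
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._evaluate(source)
        return self._cache[key]


a = "def solve(x):\n    return x * 2\n"
b = "def solve(x):   # double it\n\n    return x*2\n"
c = "def solve(x):\n    return x + x\n"

evaluator = CachedEvaluator(lambda src: 1.0)  # stand-in for a real evaluator
scores = [evaluator(s) for s in (a, b, c, a)]
print("hits:", evaluator.hits, "misses:", evaluator.misses)
```

Here four submissions trigger only two real evaluations. The stand-in evaluator returns a constant, so swap in your own scoring function.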

4. What AlphaEvolve tells the rest of us

4.1. LLMs plus verifiers beat LLMs alone

The headline finding of agent research in 2024 to 2025 is that an LLM paired with a cheap, correct verifier can solve problems it cannot solve on its own. AlphaEvolve is the strongest public example.

4.2. Evolutionary selection beats linear sampling

Given a fixed compute budget, selecting from a population of programs and iterating beats one-shot sampling of N programs from the same LLM. It's the classical AlphaGo lesson applied to code.

4.3. The evaluator is the moat

Two teams with access to Gemini differ only by the quality of their evaluators. For algorithm discovery, that means:

  • Formal verifiers beat unit tests beat LLM judges beat humans.
  • Automated benchmarking infrastructure beats ad-hoc scripts.
  • Bet your eng hours on evaluator quality, not prompt engineering.

4.4. Hybrid ensembles beat monolithic models

Flash (fast, diverse) plus Pro (slow, deep) outperforms either alone. The pattern generalizes. Expect to see Haiku plus Opus pairs, gpt-5-main plus gpt-5-thinking pairs, and hybrid open-plus-closed ensembles for cost-sensitive discovery tasks.

5. Key takeaways

  • The minimum agentic discovery system is: LLM plus verifier plus evolutionary loop. You can build this in a weekend.
  • Your verifier is your moat. Invest there before you invest in prompts.
  • Ensembles of fast plus slow models work better than one big model. Use Flash-class for breadth, Pro-class for depth.
  • It already ships. 0.7% of Google's compute recovered. A 23% speedup on a Gemini training kernel, worth a 1% cut in total training time. This isn't futurism.
  • Spec to code to verify is a universal template. Every domain with a sound verifier is a candidate application.

Frequently asked questions

What is the main architecture of AlphaEvolve and how does it work?

AlphaEvolve is a pipeline with four cooperating components: a prompt sampler that pulls high-scoring programs from the evolutionary database, an LLM ensemble that uses Gemini Flash for breadth and Gemini Pro for depth to generate candidate programs, an automated evaluator that compiles, runs, and scores each program against a specific metric, and an evolutionary database that stores the scored programs and selects survivors. The system runs in a loop where new prompts are built from winning programs, losers are culled, and diversity is maintained across thousands of generations.

What concrete production results has AlphaEvolve achieved at Google?

AlphaEvolve has delivered multiple production improvements at Google: a Borg scheduler heuristic that has been running for over a year and recovers 0.7 percent of Google's worldwide compute on average, a 23 percent speedup on a critical Gemini training kernel contributing to a 1 percent reduction in total Gemini training time, up to a 32.5 percent speedup on a FlashAttention GPU kernel, and a simplified Verilog arithmetic circuit that was integrated into an upcoming TPU design.

How does AlphaEvolve compare to DeepMind's earlier AlphaTensor system?

AlphaEvolve is a generalist system that uses the same framework across many domains, while AlphaTensor from 2022 was a specialist focused only on matrix multiplication discovery. AlphaEvolve found a strictly better result on the 4x4 complex matrix multiplication case that AlphaTensor could not achieve, discovering a new algorithm using 48 scalar multiplications and improving on Strassen's 1969 result for the first time in 56 years.

What are the minimum requirements to build an AlphaEvolve-style discovery system?

The minimum viable discovery agent requires three primitives: an LLM to propose program variants, a cheap automated verifier to score them, and evolutionary selection to concentrate compute on promising branches. This pattern works in any domain where you can express the output as code and automatically verify it, with the evaluator quality being more important than prompt engineering for achieving good results.

What was AlphaEvolve's success rate on open mathematical problems?

On a suite of over 50 open problems spanning analysis, geometry, combinatorics, and number theory, AlphaEvolve rediscovered the known state of the art about 75 percent of the time and improved on the known state of the art about 20 percent of the time. Setup for each problem took hours rather than weeks, the shortest turnaround from problem setup to result seen from an LLM system on math research problems.
