AlphaEvolve: DeepMind's Evolutionary Coding Agent Explained

Source: AlphaEvolve: A coding agent for scientific and algorithmic discovery (Google DeepMind, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
AlphaEvolve is DeepMind's evolutionary coding agent. It uses Gemini Flash and Gemini Pro as an ensemble to propose candidate programs, an automated evaluator to score them, and an evolutionary database to pick survivors for the next generation. The headline result: a new algorithm for 4x4 complex matrix multiplication using 48 scalar multiplications, beating the 49 of Strassen's 1969 recursive construction, the first improvement for this setting in 56 years. And it's not just a paper trick; it runs in Google production: 0.7% of worldwide compute recovered through Borg data-center scheduling, a 23% speedup on a Gemini training kernel, a 32.5% speedup on a FlashAttention GPU kernel, and contributions to an upcoming TPU design.
The pattern it proves: LLM plus verifier in an evolutionary loop can discover algorithms humans couldn't.
1. What it is
AlphaEvolve is a pipeline, not a single model. Four cooperating components, tied together by a loop.
1.1. The prompt sampler
This piece assembles the next prompt by pulling high-scoring programs from the evolutionary database. It's basically an in-context "curriculum" that biases Gemini toward productive directions. Think of it as a contextual exploration policy encoded as prompt construction.
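A minimal sketch of that policy, assuming scored programs are kept as dicts the way the pipeline in section 3.1 does: sample prompt parents with score-weighted probability, so winners dominate the context without freezing out the rest of the population.

import random

def sample_prompt_parents(population_db: list[dict], k: int = 3) -> list[dict]:
    """Score-weighted sampling: high scorers appear in prompts more often."""
    weights = [max(p["score"], 0.01) for p in population_db]  # floor keeps losers in play
    return random.choices(population_db, weights=weights, k=k)

def build_prompt(task: str, parents: list[dict]) -> str:
    """Assemble the next prompt from sampled parents and their scores."""
    context = "\n\n".join(f"# Score: {p['score']:.3f}\n{p['program']}" for p in parents)
    return (f"Task: {task}\n\nPrevious attempts and their scores:\n{context}\n\n"
            f"Propose an improved program.")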
1.2. The LLM ensemble
Two Gemini variants in a speed-vs-depth split:
| Model | Role |
|---|---|
| Gemini Flash | Breadth. Maximize variety of ideas explored. |
| Gemini Pro | Depth. High-quality, carefully reasoned variants. |
Both produce candidate programs: actual code in Python, Verilog, or whatever DSL the task uses.
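A rough sketch of that speed-vs-depth split, assuming Anthropic models in the two roles (the model ids here are examples; substitute whatever fast/deep pair you have): route most proposal calls to the fast model and a minority to the deep one.

import random
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def ensemble_propose(prompt: str, deep_fraction: float = 0.2) -> str:
    """Breadth/depth split: mostly fast, diverse drafts; occasionally a careful one."""
    model = ("claude-opus-4-20250514" if random.random() < deep_fraction  # "Pro" role
             else "claude-3-5-haiku-20241022")                            # "Flash" role
    msg = await client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text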
1.3. The automated evaluator
Every candidate is compiled, run, and scored against a metric. The metric is the entire game. If your metric is "number of scalar multiplications to compute a matrix product," the evaluator counts multiplications. If your metric is "Verilog circuit's gate count subject to functional correctness," it synthesizes and verifies.
Critical insight: the evaluator must be automated, cheap, and sound. You can't outsource it to a human because the loop runs millions of programs. Soundness matters too, since an incorrect program that sneaks past a weak test becomes an evolutionary winner and poisons the descendants.
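To make "automated, cheap, and sound" concrete, here is a minimal sketch for the scalar-multiplication metric above, assuming candidates define a `solve()` function as in section 3.1. Correctness gates the score, so a wrong-but-cheap program can never win; the multiplication count here is a crude static one over the AST (it undercounts loops; a real evaluator would instrument execution).

import ast

def count_multiplications(program: str) -> int:
    """Crude static count of `*` operations in the candidate's source."""
    return sum(isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult)
               for node in ast.walk(ast.parse(program)))

def score(program: str, test_cases: list[tuple]) -> float:
    """Soundness first: incorrect programs score 0; then fewer multiplications wins."""
    namespace = {}
    try:
        exec(program, namespace)  # sandbox this in real use
        fn = namespace["solve"]
        if any(fn(*inp) != expected for inp, expected in test_cases):
            return 0.0
    except Exception:
        return 0.0
    return 1.0 / (1 + count_multiplications(program))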
1.4. The evolutionary database
Stores scored programs and implements selection. New prompts get built from winners, losers get culled, and diversity is maintained so the search doesn't collapse onto a local optimum.
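A toy version of that database, assuming the same scored-program dicts used elsewhere in this post; diversity maintenance is omitted here (one simple approach is sketched in section 3.1).

from heapq import nlargest

class EvolutionaryDB:
    """Bounded population store: add scored programs, cull the weakest, select parents."""
    def __init__(self, capacity: int = 200):
        self.capacity = capacity
        self.programs: list[dict] = []

    def add(self, program: str, score: float) -> None:
        self.programs.append({"program": program, "score": score})
        if len(self.programs) > self.capacity:
            self.programs.sort(key=lambda p: p["score"], reverse=True)
            del self.programs[self.capacity:]  # cull losers

    def select(self, k: int = 5) -> list[dict]:
        """Winners seed the next round of prompts."""
        return nlargest(k, self.programs, key=lambda p: p["score"])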
1.5. The loop
Prompt Sampler ──▶ Gemini Flash + Pro
      ▲                    │
      │                    ▼
      │           Candidate Programs
      │                    │
      │                    ▼
      │           Automated Evaluator
      │                    │
      │                    ▼
      └── select best ── Evolutionary Database
Repeat for thousands of generations. The LLMs propose; the evaluator disposes.
2. Why it matters
2.1. Strassen to 48 multiplications: the 56-year result
For decades, the best known way to multiply two 4x4 complex matrices was Strassen's recursive decomposition (1969), which uses 49 scalar multiplications. AlphaEvolve found a new algorithm that does it in 48, the first improvement for this setting since Strassen.
For non-ML readers: matrix multiplication sits at the bottom of basically all modern computing (graphics, simulation, neural networks, scientific computing). Any improvement compounds across every GPU kernel ever run.
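To see what "fewer scalar multiplications" buys, here is Strassen's classic 2x2 scheme, which uses 7 multiplications instead of the naive 8. Applied recursively to a 4x4 matrix treated as a 2x2 matrix of 2x2 blocks, it gives 7 x 7 = 49 multiplications, the count AlphaEvolve's 48 finally beat.

def strassen_2x2(A, B):
    """Strassen (1969): 2x2 matrix product with 7 multiplications instead of 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

# Sanity check against the naive 8-multiplication product.
A, B = ((1, 2), (3, 4)), ((5, 6), (7, 8))
naive = tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
              for i in range(2))
assert strassen_2x2(A, B) == naive  # ((19, 22), (43, 50))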
2.2. AlphaEvolve eats AlphaTensor
DeepMind's 2022 AlphaTensor was a specialist matrix-multiplication discoverer, and its 4x4 improvements were limited to mod-2 arithmetic. AlphaEvolve is a generalist, the same framework across many domains, and it beat Strassen on the general complex-valued 4x4 case that AlphaTensor couldn't touch.
The lesson: specialist to generalist is the dominant trajectory in algorithm discovery. If you're building a specialist agent today, assume a generalist will be competitive with it within 18 months.
2.3. Production, not just benchmarks
Here's what makes AlphaEvolve different from every "LLM discovered a new algorithm" paper before it: it's in production at Google.
- Borg scheduler heuristic: in production for over a year, recovering 0.7% of Google's worldwide compute on average. That's a genuinely large number in dollar terms.
- Gemini training kernel: 23% speedup on a critical matmul kernel, contributing to a 1% reduction in Gemini's total training time.
- FlashAttention GPU kernel: 32.5% speedup on the Transformer attention kernel.
- Verilog for a next-gen TPU: AlphaEvolve simplified an arithmetic circuit with verified functional correctness. Shipping in hardware.
2.4. 75% and 20% on open problems
On a suite of 50+ open problems in math (analysis, geometry, combinatorics, number theory):
- About 75% of the time, AlphaEvolve rediscovered the known state of the art.
- About 20% of the time, AlphaEvolve improved on the known state of the art.
- Setup for each problem took "hours," not weeks.
That's the fastest setup-to-result turnaround we've seen from an LLM system on research-level math problems.
2.5. The agentic pattern is dead simple
Distill AlphaEvolve down and you've got three primitives:
- LLM proposes variants
- Cheap automated verifier scores them
- Evolutionary selection concentrates compute on promising branches
That's the minimum viable discovery agent. It ports to any domain where (a) you can express the output as code and (b) you can automatically verify it.
3. How to do it
3.1. Minimal AlphaEvolve-style pipeline (Claude version)
import asyncio
from heapq import nlargest

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# ---------- 1. The evaluator ----------
def evaluate(program: str, test_cases: list[tuple]) -> float:
    """Exec the program, run it against test cases, return the fraction passed."""
    namespace = {}
    try:
        exec(program, namespace)  # NOTE: exec() on model output is unsafe; sandbox in real use
        fn = namespace["solve"]
        score = sum(1 for inp, expected in test_cases if fn(*inp) == expected)
        return score / len(test_cases)
    except Exception:
        return 0.0

# ---------- 2. The proposer (LLM) ----------
async def propose(parents: list[dict], task: str) -> str:
    context = "\n\n".join(
        f"# Score: {p['score']:.3f}\n{p['program']}" for p in parents
    )
    msg = await client.messages.create(
        model="claude-opus-4-20250514",  # "Pro" role; use the strongest model you have
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\n"
                       f"Here are previous attempts and their scores:\n{context}\n\n"
                       f"Write a NEW Python program defining `solve(...)` that scores higher. "
                       f"Only output the code.",
        }],
    )
    text = msg.content[0].text.strip()
    # Strip markdown fences the model may wrap around its code.
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return text

# ---------- 3. The evolutionary loop ----------
async def evolve(task, test_cases, generations=50, population=20, topk=5):
    population_db = []
    for gen in range(generations):
        # Parent selection: pure top-k here; real versions add diversity (see below).
        parents = (nlargest(topk, population_db, key=lambda p: p["score"])
                   if population_db else [])
        proposals = await asyncio.gather(*[
            propose(parents, task) for _ in range(population)
        ])
        for prog in proposals:
            score = evaluate(prog, test_cases)
            population_db.append({"program": prog, "score": score})
        best = max(population_db, key=lambda p: p["score"])
        print(f"Gen {gen}: best = {best['score']:.3f}")
    return best

# ---------- 4. Example usage ----------
asyncio.run(evolve(
    task="Implement `solve(a, b)` that returns a*b "
         "but uses at most 2 multiplications.",
    test_cases=[((3, 7), 21), ((-4, 2), -8), ((0, 99), 0)],
))
That's the whole pattern. You can and should upgrade it:
- Replace `evaluate` with your domain verifier (compile-time checker, symbolic prover, empirical benchmark).
- Replace Opus-only with an Opus + Haiku ensemble (Haiku as the "Flash" variant).
- Add diversity preservation: pick parents from clusters, not just the top-K (see the sketch after this list).
- Add program-level crossover: prompt for "combine parts of A and B."
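A minimal sketch of the diversity upgrade, assuming the population_db dicts from above: bucket programs by a crude structural signature and take each bucket's best, so structurally different lineages survive even when their scores lag.

def pick_parents(population_db: list[dict], k: int = 5) -> list[dict]:
    """Cluster-ish selection: best program per bucket instead of a global top-k."""
    buckets: dict[int, dict] = {}
    for p in population_db:
        key = len(p["program"]) // 200  # crude signature; real systems cluster on behavior
        if key not in buckets or p["score"] > buckets[key]["score"]:
            buckets[key] = p
    reps = sorted(buckets.values(), key=lambda p: p["score"], reverse=True)
    return reps[:k]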
3.2. Domains where this pattern works today
- Algorithm design: matrix multiplication, scheduling heuristics, sorting variants.
- Kernel optimization: CUDA, Triton, Metal, TPU kernels with benchmarked throughput as the score.
- Theorem proving / formal verification: Lean/Coq proof search with typecheck as the verifier.
- Program synthesis: any domain with unit tests. Your test suite is the evaluator.
- Chip/RTL design: Verilog plus functional-verification suite.
- Scientific hypothesis generation: when "does this hypothesis match experimental data" is cheap to check.
3.3. Domains where it struggles
- Open-ended creative tasks: the score function is subjective; LLM-as-judge is too noisy for evolutionary pressure.
- Long-horizon tasks: many generations times an expensive evaluator equals prohibitive cost.
- Domains where you can't afford full verification: an incorrect winner corrupts all descendants.
3.4. Production-scale considerations
Running AlphaEvolve-style pipelines at scale needs:
- A distributed evaluator: evaluator throughput is the bottleneck; parallelize aggressively.
- A caching layer: many proposals will be near-duplicates; cache eval results.
- Cost caps: bound total compute per generation; LLM calls compound.
- Diversity tracking: population entropy metrics to catch premature convergence.
- Program-level deduplication: normalize the AST before comparing (a combined caching/dedup sketch follows below).
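A minimal sketch combining the caching and dedup items, assuming the evaluate() from section 3.1: parse and re-unparse each candidate so formatting and comments don't defeat the cache, then key results on a hash of the normalized source.

import ast
import hashlib

_eval_cache: dict[str, float] = {}

def normalized_key(program: str) -> str:
    """Hash a formatting-insensitive form of the program."""
    try:
        canonical = ast.unparse(ast.parse(program))  # strips comments/whitespace
    except SyntaxError:
        canonical = program  # broken programs still get cached (they score 0 anyway)
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_evaluate(program: str, test_cases: list[tuple]) -> float:
    key = normalized_key(program)
    if key not in _eval_cache:
        _eval_cache[key] = evaluate(program, test_cases)  # evaluate() from section 3.1
    return _eval_cache[key]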
4. What AlphaEvolve tells the rest of us
4.1. LLMs plus verifiers beat LLMs alone
The headline finding of agent research in 2024-2025: paired with a cheap, correct verifier, an LLM can solve problems it cannot solve on its own. AlphaEvolve is the strongest public example.
4.2. Evolutionary selection beats linear sampling
Given a fixed compute budget, selecting from a population of programs and iterating beats one-shot sampling of N programs from the same LLM. It's the classical AlphaGo lesson applied to code.
4.3. The evaluator is the moat
Two teams with the same access to Gemini differ only in the quality of their evaluators. For algorithm discovery, that means:
- Formal verifiers beat unit tests, which beat LLM judges, which beat humans in the loop.
- Automated benchmarking infrastructure beats ad-hoc scripts.
- Bet your eng hours on evaluator quality, not prompt engineering.
4.4. Hybrid ensembles beat monolithic models
Flash (fast, diverse) plus Pro (slow, deep) outperforms either alone. The pattern generalizes. Expect to see Haiku plus Opus pairs, gpt-5-main plus gpt-5-thinking pairs, and hybrid open-plus-closed ensembles for cost-sensitive discovery tasks.
5. Key takeaways
- The minimum agentic discovery system is: LLM plus verifier plus evolutionary loop. You can build this in a weekend.
- Your verifier is your moat. Invest there before you invest in prompts.
- Ensembles of fast plus slow models work better than one big model. Use Flash-class for breadth, Pro-class for depth.
- It already ships. 0.7% of Google's compute. 23% Gemini training speedup. This isn't futurism.
- Spec to code to verify is a universal template. Every domain with a sound verifier is a candidate application.