A flashy claim met sober reality: Google DeepMind says it can industrialize mathematical discovery, and the early record backs parts of that up. Fields Medalist Terence Tao has already built new work on top of the system’s outputs, as described in his detailed report. The gains are real—but they’re constructive, not turnkey proofs.
What’s actually new
AlphaEvolve doesn’t fine-tune a single solver. It evolves entire search strategies that generate candidate constructions, then ruthlessly scores them. That reframes many math problems as engineering problems: define a conservative verifier, then unleash vast search. Setup time drops from weeks of bespoke code to hours of prompt-plus-scorer design. That matters.
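To make the reframing concrete, here is a minimal sketch of the scorer-first pattern, using a hypothetical toy problem (spread n points in the unit square) and plain random search standing in for LLM-evolved strategies. Everything here (the problem, the names, the budget) is illustrative, not DeepMind's code.

```python
import random
import itertools

def score(points, n=8):
    """Conservative scorer for a toy packing problem: place n points in the
    unit square to maximize the minimum pairwise squared distance. Any
    constraint violation returns -inf, so the search cannot exploit slack."""
    if len(points) != n:
        return float("-inf")
    if any(not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) for x, y in points):
        return float("-inf")
    return min((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in itertools.combinations(points, 2))

def search(budget=50_000, n=8):
    """Dumb random search stands in for evolved strategies: the point is that
    the verifier, not the proposer, defines the problem."""
    best, best_score = None, float("-inf")
    for _ in range(budget):
        cand = [(random.random(), random.random()) for _ in range(n)]
        s = score(cand, n)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

best, s = search()
print(f"best min squared distance: {s:.4f}")
```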
The Breakdown
• AlphaEvolve evolves search strategies to find mathematical constructions, cutting setup from weeks to hours
• System matched or improved known bounds across a 67-problem battery but hasn't cracked famous conjectures like Sendov's
• Fields Medalist Tao builds proofs from its outputs, showing the human-AI research loop already works
• Real shift: high-quality mathematical constructions now discoverable at industrial scale, freeing humans for theory
On a 67-problem battery across analysis, combinatorics, and geometry, the system frequently matched known best constructions and nudged several state-of-the-art bounds. The list ranges from finite-field Kakeya and Nikodym sets to sphere packings and kissing-number configurations. Some wins are modest. Some are striking because they’re weird.
How it works, in practice
AlphaEvolve has two modes. In search mode, the LLM writes programs that search within a fixed compute budget; separating expensive LLM calls from the inner loop lets the system examine millions of candidates cheaply. In generalizer mode, it evolves code that solves families of instances across inputs, pushing toward formulas rather than one-off exemplars. It’s evolution over algorithms, not parameters. Simple idea. Big leverage.
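A skeleton of that outer loop, in the search-mode spirit: a population of candidate strategies, selection by score, variation by mutation. In AlphaEvolve the mutation operator is an LLM rewriting code; the random perturbation below is a stand-in so the sketch stays self-contained and runnable.

```python
import random

def evolve(score, init, mutate, generations=200, pop_size=16, keep=4):
    """Evolution over candidate strategies, not model parameters.
    score/init/mutate are problem-specific plug-ins."""
    pop = [init() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:keep]                       # keep the elite
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - keep)]
    return max(pop, key=score)

# Toy usage: find the peak of a 1-D bump by evolving a single number.
best = evolve(score=lambda x: -(x - 1.7) ** 2,
              init=lambda: random.uniform(-10.0, 10.0),
              mutate=lambda x: x + random.gauss(0.0, 0.5))
print(f"found x = {best:.3f}")
```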
Diversity helps. Costly models tend to converge faster; cheaper ones inject variability that sometimes escapes local optima. Teams can mix both, assigning premium calls to the bottleneck stages and using bargain runs to widen the search frontier. It’s less magic than portfolio management.
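One way the mixing could look in code; the model tiers and the call_llm interface are hypothetical placeholders, not a real API:

```python
import random

PREMIUM, BARGAIN = "big-model", "small-model"    # hypothetical model tiers

def propose(leader, frontier, call_llm, explore_frac=0.75):
    """Portfolio-style proposal step: most calls go to the cheap model for
    diverse mutations; the rest refine the current leader with the premium
    model. call_llm(model, prompt) is an assumed interface."""
    if random.random() < explore_frac:
        parent = random.choice(frontier)          # widen the search frontier
        return call_llm(BARGAIN, f"Rewrite this strategy aggressively:\n{parent}")
    return call_llm(PREMIUM, f"Refine this strategy carefully:\n{leader}")
```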
The evidence so far
Start with the comfortable: for classical inequalities—Hausdorff–Young, Gagliardo–Nirenberg, Young’s convolution—the system often rediscovered the known extremizers within an iteration or two. That validates the pipeline end-to-end.
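For concreteness, one of those targets in standard form (with the 2πi Fourier normalization): the sharp Hausdorff–Young inequality, whose Gaussian extremizers are exactly what the system rediscovered.

```latex
% Sharp Hausdorff--Young (Babenko--Beckner): for 1 \le p \le 2 and 1/p + 1/q = 1,
\[
  \|\hat{f}\|_{L^{q}(\mathbb{R}^{n})}
    \;\le\; \left(\frac{p^{1/p}}{q^{1/q}}\right)^{n/2}
    \|f\|_{L^{p}(\mathbb{R}^{n})},
\]
% with Gaussians as the extremizers (Babenko--Beckner--Lieb).
```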
The more interesting results are less pretty. In autocorrelation-type inequalities, AlphaEvolve found highly irregular step-function extremizers that improved published constants. They’re messy and counterintuitive. They also work. In higher-dimensional geometry, the system matched standard linear-programming baselines at low dimensions and nudged past them where human intuition thins out. Sphere packings, kissing numbers, dense configurations—incremental in size, nontrivial in effort saved.
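A toy flavor of those autocorrelation-type searches, sketched under loose assumptions rather than the paper's exact formulation: candidates are nonnegative step functions with unit mass, and the score is the peak of the autoconvolution f*f (lower is better). The irregular height patterns are where the weirdness lives.

```python
import numpy as np

def autoconvolution_peak(heights, width=0.5):
    """Score a nonnegative step function f on an interval of the given width:
    normalize to unit mass, then return the max of f*f. For step functions on
    a uniform grid, f*f is piecewise linear with knots on the grid, so the
    grid values capture the exact peak."""
    h = np.asarray(heights, dtype=float)
    if (h < 0).any():
        return float("inf")               # reject invalid candidates outright
    dx = width / len(h)
    h = h / (h.sum() * dx)                # enforce integral of f equal to 1
    return (np.convolve(h, h) * dx).max()

# A jagged candidate of the kind the search favors over smooth intuition:
print(autoconvolution_peak([1.0, 0.2, 1.3, 0.1, 0.9, 1.1]))
```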
Finite-field constructions point to the research model here. AlphaEvolve surfaced Kakeya and Nikodym candidates that inspired fresh human arguments. Tao and collaborators then extracted structure and produced proofs, sometimes with help from symbolic-reasoning tools and formal verification. That’s the loop: machine proposes, human disposes.
The pipeline that (mostly) runs
DeepMind frames this as a pipeline, not a product. AlphaEvolve supplies candidates; a separate component drafts informal proofs; a formal layer checks them in systems like Lean. End-to-end, that’s discovery, explanation, and certification. Today, the full arc lands for select problems with clean structure and verifiers. It’s not universal yet. Still, the direction is clear.
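That formal layer is ordinary proof-assistant code. A deliberately tiny Lean 4 sketch of the kind of machine-checkable certificate the last stage produces; real certificates run far longer:

```lean
-- A toy certificate: the kernel verifies the arithmetic by computation.
-- Real certificates for constructions are much larger, but the idea is the
-- same: a statement the proof assistant either accepts or rejects, no slop.
example : 3 * 3 + 4 * 4 = 5 * 5 := rfl
```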
Verification is the hard part. If your scorer allows floating-point slop, the agent will exploit it. If your geometry checker samples too coarsely, it will “clip” through constraints. The teams switched to exact or interval arithmetic and designed conservative, worst-case scoring. Once they hardened the verifiers, the results held up. Lock the gate, then let the foxes run.
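A sketch of what hardening can mean in practice, reusing the toy packing problem from above. Fraction(float) converts exactly, so no floating-point slop survives the re-check:

```python
from fractions import Fraction
from itertools import combinations

def certify(points, claimed_sq_dist):
    """Re-check a float-found configuration in exact rational arithmetic.
    Returns True only if every pair really achieves the claimed squared
    distance and every point really lies in the unit square."""
    pts = [(Fraction(x), Fraction(y)) for x, y in points]   # exact conversion
    bound = Fraction(claimed_sq_dist)
    ok_pairs = all((ax - bx) ** 2 + (ay - by) ** 2 >= bound
                   for (ax, ay), (bx, by) in combinations(pts, 2))
    ok_box = all(0 <= x <= 1 and 0 <= y <= 1 for x, y in pts)
    return ok_pairs and ok_box
```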
Economics and scale
Industrialized search means parallelism, and parallelism has shape. More threads cut wall-clock time but raise total compute. Expensive LLMs reduce the number of generations; cheap ones bulk up the branching factor. The pragmatic strategy is a tiered mix tied to problem phase and priority. It’s classic exploration versus exploitation with a GPU budget.
Just as important: setup velocity. Historically, these problems demanded careful parameterization by domain experts before any search could start. Here, experts invest in the score function and a lean problem statement; the tool iterates on everything else. That pushes the scarce human time to where it compounds.
Limits, stated plainly
No famous conjecture fell. On Sendov, Sidorenko, Crouzeix, and other marquee targets, AlphaEvolve matched the usual candidates and stopped. In analytic number theory—e.g., designing sieve weights—the system underperformed even with expert nudges. And the team warns against reading small-n improvements as asymptotic truths. Healthy humility, baked in.
Even the success stories need care. Many “wins” are constructive records, not certified optimality. Formal verification covers only part of the portfolio today. Progress will depend on stronger proof tooling, larger curated problem sets, and verifiers that are both airtight and efficient. It’s a marathon, not a press release.
The shift to “constructive at scale”
The big idea is neither “AI proves theorems” nor “AI replaces mathematicians.” It’s that high-quality constructions—the raw ore of many proofs—are now discoverable at industrial scale. Humans still do the distillation. The loop is already producing papers and, more importantly, new lines of attack. That’s a real shift.
Why this matters
- Search, not insight, is the bottleneck in many areas of math; scaling constructive search frees human cycles for proofs and theory.
- Compute and verification now shape who advances in pure math, extending winner-take-most dynamics from AI labs to theoretical research.