Building Bourbaki: A Math Agent That Can Check Its Own Work
How I built an autonomous agent for mathematical reasoning with SymPy, Lean 4, and structured proof techniques, and what I learned along the way
I learn by building things. The way I figure out if an idea makes sense is to sit down and build the thing and see what happens. Reading papers helps. Thinking about it helps more. But building it is where I actually learn.
The Erdos Navigator came from wanting to give a coding agent a structured environment for math problems. It’s a database of 1,179 Erdos problems with a REST API, a CLI, and skills that tell the agent how to explore. It worked. The agent could find problems, check if they’d been formalized, understand what’s been tried. But it couldn’t compute. It couldn’t verify. It could talk about math but it couldn’t do math.
That gap bugged me. So I built Bourbaki.
Why math, why now
If you’re going to build agents that help with science, math is a good place to start. Not because math is easy for agents. It’s not. But because math has something most domains don’t: a way to check if you’re right. A proof is correct or it isn’t. Lean 4 can verify it mechanically. There’s no “well, it depends on your interpretation.”
This connects to something I’ve been thinking about since the Erdos Navigator work. The LLM proposes, the formal system verifies. That’s the pattern. And it maps straight onto program synthesis, which is what got me into this in the first place. LLM-guided synthesis is really just this pattern applied to code: the model generates candidates, something else checks them. Whether “something else” is a test suite, a type checker, or a proof assistant, the loop is the same.
If you’ve read the Erdos Navigator post, you might wonder: why Lean 4 and not Z3? I used Z3 in my earlier synthesis work and it’s good at what it does. But Z3 is an SMT solver. It answers “is this set of constraints satisfiable?” It works in decidable theories: boolean logic, linear arithmetic, bit vectors, arrays. You give it a formula and it says yes or no. That’s perfect for program synthesis where you’re checking if a candidate program meets a specification.
But math proofs aren’t constraint satisfaction problems. When you want to prove that the square root of 2 is irrational, you’re not asking “is this satisfiable?” You’re constructing an argument from axioms, using lemmas and theorems that other people have already proven. That’s what Lean 4 does. It’s a proof assistant with Mathlib, a library of over 100,000 formalized mathematical results. You can build on top of existing proofs. You can use tactics like ring, omega, norm_num that automate entire categories of reasoning. Z3 can tell you a formula is valid. Lean can tell you a proof is correct, and it can help you find one.
For Bourbaki, I wanted both: SymPy for the kind of computation Z3 is bad at (symbolic integration, series expansion, matrix operations), and Lean for the kind of verification Z3 can’t do (multi-step mathematical proofs that build on Mathlib). SymPy computes, Lean verifies.
Two recent papers frame why this matters right now.
The first is First Proof (Abouzaid et al., 2026). A group of mathematicians, including a Fields Medal winner, published ten research-level math problems and encrypted the answers. The point is to test whether AI systems can solve problems that haven’t been in any training set. Not textbook exercises. Not competition math. Real research questions from real mathematicians. The answers are locked until someone claims a solution. It’s a benchmark that can’t be gamed by memorization.
The second is Semi-Autonomous Mathematics Discovery with Gemini (Feng et al., 2026). They pointed Gemini at 700 open Erdos problems and resolved 13 of them. But here’s what’s interesting: most of those solutions already existed. The problems were “open through obscurity rather than difficulty.” The answers were sitting in journals nobody had connected to the right questions. The bottleneck wasn’t reasoning. It was search. Literature search, pattern matching, connecting results across different terminology and subfields.
The state of math + proofs + LLMs
Here’s who’s building what.
Aristotle (Harmonic, 2025) solved 5 out of 6 problems on the 2025 International Mathematical Olympiad. Gold-medal performance, machine-verified in Lean 4. Their system uses a custom 200B+ parameter model fine-tuned with reinforcement learning, Monte Carlo Graph Search over Lean tactic space, and a lemma decomposition pipeline that generates informal proofs, breaks them into lemmas, formalizes each one, and revises on failure. They also do test-time training, where the model retrains on its own search traces during inference. This runs on GKE clusters with massive compute. It’s a proof search engine.
LeanDojo (Caltech) is building the research tooling layer. Retrieval-augmented LLMs for Lean proof search, LeanCopilot as an IDE assistant, LeanAgent for lifelong learning in theorem proving. Their goal is to make formal verification more accessible by putting LLMs in the loop as proof assistants. If Aristotle is trying to replace the mathematician, LeanDojo is trying to give the mathematician better autocomplete.
Axiom Math AI is building a reasoning engine. They’re less public about their approach so far, but their framing is about “the deep structure of reasoning engines” and “the philosophy of discovery itself.”
And then there’s Bourbaki. Which I built in a weekend. With off-the-shelf LLMs and no custom training.
Part of what made that possible is that Lean 4 has gotten much more approachable. I learned the basics from Tyler Josephson’s Lean for Scientists and Engineers course. If you’re coming from a programming background and not a math department, I’d start there. Most Lean tutorials assume you already think in terms of dependent types and propositions-as-types. Josephson’s course doesn’t. It starts from “you’re a scientist or engineer and you want to verify things.” He walks through tactics like ring and omega and norm_num in a way that makes you realize Lean isn’t as alien as it looks. A lot of the proof automation is already there in Mathlib. You just need to learn how to find it and call it.
Kevin Buzzard’s Xena Project has been pushing Lean as a tool for working mathematicians for years, and 2024 was the year it started clicking for the broader community. Mathlib keeps growing. The tooling keeps improving. The barrier to entry is lower than it’s ever been, and that matters because it means someone like me can actually integrate Lean into a weekend project instead of spending months on prerequisites.
I’m not pretending these are the same thing. Aristotle’s Lean integration is orders of magnitude deeper than mine. It searches through tactic space using MCGS with a custom policy model. Bourbaki just sends Lean code to a subprocess and reports back pass or fail. Aristotle is a research system from a well-funded team pushing the frontier. Bourbaki is a tool I built because I wanted to understand the pattern.
But the pattern is the same across all of them: LLM proposes, formal system verifies, feedback loops back. Aristotle does it with Monte Carlo search and custom models. LeanDojo does it with retrieval-augmented generation and IDE integration. Bourbaki does it with SymPy, a Lean subprocess, and OEIS in a terminal.
That second Gemini paper is basically the thesis behind all of this work. The gap isn’t that models can’t reason about math. The gap is that they don’t have tools. Give them computation and they can check their algebra. Give them a sequence database and they can identify patterns instead of guessing. Give them a proof assistant and they can’t hand-wave past the hard parts.
Here is Bourbaki
Named after Nicolas Bourbaki, the collective pseudonym of a group of mathematicians who tried to rewrite all of mathematics from scratch using set theory.
Claude Code gives an LLM a shell and dev tools so it can write and run code. Bourbaki does the same thing for math: it gives an LLM a computer algebra system (SymPy), a proof assistant (Lean 4), and research APIs (OEIS, arXiv).
You ask a question in the TUI, the agent computes, verifies, looks things up, and streams the answer back. If it writes a proof, it can formalize it. If it makes a claim, it can check it.
The toolbox:
| Tool | What it does |
|---|---|
| Symbolic Compute | Native SymPy: simplification, integration, solving, 30+ operations |
| Lean Prover | Lean 4 + Mathlib, machine-checked formal proofs |
| Sequence Lookup | OEIS: identify and explore integer sequences |
| Paper Search | arXiv: find relevant papers and results |
| Web Search | Exa: search the web for mathematical references |
The stack is a Python backend (FastAPI + Pydantic AI) connected to a React + Ink TUI via Server-Sent Events. The TUI is a display client. All reasoning, tool calls, and state live in the backend.
src/ React + Ink TUI (display client)
├── components/ UI components (Input, AgentEventView, AnswerView)
├── hooks/ useAgentRunner (SSE bridge), useModelSelection
└── skills/ 21 SKILL.md proof technique files
backend/bourbaki/ Python backend (owns all state)
├── agent/ Pydantic AI agent, prompts, scratchpad, event mapper
├── tools/ SymPy, Lean 4, OEIS, arXiv, Web Search, Skills
├── sessions/ Persistence + context compaction
├── autonomous/ Long-running proof search with strategies
├── problems/ 13 classic problems database
└── server/routes/ FastAPI endpoints (query, sessions, skills, ...)
The loop
Here’s how the agent works when you give it a problem:
1. You ask a question in the TUI
2. The backend agent reasons about the approach
3. It calls tools: SymPy for computation, Lean for verification, OEIS/arXiv for lookup
4. Results feed back into the agent, which iterates if needed
5. A scratchpad enforces limits and deduplicates repeated calls
6. The final answer streams back to the TUI as it’s generated
The interesting parts are in steps 3 through 5.
The scratchpad. Without it, the agent loops. It’ll call the same tool with the same query rephrased slightly. The scratchpad tracks every tool call per query and does two things: it enforces a hard limit (3 calls per tool by default), and it detects duplicate queries using word-level Jaccard similarity at a 0.7 threshold. Split both queries into word sets, compute the intersection over union. If 70% of the words overlap, the scratchpad returns a warning: “Similar query already sent to {tool_name}. Consider a different approach.” The check happens before the tool executes, so the agent doesn’t waste a call.
At the limit, the agent gets a final warning: “This is the last allowed call to {tool_name}.” After that, the tool is blocked for the rest of the query. The scratchpad also formats a usage summary that gets injected into the iteration prompt, so the agent knows exactly how many calls it has left: search: 2/3 calls used (1 remaining). When tools are exhausted, the prompt tells the agent to work with what it has.
Skill files. These are proof techniques encoded as SKILL.md files with YAML frontmatter. There are 21 of them across five categories:
- Basic: induction, strong induction, direct proof, contradiction, pigeonhole, counting
- Analysis: epsilon-delta, convergence tests, sequence limits, inequality chains
- Geometry: coordinate proof, synthetic construction, transformations
- Algebra: group homomorphisms, ring ideals, polynomials
- Advanced: extremal arguments, probabilistic method, conjecture exploration, formalization, proof explanation
Each file walks the agent through a specific proof technique step by step, including which tools to use at each stage. They get loaded into the system prompt as available techniques. The agent picks the technique based on the problem, not the user.
This is the same pattern as the skills in the Erdos Navigator, but for proof strategies instead of problem-solving workflows.
Skills can be added at three levels: built-in (src/skills/), user (~/.bourbaki/skills/), or project (.bourbaki/skills/). So if you’re working on algebraic geometry and have a specific technique you keep using, you can write a SKILL.md for it and the agent will pick it up.
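To make the format concrete, here’s what a skill file might look like. This is a hypothetical example I wrote for illustration; the exact frontmatter fields and step wording in Bourbaki’s real SKILL.md files may differ.

```markdown
---
name: proof-by-induction
category: basic
description: Prove P(n) for all naturals via a base case and an inductive step
---

# Proof by Induction

1. State P(n) precisely. Use Symbolic Compute to sanity-check P(1), P(2), P(3).
2. Prove the base case P(1) directly.
3. Assume P(n) and derive P(n+1); verify the algebra with Symbolic Compute.
4. Formalize the argument with the Lean Prover, closing goals with tactics
   like ring, omega, or norm_num.
```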
The compute-then-verify pattern. This is the thing I care about most. SymPy proposes, Lean checks. The agent doesn’t get to say “it’s obvious that…” and move on. If it claims an algebraic identity, it can compute it. If it claims a proof step follows, it can formalize it.
Here’s what that looks like in practice:
❯ Prove that the sum of the first n integers equals n(n+1)/2
⏺ Thinking...
⏺ Symbolic Compute (expression=Sum(k, (k, 1, n)))
⎿ Computed result
⏺ Lean Prover
⎿ ✓ Verified in 2.3s
Proof by induction. Base case: n = 1, sum = 1 = 1·2/2. ✓
Inductive step: assume ∑_{k=1}^{n} k = n(n+1)/2.
Then ∑_{k=1}^{n+1} k = n(n+1)/2 + (n+1) = (n+1)(n+2)/2. ∎
The agent computed the symbolic sum with SymPy, then verified the full proof in Lean 4. Both steps happened inside the loop, not after the fact. That’s the difference between “the agent wrote a proof” and “the agent wrote a proof and checked it.”
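For context, a Lean 4 formalization of that identity might look like the following. This is my own sketch against Mathlib, not Bourbaki’s output; I state it with a factor of 2 to stay in natural-number arithmetic, and the exact syntax may need adjusting for your Mathlib version.

```lean
import Mathlib

-- 2 · (0 + 1 + ⋯ + n) = n · (n + 1), avoiding natural-number division
theorem sum_first_n (n : ℕ) :
    2 * ∑ k in Finset.range (n + 1), k = n * (n + 1) := by
  induction n with
  | zero => simp
  | succ n ih =>
      rw [Finset.sum_range_succ, Nat.mul_add, ih]
      ring
```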
The Lean integration is simple by design. The agent generates Lean 4 code, Bourbaki writes it to a temp file, runs lean (or falls back to lake env lean), and parses the output. The prover has a 30-second timeout. The response comes back as structured data: success, goals (remaining proof obligations), proofComplete, errors (each with line, column, message, severity), rawOutput, and duration in milliseconds.
Three things have to be true for proofComplete to be true: return code 0, no error-level messages, and no sorry in the code. That last one matters. sorry is Lean’s placeholder for “I’m skipping this step.” Lean accepts the file but the proof is incomplete. Bourbaki checks for it explicitly. The agent can’t sneak one past.
When Lean rejects something, the errors are structured. The prover parses them with a regex: filename:line:col: severity: message. The agent gets the full output and can try to fix the problem on the next iteration. Between that and the scratchpad’s 3-call limit on the Lean tool, the agent has at most three attempts to get a proof to compile before it has to work with what it has.
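That subprocess loop is small enough to sketch end to end. Again, this is my reconstruction, not Bourbaki’s source: the function names are hypothetical, but the temp file, the 30-second timeout, the filename:line:col: severity: message regex, and the explicit sorry check all come from the description above.

```python
import re
import subprocess
import tempfile

# filename:line:col: severity: message  (the diagnostic format described above)
DIAG_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): "
                     r"(?P<severity>error|warning|info): (?P<message>.*)$")

def parse_lean_output(output: str) -> list[dict]:
    """Extract structured diagnostics from Lean's stdout/stderr."""
    diags = []
    for line in output.splitlines():
        m = DIAG_RE.match(line)
        if m:
            d = m.groupdict()
            d["line"], d["col"] = int(d["line"]), int(d["col"])
            diags.append(d)
    return diags

def check_proof(code: str, timeout: float = 30.0) -> dict:
    """Write code to a temp file, run lean, report structured results."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["lean", path], capture_output=True,
                          text=True, timeout=timeout)
    errors = [d for d in parse_lean_output(proc.stdout + proc.stderr)
              if d["severity"] == "error"]
    return {
        "success": proc.returncode == 0,
        "errors": errors,
        # Complete only if it compiled, had no errors, and contains no sorry
        "proofComplete": (proc.returncode == 0 and not errors
                          and "sorry" not in code),
    }
```

The sorry check on the last line is the one that closes the loophole: Lean exits 0 on a file full of placeholders, so return code alone would let an incomplete proof through.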
SymPy and Lean play different roles here. SymPy runs in-process (no subprocess overhead) and covers number theory (factorization, primality, Euler’s totient, modular inverse), algebra (simplify, expand, solve), calculus (derivatives, integrals, limits, Taylor series, Fourier series, Laplace transforms), series (infinite sums and products), and linear algebra (determinants, eigenvalues, row reduction, characteristic polynomials). The test suite verifies things like: Sum(k, (k, 1, n)) evaluates to n*(n+1)/2, factor_integer(84) returns {2: 2, 3: 1, 7: 1}, limit(sin(x)/x, x, 0) returns 1, det([[1,2],[3,4]]) returns -2.
SymPy is forgiving. You can ask it to factor an integer or compute a series and it almost always gives you something useful. Lean is not forgiving. A proof that’s 95% correct fails exactly as hard as a proof that’s 5% correct. The type checker doesn’t grade on a curve.
There’s also an autonomous mode for harder problems. The agent tries different strategies, backtracks when stuck, and remembers what worked and what didn’t across iterations. It tracks dead ends so it doesn’t revisit them. You can start it with /prove <problem_id> and let it run.
Where this leaves me
I keep building these environments for agents and finding the same thing. The agent’s raw reasoning is fine. The bottleneck is always the environment. Drop it into a blank room and it hallucinates. Give it a database, tools, and structured techniques and it does real work.
The Erdos Navigator gave the agent a world of problems to explore. The honest-resume-for-your-coding-agent project gave it context about me. Bourbaki gives it math tools.
I’m not claiming Bourbaki solves research-level math. It doesn’t. Aristotle can solve IMO problems autonomously. Bourbaki can help you work through a proof and check your steps. Those are different things. The First Proof benchmark exists precisely because we don’t have widely accessible systems that can do research-level math yet. But Bourbaki can handle the kind of math that falls apart when an LLM tries to do it from memory. Symbolic computation. Sequence identification. Formalizing a proof that the agent already knows how to write in natural language.
The gap between “agent explains a proof” and “agent produces a verified proof” is still large. Lean 4 is unforgiving. Translating mathematical intuition into formal logic is hard for humans and it’s hard for agents too. I know my linear algebra and calculus, but writing a formal proof is a different story. Aristotle solves the formalization gap with MCGS and custom RL training. LeanDojo solves it with retrieval-augmented search. I solve it by watching the agent fail and then adding a skill file that teaches it the technique it was missing. Different approaches, different budgets.
The scratchpad started as a quick fix for the looping problem and turned into one of the most important components. Without it, the agent was confident and repetitive. With it, the agent was forced to actually try new things when something wasn’t working.
I want to add direct integration with the Erdos Navigator database so the agent can pull problems and work on them inside Bourbaki’s compute-verify loop. That’s the obvious next step: the navigator provides the problems, Bourbaki provides the tools. I’m also curious about what LeanDojo is doing with retrieval-augmented proof search. Their LeanCopilot could probably slot into Bourbaki’s tool chain and make the Lean integration much deeper than “send code, get pass/fail.”
Or maybe the obvious next step is something I haven’t thought of yet. I keep building these things and ending up somewhere I didn’t plan. That’s the point. I’d rather be applying and learning than theorizing on the sidelines about why it will or won’t work.
References
Research
- Abouzaid, M., Blumberg, A.J., Hairer, M., et al. (2026). First Proof.
- Feng, T., et al. (2026). Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems.
- Achim, T., et al. (2025). Aristotle: IMO-level Automated Theorem Proving. Harmonic.
Math + LLM tools
- LeanDojo. Caltech. LLM-powered tools for Lean theorem proving.
- Axiom Math AI. Reasoning engine for mathematical discovery.
Learning Lean
- Josephson, T. (2024). Lean for Scientists and Engineers. YouTube.
- Buzzard, K. (2024). Lean in 2024. Xena Project.