Building Erdős Navigator: Claude Code as a Theorem Proving Assistant
How program synthesis led me down a rabbit hole through HoTT and algebraic topology, and back out through LLMs and a toolkit for 1,179 math problems from Paul Erdős, 651 of them still unsolved
I love program synthesis. The idea that you can describe what a program should do and have a machine find the program for you. It’s the thing that got me into research in the first place.
And if you care about synthesis, you end up caring about verification, because you need to know if the program your machine found actually does what you asked. Verification uses formal logic. Formal logic takes you to type theory. Type theory takes you to Homotopy Type Theory. HoTT takes you to univalent mathematics. And then you’re reading about abstract algebraic topology at 2am wondering how you got here.
I got deep. Too deep. You know Cédric Villani? Fields Medal winner, proved nonlinear Landau damping and convergence to equilibrium for the Boltzmann equation, wears a giant spider brooch everywhere. He wrote a book called Birth of a Theorem about the obsessive process of working through a proof. I read that book. I watched his lectures. I started getting weirdly into spiders. I was re-watching Spider-Man. That’s how deep I was. That’s when I knew I’d lost the plot.
I started losing focus on the thing I actually cared about, which was synthesis. The math was beautiful but I was drowning. I’m not a mathematician. I can read papers, I can follow proofs if I sit with them long enough, but writing a full proof from scratch takes me days even for simple stuff. I’m a noob and I know it.
But here’s what I did understand: program synthesis and program verification go hand in hand. Better logic verification, whether formal or even pseudo-formal, makes synthesis better. And the verification side can lean on real tools: Z3, theorem provers, the math side of logic.
Then 2022 and 2023 happened. LLMs got good. (In brief: a lot went down. Scaling laws. Chain-of-thought. GPT-4. Code generation that actually worked. DeepSeek. Open weights. The whole landscape shifted in about 18 months and I’m skipping over it like it was nothing but honestly it deserves its own post.)
And I realized the LLM-plus-solver pattern was the move. Not replacing Z3 with an LLM; that would be stupid, because LLMs hallucinate and formal solvers don’t. But using the LLM as the front end, the part that proposes candidates and decomposes problems, while keeping formal tools like Z3 or Lean as the back end that verifies. The LLM searches, the solver checks. That’s pseudo-synthesis. You’re not exhaustively searching a program space anymore, you’re letting a language model propose candidates and then verifying them formally.
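Here’s a minimal, runnable sketch of that propose-then-verify loop. Both halves are stand-ins: `propose` fakes the LLM front end by yielding hand-written candidates, and `verify` fakes the formal back end by brute-forcing a bounded domain (a real setup would hand the spec to Z3 or Lean instead). None of this is the navigator’s actual code.

```python
import itertools

# Stand-in for the LLM front end: in reality these candidates would be
# sampled from a model, not hard-coded.
CANDIDATES = [
    lambda a, b: a,                  # wrong: ignores b entirely
    lambda a, b: a if a > b else b,  # correct max(a, b)
]

def propose():
    """Yield candidate programs, LLM-style (cheap, possibly wrong)."""
    yield from CANDIDATES

def verify(f):
    """Stand-in for the formal back end: check the spec f(a, b) == max(a, b).

    Brute force over a bounded domain; a real verifier (Z3, Lean) would
    prove this for all integers instead of sampling.
    """
    return all(f(a, b) == max(a, b)
               for a, b in itertools.product(range(-5, 6), repeat=2))

def synthesize():
    """The loop: the LLM proposes, the solver checks, first survivor wins."""
    for f in propose():
        if verify(f):
            return f
    return None

found = synthesize()
```

The shape is the whole point: the proposer is allowed to be wrong, because nothing it says is trusted until the checker signs off.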
Khan et al. published a paper called LLM-Guided Compositional Program Synthesis that shows a version of this: when an LLM fails on a hard programming-by-example task, you decompose it into subtasks and let the LLM solve each one. It beats self-reflection approaches. The paper is specifically about PBE, not theorem proving, but the decomposition pattern is the same one that keeps showing up everywhere.
Then AI started solving Erdős problems
If you haven’t been following this, here’s what happened. In late 2025, a bunch of Paul Erdős’s unsolved problems started getting solved, or at least re-solved, with AI assistance.
Terence Tao posted that Erdős Problem #728 was “solved more or less autonomously by AI.” What he found most interesting wasn’t the solution itself but the “emerging AI-powered capability to rapidly write and rewrite expositions of the solution.” The writing and communication part, not just the math.
Then 23 problems on erdosproblems.com got marked as solved in a single month. But here’s the thing. Most of them weren’t new solutions. They were existing solutions that nobody had connected to the problems. As Marek Kowalkiewicz wrote, the AI was doing matching work. Finding papers in obscure journals, connecting different terminology, bridging semantic gaps between fields. He calls it “the Erdős Gap,” the phenomenon where answers already exist but remain undiscovered.
This hit me hard because it’s exactly what program synthesis is about. You’re not always generating something new. Sometimes you’re searching through existing knowledge to find the piece that fits.
The Hacker News discussion around ChatGPT 5.2 Pro’s solution to Erdős Problem 281 was telling. People argued about whether the LLM was doing real reasoning or just regurgitating training data. But Tao pointed out that the proof approach was “rather different from the literature proof.” And the prior solution used Rogers’ Theorem, which appeared in a 1966 textbook footnote so obscure that even Erdős himself may not have known about it when he posed the problem in 1980.
The coding agent as RL agent
So if AI can find obscure theorems in textbook footnotes and match solutions across semantic gaps, the question becomes: how do you set that up? How do you give an agent the right environment to search in? That’s where this connects to what I actually build. Think about a coding agent like Claude Code as a reinforcement learning agent.
This isn’t just an analogy. The SWE-agent paper (NeurIPS 2024) literally models it this way. They frame software engineering as an interactive agent-environment problem with a designed action space: file editing, navigation, shell commands, submission. The agent observes, acts, and gets feedback. Since then, SWE-Search (ICLR 2025) added Monte Carlo Tree Search to balance exploration and exploitation, and Agent-R1 extended the MDP framework specifically for LLM agents with tool use.
So the framing is real. Your coding agent has an action space. It can look things up, read files, search databases, call tools, write code, run programs. It has an environment. And it has a goal.
The problem is that most people give their coding agent a blank environment and expect it to figure everything out from its training data. That’s like dropping an RL agent into a maze with no map and no prior exploration. It can stumble around and maybe find the exit, but it’s going to be slow and unreliable.
What if you gave the agent a navigator? A structured environment it can explore. A dataset it can query. Examples it can learn from. Tools it can call to verify its work.
That’s what I built with the erdos-navigator.
The Erdős Navigator
The project is a toolkit that makes all 1,179 of Erdős’s problems programmatically accessible. The point is that a coding agent can use the REST API, CLI, or Python SDK, along with a set of Claude Code skills, all sitting on top of a SQLite database. The database becomes the navigator. It’s how the agent learns about the environment it’s operating in, instead of relying purely on what’s in its training data.
Here’s what’s in the database:
- 1,179 problems with full LaTeX statements
- 651 of them still open
- 104 with monetary prizes (about $37,420 in unsolved bounties)
- Community reactions from erdosproblems.com (hard, easy, formalisable, working_on)
- Lean 4 formalization links for all problems via Google DeepMind’s formal-conjectures repo
- 354 problems already formalized
- 582 contributors tracked
- Full-text search across all problem statements
The data comes from three sources: erdosproblems.com (Thomas Bloom’s site), Terence Tao’s structured metadata repo, and Google DeepMind’s Lean formalizations.
How the agent uses it
The API has an endpoint specifically for agents: /agent/suggest. It returns problems ranked by a difficulty score computed from community reactions and a tractability signal. The agent doesn’t just get a list of problems. It gets problems it has a realistic shot at, with context about what’s been tried, what’s been formalized, and what the community thinks.
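The idea behind that ranking can be sketched in a few lines. The weights, field names, and clamping here are all made up for illustration; the real /agent/suggest formula may differ, but the shape is the same: community “hard” votes push the score up, “easy” and “working_on” pull it down, and an existing Lean formalization counts as a tractability signal.

```python
def difficulty_score(reactions, formalized):
    """Hypothetical difficulty score in [0, 1]; lower = more tractable.

    `reactions` is a dict of community reaction counts, e.g.
    {"hard": 3, "easy": 0, "working_on": 1}. Weights are illustrative.
    """
    raw = (0.5
           + 0.10 * reactions.get("hard", 0)
           - 0.10 * reactions.get("easy", 0)
           - 0.05 * reactions.get("working_on", 0))
    if formalized:
        raw -= 0.1  # a Lean statement exists: a tractability signal
    return max(0.0, min(1.0, raw))

# Two toy problems: suggest the more tractable one first.
problems = [
    {"number": 137, "reactions": {"easy": 1, "working_on": 2}, "formalized": True},
    {"number": 108, "reactions": {"hard": 4}, "formalized": False},
]
ranked = sorted(problems,
                key=lambda p: difficulty_score(p["reactions"], p["formalized"]))
```

Sorting ascending puts the realistic targets first, which is exactly what you want an agent to see at the top of its suggestion list.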
The Claude Code skills in .claude/skills/ define the workflow. There’s an erdos-solver skill that walks the agent through six phases: select a problem, understand it deeply, pick a technique, attempt a solution, run a post-solution pipeline, and document everything. There’s a math-techniques library organized by domain (number theory, combinatorics, graph theory, geometry) with specific mappings like “Are there infinitely many…” → try Pell equations.
But the part I’m proudest of is the verification skill. It has the agent grade each component of its proof on an A-to-F scale. A means formal-ready, F means unverified claim. And the key question it asks: “Would a hostile referee accept this?”
That matters because LLMs are confident bullshitters. They’ll write a proof that reads beautifully and is completely wrong. The verification skill forces the agent to be honest with itself before declaring victory.
The workflow also enforces a local-first philosophy. Before the agent searches the web for anything, it checks the local database first. ./erdos get {number}. Then it checks the literature tracking file, which logs 16 problems already solved by AI with notes on which are novel vs rediscoveries. Then it checks prior solution attempts. Only after exhausting local sources does it go to the web. This is the navigator pattern: the agent has a rich local environment to explore before it goes wandering.
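That lookup order is just a fallback chain. Here’s an illustrative sketch, not the skill’s actual code; the source names are stand-ins for the database, the literature tracking file, and the prior-attempts log:

```python
def lookup(n, local_sources, web_search):
    """Local-first: exhaust every local source before touching the web."""
    for source in local_sources:
        hit = source.get(n)
        if hit is not None:
            return hit
    return web_search(n)  # last resort only

# Illustrative stand-ins for the three local layers the workflow checks.
db = {728: "full statement + metadata for #728"}          # ./erdos get
literature = {281: "solved by AI; see literature notes"}  # literature file
attempts = {}                                             # prior attempts

result = lookup(728, [db, literature, attempts],
                lambda n: f"web search for #{n}")
```

The chain makes the navigator pattern concrete: the web call exists, but the agent can only reach it after every local source has returned nothing.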
What happened when I used it
I’ve run this workflow on a few problems. Problem #137 asked whether the product of three or more consecutive integers can be a powerful number. The agent worked through three cases, used Pell equations and a periodicity argument mod 45, and arrived at a complete proof that the answer is no. The verification skill rated every component A or B. The novelty check came back clean. It’s likely a novel result, pending peer review.
Problem #108 was more interesting, honestly. It’s a chromatic number and girth problem. The agent tried four approaches: edge partition, iterative deletion, random subgraph, structural decomposition. All of them hit the same barrier at r=5. The r=4 case is tractable because 4-cycles have clique structure you can exploit. At r=5 that structure disappears and every approach fell apart at the same point.
The agent documented exactly why each approach failed, what the barrier was, and concluded that the problem “likely requires new techniques beyond edge-partition methods.” No fake proof. No hand-waving past the gap. Just a clear explanation of where the wall is.
I think that’s actually the more valuable output. Anyone can generate a proof-shaped document that doesn’t survive scrutiny. Knowing when you’re stuck, knowing exactly why you’re stuck, and being honest enough to stop there instead of pushing through with garbage, that’s the skill that matters. The verification framework forced the agent to do this. Without it, I’m pretty sure it would have produced something that looked like a proof and wasn’t.
This is the agent acting like an RL agent with a proper environment. It has observations (the problem database, the technique library, prior attempts), actions (search, lookup, prove, verify), and a reward signal (does the proof survive a hostile referee).
The CLI
If you want to poke around the problems yourself without spinning up the API, there’s a CLI. I use it mostly when I’m reading about a problem and want to quickly check if it’s been formalized or if anyone’s marked it as tractable:
```bash
# Get a specific problem
./erdos get 728

# Search for problems about prime gaps
./erdos search "prime gaps"

# Find tractable open problems
./erdos find --tractable --open

# See what problems have prizes
./erdos prizes
```
The navigator pattern
The Erdős Navigator is one instance of something I think matters a lot for agentic engineering. I’m calling it an “environment navigator.” It’s the structured dataset and tooling that gives your coding agent something to explore beyond its own training data.
What I ended up needing was: a database the agent could search (SQLite with FTS5, nothing fancy), an API that returns more than raw data (computed difficulty scores, tractability signals, stuff that helps the agent decide what to try next), some curated starting points so it doesn’t waste cycles searching through everything, a way to verify its own work, and skills that describe the process. I didn’t plan all of that upfront. I added each piece when the agent hit a wall without it.
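The search layer really is nothing fancy. Here’s roughly what SQLite full-text search with FTS5 looks like, using a toy two-row schema as a stand-in for the real one (Python’s bundled sqlite3 usually ships with FTS5 enabled, so this runs with no dependencies):

```python
import sqlite3

# Toy schema standing in for the real database; the actual
# erdos-navigator tables surely differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE problems USING fts5(number, statement)")
con.executemany("INSERT INTO problems VALUES (?, ?)", [
    ("728", "A question about consecutive primes."),
    ("137", "Can a product of consecutive integers be powerful?"),
])

# Full-text search with prefix matching, the way a search
# endpoint would resolve a query like "prime gaps".
rows = con.execute(
    "SELECT number FROM problems WHERE problems MATCH ?", ("prime*",)
).fetchall()
```

An FTS5 virtual table gives you tokenized matching and ranking for free, which is plenty for an agent that mostly needs “find me every problem statement mentioning X.”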
You could build this for anything. Protein folding datasets. Open-source bug databases. Legal case archives. I’m planning to add direct integrations with theorem provers into the Erdős Navigator skills so the agent can call out to Lean or Z3 mid-workflow without leaving the loop. The point is always the same: don’t drop your agent into a blank room and expect brilliance. Give it a world to explore.
Where this leaves me
I spent years going from program synthesis to type theory to algebraic topology and back. The scenic route. Most of it was way over my head and I knew it the whole time. But the core insight survived the detour: you can’t synthesize anything useful if you can’t verify it, and the better your verification gets, the harder you can push your synthesis.
LLMs didn’t replace Z3 or Lean. They added a new layer. The LLM proposes, the formal system verifies. The coding agent searches, the environment navigator guides. It’s the same pattern at every level.
I’ve also been tinkering with building the coding agent itself. I’m calling it bourbaki. The idea is that understanding how to design the agent loop, the observe-act-verify cycle, is one of the best ways to understand how AI-assisted theorem proving actually works under the hood. That’s a story for another post.
For now, the Erdős Navigator is open and usable. 1,179 problems, 651 of them still open. Some of them have been open for fifty years. Maybe an agent with a good navigator can find what’s been hiding in a 1966 textbook footnote.
Or maybe not. I don’t know. I’m a noob with a database and a stubborn coding agent. But I’d rather be searching than not.
References
Research
- Khan, R., Gulwani, S., Le, V., Radhakrishna, A., Tiwari, A., & Verbruggen, G. (2025). LLM-Guided Compositional Program Synthesis.
- Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K.R., & Press, O. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024.
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search. ICLR 2025.
- Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning. 2025.
AI and Erdős Problems
- Tao, T. (2026). Erdős Problem #728 solved autonomously by AI. Mathstodon.
- Kowalkiewicz, M. (2026). Already Solved. Substack.
- AI Cracks Legendary Erdős Problems. The Neuron Daily.
- Erdős 281 Solved with ChatGPT 5.2 Pro. Hacker News.
Books
- Villani, C. (2012). Birth of a Theorem: A Mathematical Adventure.
Data Sources
- Bloom, T. erdosproblems.com.
- Tao, T. erdosproblems structured metadata. GitHub.
- Google DeepMind. formal-conjectures. GitHub.