
Sutro Yaro: Agent-Driven Research on Hinton's Problems

How a SPEC issue, a wave of Claude Code agents, and GitHub PRs reproduced 53 Hinton experiments from 1981 to 2022 in a single day

Today, 53 experimental papers from Geoffrey Hinton's research agenda, spanning 1981 to 2022, got reimplemented in a single repository. I did not write a line of code. Claude Opus 4.7, running with its 1M-token context window, built the whole thing in under 800,000 tokens.

hinton-problems is the result. Each problem has its own folder: a Python file with the dataset, model, and training loop; a README following an 8-section template; a GIF showing the learning dynamics; visualizations of weights and curves. Pure NumPy and matplotlib. Every stub runs on a laptop CPU. The catalog spans Boltzmann machines (1985), backprop and family trees (1986), wake-sleep (1995), products of experts (2002), capsule routing (2017), GLOM (2021), and Forward-Forward (2022). 27 reproduce paper claims, 25 are partial reproductions with the gap documented, 1 fails to replicate. PRs #32 through #41 all merged on 2026-05-03.
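To make the shape concrete: here is a minimal sketch of what one stub file can look like, in the spirit of the encoder problems further down. The file name, dimensions, and hyperparameters are my illustration, not code from the repo.

```python
# encoder_424.py -- my illustration of a stub's shape, not code from the
# repo. A 4-2-4 encoder: the 2-unit bottleneck has to discover a binary
# code for four one-hot patterns. Pure NumPy, seconds on a laptop CPU.
import numpy as np

rng = np.random.default_rng(0)                        # fixed seed for reproducibility
X = np.eye(4)                                         # dataset: four one-hot patterns
W1 = rng.normal(0, 0.5, (4, 2)); b1 = np.zeros(2)     # input -> bottleneck
W2 = rng.normal(0, 0.5, (2, 4)); b2 = np.zeros(4)     # bottleneck -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(4000):                              # training loop
    H = sigmoid(X @ W1 + b1)                          # bottleneck activations
    Y = sigmoid(H @ W2 + b2)                          # reconstructions
    dY = Y - X                                        # cross-entropy grad wrt logits
    dW2 = H.T @ dY; db2 = dY.sum(0)
    dH = (dY @ W2.T) * H * (1 - H)
    dW1 = X.T @ dH; db1 = dH.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print((sigmoid(X @ W1 + b1) > 0.5).astype(int))       # the learned binary code
```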

This post is about how that happened. It’s a sequel to the first Sutro Yaro post, which covered the why — bounded environments, the drosophila of learning, agents as researchers. This one is the how. The SPEC is a GitHub issue. The collaboration is PRs. The unit of work is a wave of agents. And Hinton’s problems are the substance underneath.

The chain

Three repositories sit on top of each other.

Sutro is Yaroslav Bulatov’s. Sparse parity benchmark, pure Python, no PyTorch and no NumPy. Manual forward and backward, gradients computed by hand. Fused layer-wise updates so memory reuse distance stays small. The clock counts floats moving through cache, not flops. The point is to make the energy cost of a learning algorithm legible and small. Sutro is the thesis: 1960s problems, modern tools, energy as the optimization target.
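Sutro's code is Yaroslav's and I won't paste it, but the flavor sketches easily. Below is my illustration, not Sutro's implementation: a hand-written forward and backward pass on sparse parity in pure Python, one example at a time, with both layers' updates fused into a single loop over hidden units.

```python
# Sketch of hand-rolled forward/backward in pure Python (no NumPy, no
# PyTorch) on sparse parity -- the flavor of Sutro, not its actual code.
import math, random

random.seed(0)
n, k, hidden = 16, 3, 32                     # n-bit inputs, parity of k bits
idx = random.sample(range(n), k)             # the hidden parity subset
W1 = [[random.gauss(0, 0.3) for _ in range(n)] for _ in range(hidden)]
W2 = [random.gauss(0, 0.3) for _ in range(hidden)]

def step(lr=0.1):
    x = [random.randint(0, 1) for _ in range(n)]
    t = sum(x[i] for i in idx) % 2           # parity label
    # forward: one hidden layer of tanh units, logistic output
    h = [math.tanh(sum(W1[j][i] * x[i] for i in range(n))) for j in range(hidden)]
    y = 1 / (1 + math.exp(-sum(W2[j] * h[j] for j in range(hidden))))
    # backward: both layers updated in one pass over hidden units, so each
    # weight row is touched while it is still hot in cache
    dy = y - t                               # cross-entropy grad at the output
    for j in range(hidden):
        dh = dy * W2[j] * (1 - h[j] * h[j])
        W2[j] -= lr * dy * h[j]
        for i in range(n):
            W1[j][i] -= lr * dh * x[i]

for _ in range(20000):
    step()
```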

SutroYaro is the research workspace I built around Sutro for the Sutro Group — Yaroslav’s Monday meetup at South Park Commons. Locked harness, SHA256-checked. research/search_space.yaml says what an agent is allowed to mutate. findings/_template.md says what an agent has to produce. AGENT.md says how the agent has to behave. Multi-tool: Claude Code, Gemini CLI, Codex, Replit, all hitting the same harness. 36 experiments logged across three challenges.
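I won't reproduce the real files, but the idea fits in a few lines. A hypothetical search_space.yaml, with illustrative keys rather than the repo's actual contents, might read:

```yaml
# Hypothetical sketch of a search_space.yaml -- illustrative keys, not the
# actual file. The point: mutation boundaries are data, not prose, so an
# agent can parse exactly what it may touch.
mutable:
  optimizer: [sgd, momentum, adam]
  learning_rate: {min: 1.0e-4, max: 1.0, scale: log}
  hidden_units: [16, 32, 64, 128]
frozen:
  - harness/            # SHA256-checked, never edited by an agent
  - data/generator.py   # the benchmark itself is off-limits
```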

hinton-problems is the new one. Same operating model, different scope. Instead of one challenge with 36 experiments, it's 53 separate problems, each with its own folder, its own paper, its own GIF. Here the methodology is asked to reproduce a body of literature, not to run experiments inside one harness. Different shape, same machinery.

The first post on this site was the philosophical one. This is the implementation report.

What “Hinton’s problems” means

I picked the name carefully. Most of the methods in this repo are not Hinton's papers. Predictive coding is Rao and Ballard. Equilibrium propagation is Scellier and Bengio. Target propagation comes from Bengio's group. Forward-Forward is Hinton (2022) and GLOM is Hinton (2021), but most of what's in the catalog isn't his.

What Hinton has done for forty years is point at the right questions. How do you learn without backprop. How do you assign credit using only signals that are locally available to a neuron. How do you train at brain-like energy budgets. Why does a network with the wrong inductive bias take exponentially more compute. The methods are by many people. The agenda is his.

The problems are what fall out when you take that agenda seriously and try to make it concrete. Encoder networks where the bottleneck has to discover a binary code. Family trees where distributed representations have to encode relational structure. Shifter networks where two populations of binary units have to bind across a topographic shift. Wake-sleep on a Helmholtz machine. Boltzmann machines on parity. Forward-Forward on recurrent MNIST. None of these are state of the art. All of them are the smallest problems that expose the underlying capability, or the lack of it. Like sparse parity in the first post, they’re the drosophila.

The 27/25/1 split is honest reporting. Where a v1 implementation reproduces the paper’s claim, it says so. Where the algorithm works but a paper-config gap remains — different hidden-unit count, contrastive divergence instead of simulated annealing, modern hardware constants — it says partial and documents the gap. Where the claim doesn’t replicate at all, it says no and explains the three causes that account for it. Cherry-picking the wins and hiding the rest is the move I most want this kind of work to not be.

The SPEC is Issue #1

The contract between me and the agents is a single GitHub issue.

hinton-problems#1 is titled “Spec: minimum implementation requirements for stub problems (v1).” It lists the required files, the eight README sections, the reproducibility rules, the acceptance checklist. Every per-stub issue links back to it. Every PR is reviewable against the same checklist. A reviewer — human or agent — runs through eight checkboxes and either merges or sends it back.

The issue body opens with “this catalog has 53 problem stubs and one worked example. To dispatch parallel agents on the remaining stubs, we need a uniform shape.” The shape is derived from two existing worked examples. After that it’s tables and bullet points and a checklist. It signs off _agent-0bserver07 (Claude Code) on behalf of Yad_.

The agent wrote the SPEC. I reviewed it. I edited it. Yaroslav commented on it. Then the same agent — and others like it — implemented against it.

This is what I meant in Chimera when I said the SPEC is the DSL. The grammar of allowed work is encoded in those eight sections plus the checklist. An agent given that issue and a paper citation has a tight enough constraint to produce something a reviewer can validate quickly. It cannot wander into “I built a whole Helmholtz machine library.” It has to put the dataset in <slug>.py, the README in the eight sections, the GIF at the top, the deviations honestly listed, at least one open question. If those things aren’t there, the PR doesn’t pass.
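The checklist is mechanical enough to be almost a program. Here is a hypothetical verifier, the shape of one rather than anything that exists in the repo; the file names and section conventions are my assumptions:

```python
# Hypothetical sketch of the acceptance checklist as a verifier script.
# File names and section headings here are illustrative assumptions, not
# the actual SPEC; the point is every check is mechanically decidable.
from pathlib import Path

REQUIRED_SECTIONS = 8  # the SPEC's eight README sections

def check_stub(folder: str) -> list[str]:
    p = Path(folder)
    slug = p.name
    failures = []
    if not (p / f"{slug}.py").exists():
        failures.append(f"missing {slug}.py (dataset, model, training loop)")
    readme = p / "README.md"
    if not readme.exists():
        failures.append("missing README.md")
    else:
        text = readme.read_text()
        if text.count("\n## ") < REQUIRED_SECTIONS:  # assumes H2 headings
            failures.append("README does not have the eight sections")
        if "Deviations" not in text:
            failures.append("deviations from the paper not listed")
        if "Open questions" not in text:
            failures.append("no open question recorded")
    if not list(p.glob("*.gif")):
        failures.append("no learning-dynamics GIF")
    return failures

if __name__ == "__main__":
    import sys
    for f in check_stub(sys.argv[1]):
        print("FAIL:", f)
```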

I keep coming back to the program synthesis frame because it keeps mapping. The SPEC is the oracle. The repo layout is the DSL. The agent is the synthesizer. The acceptance checklist is the verifier. The CEGIS loop is the PR review cycle. The mapping is literal. I’m working through the formal version in an ICLR workshop draft on composable primitives for coding agents; this post is the lighter rendering of the same argument.

The collaboration runs on PRs

Yaroslav left a comment on issue #1 asking for a specific change to the catalog tables.

PR #47 ships with a body that opens “Closes Yaroslav’s literal ask in spec issue #1” and links the comment. PR #44 swaps MkDocs out for mdBook because Yaroslav had asked for mdBook originally. PR #43 sets up the docs site, separate from SutroYaro’s. Same author handle on every PR, same sign-off line at the bottom: _agent-0bserver07 (Claude Code) on behalf of Yad_.

Yaroslav and I are not on a call when this happens. He’s working on Sutro and the sister repo sutro-problems. I’m reviewing what the agent shipped overnight. Sometimes he leaves a comment. Sometimes he opens an issue. Sometimes he merges something on his side that I read in the morning. The substrate that lets us cooperate is gh issue, gh pr, gh pr review. The cadence is asynchronous and it’s mostly text.

Two things make this scale. The first is the SPEC, because once the contract is fixed I don’t have to re-explain the shape of the work for every PR. The second is bin/tg-sync and the Google Docs sync that lives on the SutroYaro side. Telegram is where the Sutro Group talks. Google Docs is where Yaroslav writes long-form. Both feed back into the repo as committed text the agent can read. A discussion that happens at 11pm in Telegram becomes context the next morning’s agent uses when it picks up the next issue. The agent doesn’t have to be in the room when the conversation happens. It reads the transcript.
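The internals of bin/tg-sync aren't the point; the pattern is. Take a chat export, flatten it to dated text, commit it. A hypothetical version, assuming Telegram Desktop's JSON export format rather than whatever the real script consumes:

```python
# Hypothetical sketch of a tg-sync-style step, not the actual bin/tg-sync.
# Assumes Telegram Desktop's JSON export (result.json) and flattens the
# messages into markdown the next morning's agent can read.
import json
from pathlib import Path

def flatten(msg) -> str:
    text = msg.get("text", "")
    if isinstance(text, list):   # exports mix plain strings and entity dicts
        text = "".join(t if isinstance(t, str) else t.get("text", "")
                       for t in text)
    return f"- **{msg.get('from', '?')}** ({msg.get('date', '')}): {text}"

export = json.loads(Path("result.json").read_text())
lines = [flatten(m) for m in export["messages"] if m.get("type") == "message"]
Path("notes").mkdir(exist_ok=True)
Path("notes/telegram.md").write_text("\n".join(lines) + "\n")
print(f"synced {len(lines)} messages; ready to commit")
```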

The first Sutro Yaro post said: agents need structure to produce research instead of noise. The structure I described there was the harness and the search space. Half a year later, the structure I find myself relying on most is the issue tracker.

The wave is the unit of work

Phase 2 of SutroYaro was 17 Claude Code agents dispatched in parallel. Each got a different approach to sparse parity, the shared module APIs, three test configs, and a findings template. All 17 finished. Times ran from 2.5 minutes to 38 minutes. I wrote a script to fan them out and a script to merge them back.
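That fan-out script has since been superseded (agent-teams, below), but its shape is worth recording. This is a hedged reconstruction, not the original: it assumes Claude Code's non-interactive print mode, claude -p, and one prompt file per approach.

```python
# Hedged reconstruction of the Phase 2 fan-out, not the actual script.
# Assumes Claude Code's non-interactive print mode (`claude -p`) and one
# prompt file per approach; each agent runs in its own working directory.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

APPROACHES = sorted(Path("prompts").glob("*.md"))  # 17 prompt files

def dispatch(prompt_file: Path) -> tuple[str, int]:
    workdir = Path("runs") / prompt_file.stem
    workdir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        ["claude", "-p", prompt_file.read_text()],
        cwd=workdir, capture_output=True, text=True, timeout=3600,
    )
    (workdir / "transcript.txt").write_text(result.stdout)
    return prompt_file.stem, result.returncode

with ThreadPoolExecutor(max_workers=17) as pool:
    for name, code in pool.map(dispatch, APPROACHES):
        print(f"{name}: {'ok' if code == 0 else 'failed'}")
```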

For hinton-problems the wave is bigger and the orchestration is different. There are 53 stubs and a worked example. Each stub is a paper. Each paper is its own slug, its own folder, its own PR. PRs #32 through #41 are the bulk-implementation wave that finished today. The work for each stub is independent by construction — one folder per problem, no shared state — which is the precondition for parallelizing it.

This is where Claude Code’s agent-teams feature lands. It’s experimental, off by default, gated behind CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, and requires v2.1.32 or later. The model is one lead session and N teammates. Each teammate is a full independent Claude Code instance with its own context window. They share a task list and a mailbox. The lead doesn’t have to summarize back to me — teammates can talk to each other, claim work off the list, and notify the lead when they go idle.

The doc is honest about the cost: “Agent teams use significantly more tokens than a single session. […] For research, review, and new feature work, the extra tokens are usually worthwhile. For routine tasks, a single session is more cost-effective.”

What changes when this becomes a primitive instead of a script: the dispatch is in the tool, the messaging is in the tool, the task claiming with file locks is in the tool, the cleanup is in the tool. What I used to hand-roll for SutroYaro Phase 2 — fan out, collect, merge — is the default behavior now. I still write the SPEC. I still review the PRs. I still own the merge. The wave gets cheaper to launch, which means the question shifts from “is dispatching 17 agents worth the engineering” to “what’s the next problem big enough to deserve a wave.”
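One of those primitives, task claiming with file locks, is worth seeing on its own. A minimal sketch of the pattern, mine rather than Claude Code's implementation: an atomic create, so exactly one teammate can own a task.

```python
# Minimal sketch of task claiming with file locks -- the pattern, not
# Claude Code's implementation. O_CREAT|O_EXCL makes the claim atomic:
# exactly one worker can create the lock file, so no task runs twice.
import os

def claim(task_id: str, worker: str, lockdir: str = "locks") -> bool:
    os.makedirs(lockdir, exist_ok=True)
    path = os.path.join(lockdir, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                     # another teammate got there first
    with os.fdopen(fd, "w") as f:
        f.write(worker)                  # record who owns the task
    return True

tasks = ["boltzmann-parity", "wake-sleep", "forward-forward-mnist"]
mine = [t for t in tasks if claim(t, worker="teammate-3")]
print("claimed:", mine)
```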

The session that produced hinton-problems left these counts. One TeamCreate. 53 inter-teammate messages. 62 subagent dispatches. 191 bash calls. 15 PRs opened, 6 merged, 18 issues created, 24 git pushes. All from one Claude Code session running against the SPEC.

I don’t want to overclaim. agent-teams doesn’t replace any of the load-bearing work. The harness still has to be locked. The SPEC issue still has to be precise. The acceptance checklist still has to be unambiguous. The collaboration substrate is still GitHub. agent-teams gives me the dispatcher, not the science.

What this proves

53 stubs. One day. 27 full reproductions, 25 partial, 1 honest no.

The thing this is evidence for is not “agents can write code.” That’s been true for a year. The thing this is evidence for is that a SPEC-driven, PR-shaped, wave-dispatched workflow can absorb a long tail of small reproducible tasks at a rate that makes literature reproduction tractable for a one-person research project. The 53 papers span 41 years of Hinton’s career. Reading and reproducing 53 papers used to be a PhD-scale undertaking. The PR rate this week says it’s something else now.

What this is not evidence for: that the implementations are state of the art, that they’d hold up to peer review at NeurIPS, that the partial reproductions are close enough to count as positive results. They’re v1. The README for each one says so. The next pass — already filed as future issues — adds the energy metric (ByteDMD), tightens the failed-replication analysis, and pushes the partials toward full reproduction at paper scale. v1 was correctness and visualization. v1.5 is paper-scale parity. v2 is energy.

The methodology has a known failure mode I want to name. Issue #1 is the SPEC, and the SPEC was written by an agent. If the SPEC has a blind spot, every PR will inherit it. I’ve already seen this once: the v1 SPEC didn’t require energy measurement, so 53 stubs got built without it, and now it’s a v2 follow-up across all of them. The cost is real. The fix is to keep treating the SPEC as a living artifact and to keep the human in the loop on its revisions.

Where this leaves me

The pattern keeps holding. Build a small set of primitives. Write the SPEC as files in a repo. Let agents work against the SPEC in waves. Merge through GitHub. Don’t let the agent grade itself.

Chimera decomposes coding agents into primitives. SutroYaro gives them a research protocol. hinton-problems is what happens when you point that machinery at someone else’s published agenda and let it run for a while. Same loop, different target.

I keep waiting for this to stop working. So far it hasn’t.

If you want to look at the catalog, the site is at cybertronai.github.io/hinton-problems. The full results table is in RESULTS.md. Every stub has its own README with the deviations from the original procedure listed honestly. Every PR has its sign-off line at the bottom. None of them say I wrote the code. I didn’t write a line of it.