Sutro Yaro: Agent-Driven Research on Energy-Efficient Learning
How a study group in San Francisco turned coding agents into research agents by going back to 1960s AI problems with modern tools
My friend Yaroslav Bulatov runs a study group called the Sutro Group out of South Park Commons in San Francisco. They meet on Mondays and work on energy-efficient AI training. Yaroslav has been in ML research for a long time, working on large-scale training, optimization methods, and gradient computation tools. He’s the kind of person who picks a problem because it’s the right problem, not because it’s fashionable.
The group’s thesis, from the project’s CONTEXT.md:
Go back to 1960s-era AI problems and reinvent learning algorithms using modern tools (AI agents, compute), with energy efficiency as the optimization target.
That thesis is what drew me in. It connects to something I’ve been circling around since the building-erdos-navigator and building-bourbaki work: coding agents as research agents.
SutroYaro is the shared workspace. It’s open source. The docs site has the full experiment log.
The drosophila of learning tasks
The group’s field guide describes sparse parity this way:
For energy-efficiency research, it is what Drosophila melanogaster is to genetics: small enough to iterate fast, structured enough to reveal real phenomena.
XOR is the problem Minsky and Papert used in Perceptrons (1969) to show what single-layer networks can’t learn, a result often credited with helping trigger the first AI winter. Decades later, it’s still the simplest problem that exposes what learning algorithms can and can’t do.
You have 20 bits. Three of them are the secret. The other 17 are noise. The output is the XOR of the three secret bits. Your job is to figure out which three. It’s simple to state but hard to learn with standard methods. Parity has no first-order signal. Looking at any individual bit tells you nothing about the output. Only the interaction of all three secret bits together produces information.
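The no-first-order-signal claim is easy to check numerically. A minimal sketch, with arbitrary secret indices of my own choosing (not the group's):

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS, SECRET = 20, [3, 7, 12]          # arbitrary illustrative secret

X = rng.integers(0, 2, size=(4096, N_BITS))
y = X[:, SECRET].sum(axis=1) % 2          # label = XOR of the secret bits

# No first-order signal: every bit, secret or not, agrees with the
# label about half the time, so single-bit statistics reveal nothing.
agreement = [(X[:, i] == y).mean() for i in range(N_BITS)]
```

Only the joint statistics of all three secret bits carry information, which is exactly what makes the problem hostile to greedy, gradient-style learners.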
This is the kind of problem Yaroslav loves. Small enough to solve in under a second on a laptop. Fast to iterate. Easy to scale difficulty by adding noise bits. But it exposes a real gap in how neural networks learn.
SGD on a neural net can solve it, but it takes 120 milliseconds and 18.7 joules on a GPU. GF(2) Gaussian elimination solves it in 509 microseconds. The Kushilevitz-Mansour algorithm does it in 1-6 milliseconds at 144 millijoules. A 130x energy gap between the generic approach and the one that knows what it’s looking at.
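The numbers above are the group's; the sketch below is only my illustration of why GF(2) elimination works at all, assuming noise-free labels. Parity is linear over the binary field, so the secret mask w is the solution of X·w ≡ y (mod 2):

```python
import numpy as np

def solve_parity_gf2(X, y):
    """Recover the secret mask w with X @ w = y (mod 2) by Gaussian
    elimination over GF(2). Assumes noise-free labels and that X has
    full column rank (true w.h.p. for a few hundred random samples)."""
    n = X.shape[1]
    A = np.concatenate([X, y[:, None]], axis=1).astype(np.uint8) % 2
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, len(A)) if A[r, col]), None)
        if pivot is None:
            continue                      # dependent column, skip it
        A[[row, pivot]] = A[[pivot, row]]
        for r in range(len(A)):
            if r != row and A[r, col]:
                A[r] ^= A[row]            # XOR is addition in GF(2)
        row += 1
    # In reduced form each pivot row has a single 1 in the first n
    # columns; the augmented column there is the corresponding w entry.
    w = np.zeros(n, dtype=np.uint8)
    for r in range(row):
        cols = np.flatnonzero(A[r, :n])
        if len(cols) == 1:
            w[cols[0]] = A[r, n]
    return w

# Demo on a synthetic instance (secret indices chosen arbitrarily)
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(256, 20))
y = X[:, [3, 7, 12]].sum(axis=1) % 2
w = solve_parity_gf2(X, y)
```

The nonlinearity that defeats gradient methods disappears once you work in the right algebra, which is the whole trick.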
All four local learning rules they tested (Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation) failed at chance level. They can’t detect higher-order interactions at all. That tells you something about what those learning rules are actually doing, and what they aren’t.
Sparse parity isn’t the only challenge. The group also runs Sparse Sum (additive patterns, first-order signal, SGD solves it in one epoch) and Sparse AND (logical conjunction with exponential class imbalance). Each one tests a different property of learning algorithms. Parity tests k-th order interaction detection. Sum tests whether the infrastructure generalizes to linear problems. AND tests behavior under class imbalance. Together they form a diagnostic suite. The next challenge is nanoGPT: energy-efficient training of Karpathy’s minimal language model.
Agents as researchers
The Sutro Group runs experiments using coding agents. Claude Code, Gemini CLI, Codex. Each researcher (human or agent) runs experiments independently, then merges results via pull requests. The system has a locked evaluation harness so nobody, human or agent, can game the metrics. SHA256 verification on the harness file. A circuit breaker that halts after five invalid experiments. PID locks to prevent concurrent runs.
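A minimal sketch of what those two safeguards can look like. The SHA256 check and the five-experiment limit come from the post; the code itself is my own illustration, not SutroYaro's:

```python
import hashlib
from pathlib import Path

def harness_is_intact(path, expected_sha256):
    """Refuse to score anything if the evaluation harness file changed.
    The expected digest is pinned once, when the harness is locked."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest == expected_sha256

class CircuitBreaker:
    """Halt the run after too many invalid experiments (five, per the post)."""
    def __init__(self, limit=5):
        self.limit, self.invalid = limit, 0
    def record(self, valid):
        if not valid:
            self.invalid += 1
        if self.invalid >= self.limit:
            raise RuntimeError("circuit breaker tripped: halting the run")
```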
This is where it connects to what I’ve been building. The building-erdos-navigator gave an agent a structured environment for math problems. building-bourbaki gave it computation and verification tools. SutroYaro does something different: it gives agents a research protocol. The agents aren’t solving a single problem. They’re running experiments within a bounded search space, logging results in machine-readable format, and accumulating findings that other agents and humans can build on.
34 experiments. 12 wins, 15 losses, 5 inconclusive. That’s a real research log, not a demo.
The program synthesis connection
I keep coming back to program synthesis because the pattern keeps showing up.
In synthesis, you have a specification (what the program should do), a search space (the set of possible programs), and a verifier (does this program meet the spec). The synthesizer generates candidates, the verifier checks them, and the loop iterates until something passes.
Research agents follow the same structure. The specification is the research question: which three bits are the secret, and what’s the fastest way to find them? The search space is the set of possible approaches: different optimizers, different algorithms, different learning rules. The verifier is the evaluation harness: does this approach actually solve the problem, and how fast?
But there’s a middle step that makes it more interesting. In program synthesis, people build domain-specific languages. You don’t search over all possible programs. You design a restricted language that can express the kinds of programs you’re looking for, and you search over that. The restriction makes the search tractable.
The agents in SutroYaro do something similar. They don’t have free rein to try anything. The search space is bounded by search_space.yaml. The evaluation harness is locked. The experiment protocol (LAB.md) defines what counts as a valid experiment. These constraints are the DSL. They make the research tractable by preventing the agent from wandering into unproductive territory.
So the loop looks like this: the agent synthesizes a hypothesis (maybe a new optimizer configuration, maybe a different algorithm), implements it within the constrained search space, runs it against the locked harness, and logs the result. The log feeds back into future hypotheses. Each experiment narrows the space.
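That loop fits in a few lines. Everything below is a toy stand-in, assuming a small enumerable space; the search space, the harness, and the names are illustrative, not SutroYaro's actual files:

```python
import json

# Toy bounded search space (stand-in for search_space.yaml) and a
# locked harness (stand-in for harness.py). Names are illustrative.
SEARCH_SPACE = [{"lr": lr, "optimizer": opt}
                for lr in (0.01, 0.1, 1.0) for opt in ("sgd", "adam")]

def harness(config):
    """Locked metric: pretend low-lr 'adam' is what solves the task."""
    return {"solved": config["optimizer"] == "adam" and config["lr"] <= 0.1}

def research_loop(log):
    """One synthesize -> verify -> log cycle. The log is the shared
    state that narrows the space for future cycles."""
    tried = {json.dumps(e["config"], sort_keys=True) for e in log}
    for config in SEARCH_SPACE:                   # bounded, enumerable
        if json.dumps(config, sort_keys=True) in tried:
            continue
        result = harness(config)                  # locked verifier
        log.append({"config": config, **result})  # machine-readable log
        return log[-1]
    return None                                   # space exhausted
```

The point of the toy is the shape, not the scale: the agent never evaluates anything outside the declared space, and every result lands in a log the next cycle can read.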
That’s program synthesis applied to research. The “program” is an experimental approach. The “specification” is the evaluation criteria. The “DSL” is the bounded search space and protocol. The “verifier” is the harness.
What 34 experiments found
The Sutro Group’s results on sparse parity tell a specific story about the gap between generic and structure-aware methods.
Phase 1 started with a broken SGD baseline at 54% accuracy. They fixed it, got solve time down to 0.12 seconds, then optimized memory access patterns with ARD (Average Reuse Distance) techniques. Standard ML work.
Phase 2 is where the 1960s thesis paid off. They tested approaches from completely different fields. GF(2) Gaussian elimination treats parity as a linear problem over the binary field and solves it in 509 microseconds. That’s a 240x speedup over SGD. The Kushilevitz-Mansour algorithm measures bit influence to identify the secret bits. SMT backtracking uses constraint satisfaction with pruning. Genetic programming evolved an exact symbolic solution with zero parameters. RL learned to read exactly k=3 bits per prediction, the theoretical minimum.
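The full Kushilevitz-Mansour algorithm estimates Fourier coefficients; the sketch below shows only the bit-influence idea behind it (flip one bit, measure how often the output flips), which is already enough to recover the secret set on noise-free parity. The secret indices are my own arbitrary choice:

```python
import numpy as np

def bit_influence(f, n_bits, n_samples=2000, seed=0):
    """Estimate each bit's influence on f: the probability that
    flipping that bit flips the output. For pure parity this is 1 on
    the secret bits and 0 everywhere else. (Influence only, not the
    full KM Fourier-coefficient machinery.)"""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    base = f(X)
    infl = np.empty(n_bits)
    for i in range(n_bits):
        X_flip = X.copy()
        X_flip[:, i] ^= 1                 # flip bit i everywhere
        infl[i] = (f(X_flip) != base).mean()
    return infl

secret = [3, 7, 12]                       # arbitrary illustrative secret
parity = lambda X: X[:, secret].sum(axis=1) % 2
infl = bit_influence(parity, 20)
```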
These aren’t neural network methods. They’re algorithms that understand the structure of the problem. And that’s the group’s whole point. The energy measurements make it concrete: Kushilevitz-Mansour at 144 millijoules vs SGD at 18.7 joules. When the algorithm matches the problem structure, the efficiency gap is two orders of magnitude. Reinventing the learning algorithm, not optimizing the existing one, is where the wins are.
What makes a research agent different
I’ve built three agent environments now. Each one gives agents a different kind of work.
The building-erdos-navigator gives an agent a database of 1,179 math problems, a search API, and skills that tell it how to explore. The agent’s job is navigation: find problems, check what’s been tried, identify tractable targets. It’s an explorer with a map.
building-bourbaki gives an agent SymPy, Lean 4, OEIS, and arXiv. The agent’s job is computation and verification. It proposes a proof step, computes it symbolically, formalizes it in Lean, and checks whether it passes. It’s a mathematician with tools.
SutroYaro gives an agent a locked harness, a bounded search space, a hypothesis queue, and a shared knowledge base. The agent’s job is experimentation. It picks a hypothesis, designs a single-variable experiment, runs it, classifies the result, logs it, and moves on. It’s a researcher with a lab protocol.
The difference matters. A coding agent dropped into a blank repo will write code. It might write good code. But it will also wander. It’ll try things that were already tried. It’ll optimize the wrong metric. It’ll rewrite the evaluation harness to make its numbers look better. (This actually happened. LAB.md rule #9 exists because an agent rewrote the ARD measurement code instead of improving the actual training loop. The rule now says: agents cannot modify measurement code. Period.)
Each of these projects works because it builds specific primitives before the agent starts. A database. A compute tool. A harness. A protocol. The agent operates within those primitives, not in open space. I wrote in missing-toolbox-for-agent-builders that designing agent environments is world building. You define the rules, the constraints, the affordances. SutroYaro’s world has a very specific shape: search_space.yaml says what you can change, harness.py says how you’re measured, DISCOVERIES.md says what’s already known, AGENT.md says how to behave.
The AGENT.md protocol is worth looking at. It tells the agent: “You are autonomous. Do not pause to ask the human. The human may be asleep.” It’s designed for overnight runs. The target is 240 experiments over 8 hours. Each cycle gets fresh context but reads the accumulated file state from previous cycles. If one cycle crashes, the next picks up from the files. The circuit breaker trips after five invalid experiments.
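A minimal sketch of that cycle structure, assuming a JSONL log as the shared file state. The circuit-breaker threshold comes from the post; the file name and everything else here is my own illustration:

```python
import json
from pathlib import Path

LOG = Path("experiments.jsonl")   # file state that survives across cycles

def run_cycle(run_experiment):
    """One agent cycle: fresh context, but it resumes from the shared
    log, and the whole run halts once five invalid experiments have
    accumulated (the circuit breaker described in the post)."""
    history = []
    if LOG.exists():
        history = [json.loads(line) for line in LOG.read_text().splitlines()]
    if sum(1 for e in history if e["status"] == "invalid") >= 5:
        raise SystemExit("circuit breaker: too many invalid experiments")
    entry = run_experiment(history)           # agent does one experiment
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")     # handoff to the next cycle
    return entry
```

Because the state lives in files rather than in context, a crashed cycle loses nothing: the next cycle reads the same log and keeps going.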
Phase 2 dispatched 17 agents in parallel from Claude Code. Completion times ranged from 2.5 minutes (mutual information) to 38 minutes (the pebble game agent, which exhaustively sampled 5,758 topological orderings of a 15-node computation DAG). All 17 produced working experiment code, results JSON, and findings documentation.
This is what I mean by building primitives. The harness is a primitive. The search space is a primitive. The experiment template is a primitive. The log format is a primitive. You build these things first, then the agent operates within them. The alternative is letting the agent figure out its own methodology, which is a random walk with good vocabulary.
Coding agents as research agents
Other groups are working on this too. Sakana AI’s AI Scientist generates papers autonomously. FutureHouse’s Robin runs wet-lab biology experiments. Weco AI’s AIDE automates ML experimentation.
SutroYaro is smaller than any of those. It doesn’t generate papers. It doesn’t run physical experiments. It doesn’t have a custom model. It’s a bash loop, a locked harness, and a YAML file that says what you’re allowed to change. The simplicity is the point.
The pattern across all of this work is the same: agents need structure to produce research instead of noise. The structure can be a formal protocol or an informal convention. Enforced by code or by prompt. But it has to exist. An agent that can try anything will try everything, and most of everything is a waste.
The nanoGPT phase will test whether this transfers. Sparse parity is the drosophila. Language model training is the organism you actually care about. If the protocol, the agents, and the accumulated knowledge carry over, that’s a methodology, not a one-off result.
15 of those 34 experiments were losses. They’re in the log. That’s research.