Budding · Planted May 10, 2026 · 22 min read

Building Chimera: A Coding Agent Framework, Built by Coding Agents

How building the same agent three times led me to decompose coding agents into composable primitives, and why I didn't write most of the code

I didn’t write most of this code.

Chimera is about 144,000 lines of Python at v0.6.0. 3,600 tests, 8 architectural layers, 47 tools, 8 reasoning loops, 7 LLM providers plus a compatible-mode router for vLLM, SGLang, DeepSeek, and Grok. It ships 4 agent presets that reproduce SWE-Agent, Aider, Cline, and Codex, plus seven sibling CLIs — otter, ferret, shrew, weasel, mink, stoat, badger — each a different agent shape built from the same primitives. I released the alpha in March and v0.6.0 this week.

My workflow for building it: write a spec in the morning. Describe the module, the API shape, the constraints, what the tests should verify. Hand it to the agent. Go do something else. Come back the next day. Review what happened. Make corrections. Write the next spec. Not hour-to-hour. Day-to-day.

I’ve been trying to make computers write code since 2016. Hopper was an early prototype. The Desc2Code work with OpenAI was neural code generation before the LLM era. Over the years I went deep on the formal side of program synthesis — DSLs, CEGIS loops, oracles, verifiers — the ideas that make synthesis actually work. Chimera is where all of that lands.

I wrote about this shift in coding-agents-made-me-better-programmer: “I’m mostly programming in English now.” Chimera is the most extreme version of that. About 144,000 lines of Python and I typed a fraction of them directly. The rest came from agents working off specs I wrote. A framework for building coding agents, built by coding agents. That’s the proof of concept right there.

Every agent I built was the same agent

I didn’t see this until I was three projects deep.

The Erdős Navigator gives a coding agent a database of 1,179 math problems, a REST API, and skills that tell it how to explore. Bourbaki gives a coding agent SymPy, Lean 4, OEIS, and arXiv. SutroYaro gives coding agents a locked evaluation harness, a bounded search space, and a research protocol.

Different domains. Different tools. Same machinery underneath.

Each one had an LLM backend. Each one had a set of tools. Each one had a reasoning loop that decided what to do next. Each one had an environment where code actually ran. I kept rebuilding these four components from scratch every time. The provider. The tools. The loop. The environment.

Then I looked at the coding agents I was actually using day to day. Claude Code. Codex. Aider. Cursor. Same four components. Just configured differently.

SWE-Agent is built around a specific Agent-Computer Interface with a syntax-checking editor and file navigator, running in Docker. Aider is a lint feedback loop with tree-sitter parsing. Cline is a plan-then-act loop with token budgeting. Codex is a kernel-sandboxed environment (Seatbelt on macOS, Landlock on Linux) in Rust. Each one made specific choices about those four primitives and hardcoded them into a monolith.

Once you see the pattern, every coding agent starts looking like a configuration file.

And the thing I keep arriving at: coding agents aren’t just one type of AI agent. They’re the base type. A research agent is a coding agent that reads papers and runs experiments instead of editing source files. A math agent is a coding agent that calls SymPy and Lean instead of running tests. A DevOps agent is a coding agent that touches infrastructure instead of application code. The loop is the same. The tools change. I proved this to myself three times without realizing it. Erdős Navigator, Bourbaki, SutroYaro. Each one started as a coding agent. I gave it different tools and a different environment and it became a different kind of agent.

The infrastructure for coding agents is the infrastructure for all agents. Get it right for coding and you get it right for everything else.

The monolith problem

Every major coding agent is a monolith.

Claude Code is closed-source. I use it every day and I can’t see how it works internally. Codex CLI is Rust, tightly coupled to OpenAI’s sandbox stack and their GPT-5.4 models. Aider is 50,000+ lines of Python organized around one specific workflow. If you want to understand how any of them work, you reverse-engineer them. If you want to build your own agent, you start from scratch.

This is what every field looks like before decomposition happens. Custom stacks everywhere. No shared vocabulary. Results you can’t reproduce because the infrastructure isn’t shared. I watched this happen in deep learning before frameworks existed. Every lab had their own training loops, their own optimizer code, their own data loading. Then frameworks broke the problem into composable pieces — layers, optimizers, callbacks — and labs stopped rewriting training loops.

Coding agents are at that same point. There’s no shared vocabulary for what a coding agent actually is. No composable abstractions. When someone publishes “77.8% on SWE-bench,” you can’t verify it without running their exact proprietary scaffold. When a startup builds an internal coding agent, they rebuild the file editing, the git integration, the permission system, the context management, all of it, from the ground up.

Decomposing the agent

Chimera decomposes coding agents into composable primitives. Providers, tools, loops, and environments.

The core equation:

Agent = Provider + Tools + Loop + Environment
flowchart LR
  subgraph AGENT["Coding Agent"]
    direction TB
    P["<b>Provider</b><br/>LLM backend<br/>(Anthropic, OpenAI,<br/>Google, Ollama, ...)"]
    T["<b>Tools</b><br/>What the agent can do<br/>(read, write, bash,<br/>search, git, ...)"]
    L["<b>Loop</b><br/>How the agent reasons<br/>(ReAct, PlanAndExecute,<br/>Reflexion, ToT)"]
    E["<b>Environment</b><br/>Where code runs<br/>(Local, Docker, Git,<br/>Remote, Sandbox)"]
  end
  P -.composes.-> AGENT_OUT(( ))
  T -.composes.-> AGENT_OUT
  L -.composes.-> AGENT_OUT
  E -.composes.-> AGENT_OUT
  AGENT_OUT --> RESULT["Agent behavior"]
  classDef prim fill:#f4f0e8,stroke:#333,stroke-width:1px,color:#111;
  classDef out fill:#eef,stroke:#333,stroke-width:1px,color:#111;
  class P,T,L,E prim;
  class RESULT out;

A provider is which LLM you’re talking to. Seven supported: Anthropic, OpenAI, Google, Ollama, Modal, OpenAI Responses API, and any OpenAI-compatible endpoint. Swap Claude for GPT-4 or a local model without touching anything else.

Tools are what the agent can do. 47 built-in (55 Tool classes):

  • File operations: read, write, edit, multi-edit, apply_patch, write_guard, notebook_edit, rollback
  • Execution: bash, test, verify, ipython, powershell
  • Code analysis: search, git, repo_map, import_graph, definition_lookup, codebase_index, embedding_index
  • Worktree management: worktree_tool, cron_tools
  • Web access: web_search, web_fetch, browser
  • Reasoning: think, delegate, ask_user, skill_tool

Add your own by implementing the tool interface.

Loops are how the agent reasons. Eight built-in. Four conceptual: ReAct (reason, act, observe), PlanAndExecute (plan first, execute steps), Reflexion (execute, reflect, improve), TreeOfThought (explore multiple reasoning paths). Four more for agent-specific patterns: RetryLoop, LintFeedbackLoop, PlanActLoop, Autonomous. The loop determines the cognitive pattern. A retry loop produces different behavior than a plan-then-act loop even with the same tools and the same model.

Environments are where code runs. Local filesystem, Docker containers, Git repositories, remote machines, cloud instances. The environment determines what “running the code” means and what isolation guarantees you get.
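To make the equation concrete, here is a minimal, self-contained sketch of the four primitives composing into an agent. None of these names are Chimera’s API: `ScriptedProvider`, `DictEnvironment`, `react_loop`, and `MiniAgent` are hypothetical stand-ins, and the provider replays canned actions instead of calling an LLM.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol

# Hypothetical stand-ins for illustration only; this is not Chimera's API.
Action = tuple[str, dict]  # (tool name, arguments)


class Provider(Protocol):
    def next_action(self, task: str, history: list[str]) -> Action | None: ...


class ScriptedProvider:
    """Replays canned actions in place of a real LLM backend."""

    def __init__(self, script: list[Action]) -> None:
        self._script = list(script)

    def next_action(self, task: str, history: list[str]) -> Action | None:
        return self._script.pop(0) if self._script else None


class DictEnvironment:
    """An in-memory 'filesystem'; a real one might be local disk or Docker."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}


Tool = Callable[[DictEnvironment, dict], str]


def write_tool(env: DictEnvironment, args: dict) -> str:
    env.files[args["path"]] = args["content"]
    return f"wrote {args['path']}"


def read_tool(env: DictEnvironment, args: dict) -> str:
    return env.files.get(args["path"], "<missing>")


def react_loop(provider, tools, env, task, max_steps=10) -> list[str]:
    """Act, observe, repeat until the provider stops or the budget runs out."""
    history: list[str] = []
    for _ in range(max_steps):
        action = provider.next_action(task, history)
        if action is None:
            break
        name, args = action
        history.append(tools[name](env, args))  # the observation
    return history


@dataclass
class MiniAgent:
    provider: Provider
    tools: dict[str, Tool]
    loop: Callable[..., list[str]]

    def run(self, task: str, env: DictEnvironment) -> list[str]:
        return self.loop(self.provider, self.tools, env, task)


agent = MiniAgent(
    provider=ScriptedProvider([
        ("write", {"path": "a.txt", "content": "hello"}),
        ("read", {"path": "a.txt"}),
    ]),
    tools={"write": write_tool, "read": read_tool},
    loop=react_loop,
)
env = DictEnvironment()
observations = agent.run("create a.txt", env)
print(observations)
```

Swapping any one argument (a different provider, tool set, loop function, or environment object) changes the agent without touching the other three, which is the point of the decomposition.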

These compose into an 8-layer stack:

flowchart TB
  L8["<b>Layer 8: CLI</b><br/>chimera code / synthesize / eval / review / ci-fix / fs"]
  L7["<b>Layer 7: Workflows</b><br/>CIFix · Review · Research · Migration · DocGen · TestGen"]
  L6["<b>Layer 6: Synthesis</b><br/>Trainer · Strategy · Spec · Architecture"]
  L5["<b>Layer 5: Evaluation</b><br/>Harness · Metrics · Benchmarks · Trace"]
  L4["<b>Layer 4: Agent</b><br/>Agent · Tools · Loops · Prompt · Context"]
  L3["<b>Layer 3: Provider</b><br/>Anthropic · OpenAI · Google · Ollama · Modal · OAI-compat"]
  L2["<b>Layer 2: Infrastructure</b><br/>Security · Permissions · Events · Sessions · MCP · LSP"]
  L1["<b>Layer 1: Environment</b><br/>Local · Docker · Git · Remote · Cloud · Sandbox"]
  L8 --> L7 --> L6 --> L5 --> L4 --> L3 --> L2 --> L1
  classDef top fill:#eef6ff,stroke:#333;
  classDef mid fill:#f4f0e8,stroke:#333;
  classDef bot fill:#eef8ee,stroke:#333;
  class L8,L7,L6 top;
  class L5,L4,L3 mid;
  class L2,L1 bot;

Each layer works independently. Use Layer 1 for a sandboxed environment without building an agent. Use Layer 3 to talk to an LLM without any tools. Use Layer 4 for a full agent without the synthesis framework. They compose upward but don’t require what’s above them.

Layer 2 is where a lot of the real complexity lives. The stuff that separates a demo from a tool you can actually run. Permission systems with risk-based access control. Security analysis (both rule-based and LLM-based). Secret detection so credentials don’t leak into context. Event-sourced sessions for persistence. Context compaction so the agent doesn’t choke on its own history. Ghost commits for undo without polluting your git log. These are the things every monolith has to build and every monolith buries inside its own codebase. Chimera makes them explicit and reusable.
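As one illustration of what “risk-based access control” can mean, here is a minimal rule-based sketch. It is not Chimera’s permission system: the `Risk` levels, the tool-to-risk mapping, and the destructive-command markers are all assumptions chosen for the example.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1     # read-only operations
    MEDIUM = 2  # writes and ordinary commands inside the workspace
    HIGH = 3    # destructive or unknown operations


# Tool names come from the post; the risk assigned to each is an assumption.
TOOL_RISK = {
    "read": Risk.LOW,
    "search": Risk.LOW,
    "edit": Risk.MEDIUM,
    "write": Risk.MEDIUM,
    "bash": Risk.MEDIUM,
}

# Hypothetical substrings that escalate a shell command to HIGH.
DESTRUCTIVE_MARKERS = ("rm -rf", "git push --force", "sudo ")


def assess(tool: str, args: dict) -> Risk:
    """Rule-based risk score; unknown tools are treated as high risk."""
    risk = TOOL_RISK.get(tool, Risk.HIGH)
    command = args.get("command", "")
    if any(marker in command for marker in DESTRUCTIVE_MARKERS):
        risk = Risk.HIGH
    return risk


def decide(tool: str, args: dict, auto_approve_up_to: Risk = Risk.MEDIUM) -> str:
    """Return 'allow' when the risk is within the threshold, else 'ask'."""
    return "allow" if assess(tool, args) <= auto_approve_up_to else "ask"


print(decide("read", {"path": "README.md"}))
print(decide("bash", {"command": "pytest -q"}))
print(decide("bash", {"command": "rm -rf build"}))
```

The useful property is that the threshold is a single parameter: a stricter session passes `auto_approve_up_to=Risk.LOW` and every write starts prompting, with no change to the tools themselves.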

In practice, three levels of usage:

import chimera

# Level 1: One-liner
result = chimera.synthesize("Build a REST API for tasks", tests="./tests/")

# Level 2: Configured
agent = chimera.Agent(
    provider=chimera.create_provider(model="claude-sonnet-4-20250514"),
    tools=list(chimera.DEFAULT_TOOLS),
    loop=chimera.ReAct(max_steps=50),
)
result = agent.run("Fix the failing test", env=chimera.LocalEnvironment("."))

# Level 3: Framework author
my_agent = agent  # any chimera.Agent works here; this reuses the Level 2 agent
trainer = chimera.Trainer(
    spec=chimera.Spec.from_tests("./tests/", "Build a task manager"),
    agent=my_agent,
    env=chimera.LocalEnvironment("."),
)
result = trainer.synthesize(strategy=chimera.TestConvergence(max_iterations=10))

Level 1 is for people who want to use it. Level 3 is for people who want to build on it. The same primitives serve both.

Four agents, one framework

If every coding agent is a combination of the same primitives, you should be able to recreate any of them by plugging in the right components. Chimera ships with 4 preset configurations that reproduce known agent architectures:

| Agent | Loop | Defining trait |
|---|---|---|
| SWE-Agent | Retry | Agent-Computer Interface, Docker sandbox |
| Aider | Lint feedback | Tree-sitter parsing, git integration |
| Cline | Plan-then-act | Token budgeting, approval prompts |
| Codex | ReAct | Kernel sandboxing, tight model coupling |

These aren’t wrappers around those tools. They’re independent implementations that reproduce the same architectural decisions using Chimera’s primitives. SWE-Agent’s architecture is a retry loop with minimal tools in a Docker environment. Build that in Chimera by specifying those three things. Then change one variable. What if SWE-Agent used a plan-then-act loop instead of a retry loop? What if Aider ran in Docker instead of the local filesystem? What if you gave Cline’s token budgeting to Codex’s tool suite?

Those become one-line experiments instead of fork-a-50K-line-repo experiments.
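A sketch of what “change one variable” looks like when an agent is plain configuration. `AgentConfig` and its field names are hypothetical, not Chimera’s preset format; the loop and environment values paraphrase the table above, and the tool tuples are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical preset record: one field per swappable primitive."""
    loop: str
    environment: str
    tools: tuple[str, ...]


PRESETS = {
    "swe-agent": AgentConfig(loop="retry", environment="docker",
                             tools=("edit", "bash", "search")),
    "aider": AgentConfig(loop="lint-feedback", environment="local",
                         tools=("edit", "git", "repo_map")),
}

# Each experiment is one line: copy a preset, override a single field.
aider_in_docker = replace(PRESETS["aider"], environment="docker")
swe_agent_plan_act = replace(PRESETS["swe-agent"], loop="plan-act")

print(aider_in_docker)
print(swe_agent_plan_act)
```

Because the records are frozen, the original presets stay untouched, so a baseline and its variant can be evaluated side by side.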

flowchart LR
  subgraph BEFORE["Monolith world"]
    direction TB
    B1["Fork SWE-Agent (~50K LOC)<br/>replace env<br/>re-test<br/>weeks of work"]:::pain
  end
  subgraph AFTER["Chimera world"]
    direction TB
    A1["Aider in Docker<br/>(env swap, 1 line)"]:::ok
    A2["SWE-Agent with Reflexion<br/>(loop swap, 1 line)"]:::ok
    A3["Codex with Cline-style<br/>token budgeting"]:::ok
    A4["Any combination from<br/>the 4 primitives"]:::ok
  end
  BEFORE ==>|decompose| AFTER
  classDef pain fill:#f8dcdc,stroke:#a33;
  classDef ok fill:#dff5df,stroke:#1a7;

Codex will always be better at being Codex than Chimera’s Codex preset. The value is in making architectural decisions visible and swappable. You can study what makes each agent work, modify one piece at a time, and find combinations nobody has tried. That makes agent architecture a research question you can actually run experiments on.

I extended this comparison to seven agents in a workshop draft on composable primitives for coding agents. Six of them ship exactly one loop. Chimera ships eight. Provider counts span from one (Codex, hardcoded to OpenAI) to twenty-two (OpenCode). The composition space, multiplied out, is roughly 7 × 47 × 8 × 5 × 10, about 130,000 combinations. Four are the named presets. Six are the factory-built agents. Seven are the shipped sibling CLIs (next section). The rest is unexplored.
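The estimate is a plain product, which is quick to check. The post does not label the five factors, so this sketch leaves them unlabeled too; summing the presets, factory-built agents, and sibling CLIs assumes they do not overlap.

```python
from math import prod

factors = [7, 47, 8, 5, 10]  # as quoted above; the individual factors are not named
total = prod(factors)

# Assumes the three groups are disjoint points in the composition space.
explored = 4 + 6 + 7  # presets + factory-built agents + sibling CLIs

print(total)  # 131600
print(f"explored share: {explored / total:.4%}")
```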

Beyond the 4 presets, Chimera ships a full CodingAgent assembly. It’s a production coding agent built from these same primitives, with system prompts derived from studying 6 real coding agents. Plan mode, read-before-write discipline, auto-continue when the agent stalls. None of this required changing the underlying primitives. It’s composition on top of the same providers, tools, loops, and environments.

Seven sibling CLIs

v0.6.0 ships seven coding-agent CLIs alongside the main chimera orchestrator. Each is a different agent shape built from the same primitive set.

| CLI | LoC | Shape |
|---|---|---|
| otter | 17,590 | claude-code-style. Snapshots, file-undo, declarative permission rules, ACP, MCP, plugins, share commands, PTY. |
| ferret | 10,264 | codex-style. apply / review / fork / mcp-server / bridge subcommands, OS sandbox, IDE bridge. |
| shrew | 8,574 | small-local-model tuned. Model profiles, output parsing, quality monitoring, skill injection. |
| weasel | 6,524 | programmatic harness. Four operating modes (REPL, print, RPC, SDK), Node JS/TS executor, JSON stream, @file expansion. |
| mink | 4,593 | team orchestrator. Cost gating, run management, hook event wiring. |
| stoat | 4,142 | hooks-disciplined. 18 lifecycle events, plan mode, shell mode, bracketed paste, keybindings. |
| badger | 3,375 | claw-code parity. Explicit parity matrix, claw-code slash palette. |

All seven share the same tools/, providers/, core/loops/, and agents/ modules. Per-CLI provider files are 191-635 lines of configuration and wiring on top of the shared chimera.providers package — they pick and surface providers, they don’t reimplement them.

Three of the seven are explicit reconstructions of monoliths I studied in the workshop draft. otter reconstructs Claude Code’s architectural shape. ferret reconstructs Codex’s subcommand surface. badger reconstructs claw-code’s slash palette and parity matrix. Built from Chimera primitives, no forking, no duplication of provider or tool implementations.

The other four fill design-space cells the existing monoliths leave open or only hint at. shrew takes Kimi-CLI’s local-model focus and adds explicit quality discipline. mink extends cc_source’s team-orchestration shape with cost gating. weasel is the programmatic and headless harness with four operating modes; none of the seven monoliths I studied has a clean instance of this. stoat is the hooks-disciplined workflow agent (18 lifecycle events, plan and shell modes), also unfilled in the field today.

One monolith has no Chimera reconstruction. pi-mono’s ethos is to reject MCP and extend via npm packages. Chimera is MCP-first. The two stances are architecturally opposed, so there’s nothing to reconstruct. That’s an honest gap, not a missing item.

The numbers underneath: Chimera at v0.6.0 is 144,000 lines of Python. 89,000 of them (62%) are shared primitive and infrastructure code — tools/, providers/, core/, agents/, compaction/, permissions/, mcp/, lsp/, eval/, hooks/, sessions/, plugins/. The seven CLI directories add up to 55,000 lines (38%) of shape-specific wiring, slash palettes, REPL surfaces, mode toggles, and CLI ergonomics. Producing equivalent shape-diversity inside any of the seven monoliths I studied would mean forking seven repositories. Chimera produces it through configuration.

Chimera isn’t another coding agent in the category of the seven I compare against. It’s the design space those agents occupy, factored into composable primitives, with seven shipped instantiations to prove the enumeration is real. Three reconstructions of the existing field. Two extensions of cells the closest monolith hints at. Two new cells nobody has built. The seven-by-seven mapping is the architectural argument made concrete.

That’s the bet. Not one more coding agent. The space those agents occupy, factored.

The synthesis underneath

Chimera’s deepest layer comes from program synthesis, not from the agent ecosystem.

The core verb is .synthesize(), not .generate() or .create(), because that’s what it actually is. The search engine changed from enumerative search to constraint solving to LLMs, but the structure is the same: specify what you want, search for a program that satisfies the spec, verify the result.

If you’ve spent time with synthesis research, the mapping is direct. In classical synthesis you have a DSL that constrains the search space, a synthesizer that proposes candidates, a verifier that checks them, and an oracle that provides the ground truth. In Chimera:

  • DSL → Tools + Environment. The tools and environment are the grammar. They define what the agent can express. A coding agent with file editing and bash in a Docker container is searching a different program space than one with browser automation and LSP on a local filesystem. The constraint is the design, the same way a DSL in FlashFill is deliberately impoverished to make synthesis tractable.
  • Synthesizer → Agent + Loop. The agent proposes candidates. The loop determines how: ReAct tries one thing at a time, TreeOfThought explores multiple paths, Reflexion learns from its own mistakes. These are search strategies over the program space.
  • Verifier → Test suite + Harness. Tests check whether the candidate satisfies the spec. The evaluation harness runs them and reports pass/fail. Same role as the verifier in a CEGIS loop.
  • Oracle → Spec. The spec defines what “correct” means. In classical synthesis the oracle might be a logical formula or input-output examples. In Chimera it’s usually tests, but it can be any success criteria the Trainer can evaluate.
flowchart LR
  subgraph CLASSICAL["Classical Synthesis (CEGIS)"]
    direction TB
    DSL["DSL<br/>(search space grammar)"]
    SYN["Synthesizer<br/>(enumerate / SMT / guided)"]
    VER["Verifier<br/>(logical check)"]
    ORA["Oracle<br/>(spec: I/O pairs, formula)"]
    DSL --> SYN --> VER --> ORA
    ORA -.counterexample.-> SYN
  end
  subgraph CHIMERA["Chimera Primitives"]
    direction TB
    TE["Tools + Environment<br/>(agent grammar)"]
    AL["Agent + Loop<br/>(ReAct, ToT, Reflexion)"]
    TH["Test Suite + Harness<br/>(pass/fail + trace)"]
    SP["Spec<br/>(tests, success criteria)"]
    TE --> AL --> TH --> SP
    SP -.failure category.-> AL
  end
  DSL ==>|grammar| TE
  SYN ==>|search| AL
  VER ==>|verify| TH
  ORA ==>|spec| SP
  classDef left fill:#f6ecec,stroke:#333;
  classDef right fill:#ecf2f6,stroke:#333;
  class DSL,SYN,VER,ORA left;
  class TE,AL,TH,SP right;

The CEGIS pattern shows up directly: the agent synthesizes a candidate, the tests verify it, failures become counterexamples that guide the next attempt. TestConvergence, the simplest strategy, is literally this loop — iterate until tests pass. CurriculumStrategy adds progressive difficulty, starting with easy requirements and adding harder ones. EnsembleStrategy runs multiple agents and takes majority vote. TreeSearch explores multiple solution paths and backtracks on failures.
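The propose, verify, feed-back cycle can be sketched in a few lines. This is an illustration of the pattern, not Chimera’s `TestConvergence` implementation: `converge_on_tests`, `run_tests`, and the scripted proposer are hypothetical, and a fixed list of attempts stands in for the agent.

```python
from typing import Callable, Optional

Candidate = Callable[[int], int]
Test = tuple[int, int]  # (input, expected output)


def run_tests(candidate: Candidate, tests: list[Test]) -> list[str]:
    """Verifier: one message per failing test; an empty list means pass."""
    failures = []
    for given, expected in tests:
        got = candidate(given)
        if got != expected:
            failures.append(f"f({given}) returned {got}, expected {expected}")
    return failures


def converge_on_tests(
    propose: Callable[[list[str]], Candidate],
    tests: list[Test],
    max_iterations: int = 10,
) -> tuple[Optional[Candidate], int]:
    """Iterate propose -> verify, passing failures back as counterexamples."""
    feedback: list[str] = []
    for iteration in range(1, max_iterations + 1):
        candidate = propose(feedback)
        feedback = run_tests(candidate, tests)
        if not feedback:
            return candidate, iteration
    return None, max_iterations


# Stand-in for the agent: three scripted attempts at "square the input".
attempts = iter([lambda x: x + 2, lambda x: x * 2, lambda x: x * x])
feedback_log: list[list[str]] = []


def scripted_propose(feedback: list[str]) -> Candidate:
    feedback_log.append(feedback)  # what a real agent would condition on
    return next(attempts)


solution, iterations = converge_on_tests(scripted_propose, [(2, 4), (3, 9)])
print(iterations, feedback_log)
```

The first two attempts both pass `f(2) == 4` and fail on `f(3)`, so the failure messages act as the counterexamples; a real agent would read them before proposing the next candidate.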

One thing that’s become clear running benchmarks: pass/fail isn’t enough. You need to know WHY the candidate failed. Chimera’s trace capture classifies failures into six named categories at chimera/eval/trace.py:104-128:

  • NO_TOOL_CALLS: never invoked tools
  • EXPLORE_ONLY: read and searched but never edited
  • NO_EDITS: used tools but skipped edit/write
  • EDIT_FAILURES: edits applied but threw errors
  • WRONG_PATCH: patch applied but tests still failed
  • MAX_TURNS: hit turn limit mid-task

That’s not abstract. It’s a working taxonomy of the ways an agent actually fails on SWE-bench. Classical CEGIS gives you a counterexample. This gives you a category. The category tells you whether the fix is to change the prompt, swap the loop, extend the turn budget, or rethink the task decomposition. Diagnosis is a layer above verification, and in practice it’s where the real iteration happens.
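A sketch of how such a classifier could be written. The six category names and their descriptions come from the list above; everything else is an assumption, including the `Trace` fields, which tools count as exploring or editing, and the order in which the checks run. It is not the code in chimera/eval/trace.py.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class FailureCategory(Enum):
    NO_TOOL_CALLS = "never invoked tools"
    EXPLORE_ONLY = "read and searched but never edited"
    NO_EDITS = "used tools but skipped edit/write"
    EDIT_FAILURES = "edits applied but threw errors"
    WRONG_PATCH = "patch applied but tests still failed"
    MAX_TURNS = "hit turn limit mid-task"


# Assumed tool groupings for this sketch.
EXPLORE_TOOLS = {"read", "search"}
EDIT_TOOLS = {"edit", "write"}


@dataclass
class Trace:
    """Hypothetical summary of one run."""
    tool_calls: list[str] = field(default_factory=list)
    edit_errors: int = 0
    hit_turn_limit: bool = False
    tests_passed: bool = False


def classify(trace: Trace) -> FailureCategory | None:
    """Return a failure category, or None when the run succeeded."""
    if trace.tests_passed:
        return None
    if not trace.tool_calls:
        return FailureCategory.NO_TOOL_CALLS
    if trace.hit_turn_limit:  # precedence here is an assumption
        return FailureCategory.MAX_TURNS
    used = set(trace.tool_calls)
    if not used & EDIT_TOOLS:
        if used <= EXPLORE_TOOLS:
            return FailureCategory.EXPLORE_ONLY
        return FailureCategory.NO_EDITS
    if trace.edit_errors:
        return FailureCategory.EDIT_FAILURES
    return FailureCategory.WRONG_PATCH


print(classify(Trace(tool_calls=["read", "search"])))
print(classify(Trace(tool_calls=["read", "edit"], edit_errors=1)))
```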

The Trainer coordinates all of this. You give it a spec, an agent, and a strategy. This connects to the SutroYaro work too. In that project, agents ran experiments within a bounded search space, logging results that other agents could build on. Chimera’s Trainer does the same thing for code. The spec is the bound. Each iteration is an experiment. The strategy decides where to search next.

The rest of the synthesis pipeline maps onto Chimera’s infrastructure layers. The feedback/repair loop is the agent loop itself — ReAct observes results, Reflexion reflects on failures, the ExactRepeatDetector prevents the agent from looping on the same fix. Context and retrieval live in Layer 2: repo_map and import_graph for code retrieval, context compaction for dynamic summarization, event-sourced sessions for persistence. Tool integration is Layer 1 plus the 47 tools: bash sandbox for execution, LSP for static analysis, browser and web search for external knowledge.
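One plausible way to catch an agent looping on the same fix is to fingerprint each action and count exact matches in a sliding window. This is a sketch of that idea only; `RepeatDetector`, its parameters, and the hashing scheme are hypothetical and say nothing about how Chimera’s ExactRepeatDetector is implemented.

```python
import hashlib
import json
from collections import deque


class RepeatDetector:
    """Flags an action once it recurs too often within the recent window."""

    def __init__(self, window: int = 5, max_repeats: int = 2) -> None:
        self._recent: deque = deque(maxlen=window)
        self._max_repeats = max_repeats

    @staticmethod
    def _fingerprint(tool: str, args: dict) -> str:
        # sort_keys makes the fingerprint independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def is_repeat(self, tool: str, args: dict) -> bool:
        """Record the action; True when it now exceeds max_repeats in the window."""
        fingerprint = self._fingerprint(tool, args)
        self._recent.append(fingerprint)
        return self._recent.count(fingerprint) > self._max_repeats


detector = RepeatDetector(window=5, max_repeats=2)
same_edit = {"path": "app.py", "old": "x = 1", "new": "x = 2"}
results = [detector.is_repeat("edit", same_edit) for _ in range(3)]
print(results)
```

A loop that sees `True` can stop retrying and change strategy, for example by reflecting on the failure or asking the user.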

The full synthesis pipeline — specification, search space, generation strategy, verification, feedback, context, tools — maps onto Chimera’s 8 layers. That’s not a coincidence. It’s the design.

I wrote about this on programsynthesis.pub: the insight that makes synthesis practical is not a better search algorithm, it’s a smaller language. Constrain the search space and the combinatorics become tractable. Chimera’s tools and environments are that constraint. They’re the DSL for coding agents.
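A toy enumerative synthesizer shows why the size of the language dominates. The three-primitive DSL and the helper names below are made up for the example; with k primitives and programs up to length n, the search space has k + k² + … + kⁿ candidates.

```python
from itertools import product

# A deliberately tiny, hypothetical DSL of integer-to-integer primitives.
SMALL_DSL = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}


def run(program, dsl, x):
    """Apply the primitives named in `program` from left to right."""
    for op in program:
        x = dsl[op](x)
    return x


def synthesize(dsl, examples, max_length):
    """Enumerate shortest-first; return (program, candidates tried)."""
    tried = 0
    for length in range(1, max_length + 1):
        for program in product(dsl, repeat=length):
            tried += 1
            if all(run(program, dsl, x) == y for x, y in examples):
                return program, tried
    return None, tried


def space_size(k: int, n: int) -> int:
    """Number of programs of length 1..n over k primitives."""
    return sum(k ** i for i in range(1, n + 1))


# Spec by examples: f(x) = (x + 1) * 2
program, tried = synthesize(SMALL_DSL, [(1, 4), (2, 6), (5, 12)], max_length=4)
print(program, tried)

print(space_size(3, 4))  # 120
print(space_size(9, 4))  # 7380
```

Tripling the number of primitives from 3 to 9 grows the length-4 search space from 120 to 7,380 candidates, which is the sense in which a smaller language matters more than a cleverer search.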

What’s missing: Chimera doesn’t have formal DSL/grammar definitions — the search space constraints are implicit through tool and environment choices, not explicit grammars you can reason about statically. There’s no formal verification integration (Z3, Lean — that lives in Bourbaki, not here yet). No programming-by-example mode where the spec is I/O pairs in the FlashFill sense. And no neural-guided search in the DeepCoder sense — the LLM is the neural guide, but there’s no learned policy over a grammar. These are on the roadmap. The composability argument is that you could add any of them as a tool or strategy without changing the architecture.

Benchmarks, honest version

| Benchmark | Score | Model |
|---|---|---|
| HumanEval (164 problems) | 66.5% pass@1 | GLM-5.1 |
| SWE-bench Lite (20 instances) | 10% | GLM-5 |
| Terminal-Bench (10 tasks) | 30% | GLM-5 |

I previously reported 90.9% with an earlier GLM-5; the raw data for that run is lost. The number above is what a re-run produces, backed by the live data file at data/humaneval-glm51-results.json. Numbers age. Posts should age with them.

HumanEval is decent. SWE-bench is not. I’m putting these numbers out there because the industry has a problem with cherry-picked benchmarks. When a lab says “77.8% on SWE-bench,” you can’t verify it without their proprietary scaffold. Chimera documents 13 known issues affecting benchmark performance. They’re in the repo. I’d rather ship honest numbers than impressive ones.

The gap between HumanEval and SWE-bench tells you something real. HumanEval problems are self-contained functions. SWE-bench problems require understanding an existing codebase, finding the right file, making the right edit, not breaking anything else. The gap is the distance between writing code and engineering software.

The gap is also where the research is. Which loop, which tools, which context management strategy, which prompting gets you from 10% to 70%? Chimera makes that question answerable by making every variable swappable. The primitives exist. The right composition hasn’t been found yet.

How this actually got built

I used to think building with coding agents meant sitting in the terminal, watching the agent work, course-correcting in real time. That’s how I started with Claude Code a year ago. Watch it, nudge it, approve every diff.

Chimera was different. The project was too big to babysit.

The workflow I settled into: mornings I’d write specs. What the module should do. What the API should look like. What invariants the tests should check. I’d describe constraints, point at related modules for style reference, and hand it off. Then I’d go do other things. Come back the next morning.

What I’d find: the agent had written the module, tests, sometimes docs. Sometimes the code was clean and I had nothing to add. Sometimes it made assumptions I didn’t intend, overengineered a simple thing, or solved a problem I didn’t ask it to solve. On those days I’d rewrite the spec to be more precise and run it again.

The hardest part wasn’t the code. It was learning to write specs that left less room for misinterpretation. I got better at this over weeks. The specs got shorter and more specific. The corrections got smaller. By the end I was spending more time reviewing than specifying, which felt like the right ratio.

The thing about daily check-ins instead of hourly: it changes what you optimize for. When you’re babysitting, you optimize for the current file. When you step back, you think about module boundaries, API design, how pieces fit together. You think at the architecture level. I wrote in coding-agents-made-me-better-programmer that agents shifted me from “how do I write this function” to “how should this system be designed.” Chimera was that shift taken to its conclusion.

Some numbers, roughly: I probably typed 10-15% of the code directly. The rest came from agents working off my specs. My time went to architecture decisions, spec writing, code review, and the occasional manual fix where the agent kept getting stuck on something. The overall build time was measured in weeks. I can’t give exact hours because it wasn’t a clean timeline and I was building other things in parallel. But the ratio of output to my direct input is lopsided in a way that still surprises me.

I also want to be clear: the agents didn’t architect this. I did. The 8-layer decomposition, the choice to model synthesis as training, the decision about which primitives matter and which don’t. Those came from years of building agents and thinking about program synthesis. The agents implemented the architecture I described. They didn’t come up with it. The difference between “agents can implement” and “agents can architect” is still large. Maybe that gap closes. Right now it’s real.

Where this leaves me

I keep building things from the same obsession. Making computers write code. Hopper was an early prototype. My program synthesis work was the theory. Erdős Navigator was a search environment. Bourbaki was a compute-verify loop. SutroYaro was a research protocol. Chimera is the framework layer that all of those needed and didn’t have.

The name is a nod to program synthesis, which is what the framework is really doing under the hood, and to the mythological chimera — something stitched together from parts that shouldn’t normally combine.

The bet is that coding agents go from monoliths to shared frameworks to an ecosystem of interchangeable components. We’re between the first phase and the second. The decomposition has to happen. You can’t scale an ecosystem on monoliths. Every field that matured went through this. Custom implementations, then shared abstractions, then the abstractions enabled combinations nobody would have tried by hand.

The gap between having the primitives and getting competitive results is real. Chimera has providers, tools, loops, environments, and synthesis strategies. The question is which combination gets you from 10% to 70% on SWE-bench with an open model. The primitives existed in deep learning for years before someone found the right composition. That’s a research problem now, not an infrastructure problem. Chimera is the infrastructure that makes the research tractable.

One thing I didn’t expect: the patterns Chimera discovered turned out to be portable. I extracted 14 of them — retry-loop, test-convergence, smart-compaction, ghost-commits, plan-act, investigate-first, lint-feedback, and others — and packaged them as Claude Code skills in a Chimera plugin. They work there too, outside the framework they came from, which tells me the patterns have value on their own. The framework is scaffolding for patterns that can live without it.

Chimera is open source, MIT-licensed, and in alpha. It has about as many known issues as working features. That feels right for a v0.6.

Or maybe someone else builds it better and I use that instead. I’d be fine with that too.

  • GitHub — Source, docs, examples
  • pip install chimera-run — PyPI package