Agent Sandboxes and the Proxmox Rabbit Hole
How my obsession with home labs led me to understand why coding agents need isolation
This started with Proxmox.
I got obsessed with the idea of building my own cloud. Not for any practical reason at first. Just curiosity. What would it take to run my own infrastructure? VMs on demand, isolated networks, the whole thing.
If you’ve been down this road, you know how it goes. You start with “I’ll just set up one server” and end up with a home lab running Proxmox, spinning up VMs for pen testing, isolated networks for experimenting, maybe a Kubernetes cluster because why not.
Then I started using coding agents seriously. Claude Code, Cursor, the usual suspects. And something clicked.
The same problems I was solving with Proxmox are the exact problems you need to solve when you let an AI write and execute code on your machine.
The problem nobody talks about
Here’s what happens when an agent generates code:
The agent writes some Python. Maybe it’s analyzing a dataset. Maybe it’s scraping a website. Maybe it’s installing packages to do something you asked for.
That code runs on your machine. Or your server. Or somewhere.
And that code could do anything.
I don’t mean this in a scary AI-will-take-over way. I mean it literally. The agent doesn’t know (or care) that the code it wrote has an infinite loop. Or that it’s about to `rm -rf` something important. Or that the package it’s installing has a supply chain vulnerability.
When I ask Claude Code to write a script that processes files, it runs that script. If the script is buggy, it’s buggy on my actual filesystem. If it spawns too many processes, that’s my actual CPU.
This is different from traditional software. Traditional software gets written by humans, reviewed, tested, deployed carefully. Agent-generated code gets written and executed in the same breath. There’s no review step. The feedback loop is: generate, run, see what happens.
Why I cared about Proxmox in the first place
My original interest in Proxmox was home lab stuff. I wanted to:
- Set up isolated networks to practice pen testing without touching real systems
- Run experiments that might break things, without breaking my actual setup
- Have VMs I could snapshot, destroy, recreate at will
The common thread: isolation. I wanted environments where I could do dangerous things safely.
Sound familiar?
The isolation spectrum
This is where my Proxmox obsession actually paid off. I’d already learned this stuff from a different angle.
There’s a spectrum of how isolated you can make code execution:
Process isolation is the weakest. Your operating system keeps processes separate. But they all share the same kernel. If something exploits the kernel, everything is compromised.
Containers (Docker, Podman) add namespace isolation. Each container gets its own view of the filesystem, network, process IDs. But they still share the host kernel. Container escapes happen. They’re not theoretical.
gVisor is Google’s approach. Instead of sharing the kernel, it implements a user-space kernel that intercepts system calls. The container thinks it’s talking to Linux, but gVisor emulates Linux in a memory-safe Go runtime. Smaller attack surface. But some performance overhead, and not everything works.
Firecracker is what AWS uses for Lambda. Each execution gets its own tiny VM with its own kernel. Hardware-enforced isolation. Even if the code exploits the kernel inside the VM, it can’t escape because the hardware (Intel VT-x, AMD-V) enforces the boundary. Boots in about 125ms with 5MB overhead.
Full VMs (QEMU, VMware, Proxmox) give you complete isolation but with more overhead. Seconds to boot, hundreds of megabytes of memory.
For home lab stuff, I used full VMs because I needed Windows guests and complex setups. For agent sandboxing, Firecracker-style microVMs hit the sweet spot: strong isolation, fast enough to not break the feedback loop.
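To make the Firecracker model concrete: you drive it over a REST API on a Unix socket, and a minimal microVM is defined by just a couple of small JSON payloads. Here's a sketch of their shape; the kernel path and boot args are placeholders you'd supply yourself, and in practice you'd `PUT` these to the API socket rather than just print them:

```python
import json

# Minimal Firecracker machine definition. These payloads go to the
# /machine-config and /boot-source endpoints on the API socket
# (e.g. via curl --unix-socket /tmp/firecracker.socket).
machine_config = {
    "vcpu_count": 1,
    "mem_size_mib": 128,  # the tiny footprint is the whole point
}

boot_source = {
    "kernel_image_path": "/path/to/vmlinux",  # placeholder path
    "boot_args": "console=ttyS0 reboot=k panic=1",
}

# Just showing the shape here, not actually booting a VM.
print(json.dumps(machine_config))
print(json.dumps(boot_source))
```

The striking part is how little there is: no BIOS, no device emulation beyond the basics, which is exactly why it boots in milliseconds instead of seconds.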
What’s out there for agent sandboxing
A few platforms have figured this out:
Modal is what I’ve been using most. Python-first, which fits how I work. You define a sandbox in Python, execute code in it, get results back.
```python
import modal

app = modal.App.lookup("my-sandbox", create_if_missing=True)

sandbox = modal.Sandbox.create(
    app=app,
    image=modal.Image.debian_slim().pip_install("pandas", "numpy"),
    timeout=60 * 10,
    gpu="T4",  # if you need it
)

# This runs in the sandbox, not on your machine
proc = sandbox.exec("python", "-c", "print('hello from sandbox')")
print(proc.stdout.read())

# This is fine. It's contained.
sandbox.exec("rm", "-rf", "/", "--no-preserve-root")

sandbox.terminate()
```
Sub-second cold starts. GPU support. Custom images. The developer experience is good.
E2B (Execute to Build) is specifically for agent code execution. They forked Firecracker and optimized it for this use case. About 150ms sandbox startup. They have a code interpreter that’s designed for LLM-generated code. 88% of Fortune 100 apparently uses them for agentic stuff, though I haven’t verified that claim.
```python
from e2b_code_interpreter import Sandbox

with Sandbox() as sandbox:
    sandbox.run_code("x = 1")
    result = sandbox.run_code("x += 1; x")
    print(result.text)  # "2"
```
Fly.io runs Firecracker microVMs with a nice API. More general-purpose than the others. Good if you want control over the infrastructure.
Together AI has snapshot-based sandboxes that resume from memory state in about 500ms. Useful if you need to preserve expensive setup (installed packages, loaded models) across executions.
Quick comparison:
| Platform | Startup | GPU | Main use case |
|---|---|---|---|
| Modal | ~300ms | Yes | AI/ML workloads |
| E2B | ~150ms | Limited | Agent code execution |
| Fly.io | ~300ms | Yes | General serverless |
| Together AI | ~500ms warm | Yes | AI development |
What makes a sandbox good for agents
After running a bunch of these, here’s what actually matters:
Fast cold starts. Agents iterate. They write code, run it, see what happens, adjust. If each iteration takes 5 seconds to spin up, the feedback loop breaks. You want under 500ms cold, under 100ms warm.
Strong isolation. The code is untrusted by definition. An LLM wrote it. The LLM might have been manipulated by a prompt injection. The code might be buggy. It might be actively malicious if someone’s attacking through the agent. You need hardware-level isolation, not just containers.
Resource limits. Agents will write infinite loops. They will allocate all available memory. They will spawn processes forever. The sandbox needs hard limits on CPU, memory, disk, network, process count.
Network control. Generated code might try to exfiltrate data, attack internal services, participate in a botnet. You need egress filtering. Allowlist specific domains. Block internal network access by default.
Clean filesystem. Each execution should start fresh. No state leakage between runs unless you explicitly want it.
Observability. When something fails, you need stdout, stderr, exit codes, resource usage, timing. Otherwise debugging agent failures is impossible.
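The observability baseline is small enough to sketch in a few lines: capture stdout, stderr, the exit code, and wall-clock time for every execution. The `observe` helper below is hypothetical, but it's the minimum report you want back from any sandbox run:

```python
import subprocess
import sys
import time

def observe(cmd: list[str]) -> dict:
    """Run a command and capture everything needed to debug an agent failure."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
        "duration_s": round(time.monotonic() - start, 3),
    }

report = observe([
    sys.executable, "-c",
    "import sys; print('out'); print('err', file=sys.stderr); sys.exit(3)",
])
print(report["exit_code"])  # 3
```

Managed platforms give you most of this for free; if you self-host, you have to build it, and you'll regret every field you skipped.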
Patterns that work
Ephemeral execution: Every code run gets a fresh sandbox. Maximum isolation. No state leakage. But you pay the cold start every time.
Session-based: A sandbox persists for a conversation or task. Variables and files stick around between executions. Faster iteration. But state accumulates, and you need to be careful about cleanup.
Pooled warm sandboxes: Pre-warm a pool of sandboxes. When you need one, grab from the pool, use it, reset it, return it. Near-instant execution. But you’re paying for idle sandboxes.
Snapshot/restore: Do expensive setup once (install packages, load models), take a snapshot, restore from snapshot for each execution. Fast starts with complex environments. But snapshot storage adds up.
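The pooled pattern is simple enough to sketch. Everything here is hypothetical scaffolding (`FakeSandbox` stands in for a real handle from Modal, E2B, or similar), but the checkout/reset/return flow is the whole idea:

```python
import queue

class FakeSandbox:
    """Stand-in for a real sandbox handle (Modal, E2B, ...). Hypothetical."""
    def __init__(self, sandbox_id: int):
        self.sandbox_id = sandbox_id

    def exec(self, code: str) -> str:
        return f"ran in sandbox {self.sandbox_id}"

    def reset(self) -> None:
        # In a real pool: wipe the filesystem, kill leftover processes.
        pass

class SandboxPool:
    """Pre-warm N sandboxes; check out, use, reset, return."""
    def __init__(self, size: int):
        self._pool: queue.Queue = queue.Queue()
        for i in range(size):
            self._pool.put(FakeSandbox(i))  # pay the cold start up front

    def run(self, code: str) -> str:
        sb = self._pool.get()   # near-instant: already warm
        try:
            return sb.exec(code)
        finally:
            sb.reset()          # no state leaks to the next caller
            self._pool.put(sb)

pool = SandboxPool(size=2)
print(pool.run("print('hi')"))
```

The reset step is the part people get wrong: skip it and the pooled pattern quietly degrades into the session-based one, state leakage included.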
The Docker question
Everyone asks: can I just use Docker?
Short answer: not by itself.
Docker wasn’t designed for mutually-distrustful multi-tenancy. Container escapes are regularly discovered. If you’re running untrusted code, a bug in the Linux kernel or the container runtime can give an attacker full host access.
You can harden Docker significantly:
- Use gVisor as the runtime (runsc)
- Enable user namespaces
- Add seccomp profiles
- Add AppArmor/SELinux policies
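Wired together, those hardening layers look roughly like this. The flags are real `docker run` options; the helper itself, the seccomp profile path, and the AppArmor policy name are placeholders you'd supply yourself:

```python
def hardened_docker_cmd(image: str, command: list[str]) -> list[str]:
    """Sketch a hardened docker run invocation for untrusted code.

    Note: user-namespace remapping is configured on the daemon
    (dockerd --userns-remap), not per-run, so it doesn't appear here.
    """
    return [
        "docker", "run", "--rm",
        "--runtime=runsc",                    # gVisor user-space kernel
        "--security-opt", "seccomp=/path/to/profile.json",  # placeholder path
        "--security-opt", "apparmor=agent-sandbox",         # placeholder policy
        "--memory=256m", "--cpus=1", "--pids-limit=64",     # hard resource caps
        "--network=none",                     # no egress by default
        "--read-only",                        # immutable root filesystem
        image, *command,
    ]

cmd = hardened_docker_cmd("python:3.12-slim", ["python", "-c", "print('hi')"])
print(" ".join(cmd))
```

`--network=none` is the blunt version of egress control; if the code legitimately needs the network, you're into allowlisting via a proxy or firewall rules, which is where the configuration burden really starts.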
Docker + gVisor is actually pretty good. But if you’re self-hosting, you have to keep everything updated and configured correctly. One misconfiguration and your isolation is gone.
For most people building agent products, managed services make more sense. Let someone else deal with the security updates and operational complexity.
Self-hosting vs managed
This maps to the home lab question.
Self-hosting (Firecracker, gVisor, Proxmox):
- Full control
- No per-execution costs
- You own the security updates
- You own the scaling
- You own the monitoring
- Requires bare-metal or nested virtualization
Managed (Modal, E2B, Fly.io):
- Zero ops burden
- Pay per use
- Vendor handles security
- Vendor handles scaling
- Some lock-in
- May have limitations
I run Proxmox at home for experiments. I use Modal for production agent stuff. The mental model is the same, but the operational burden is completely different.
The Proxmox connection
Here’s why the home lab experience translated:
When you run Proxmox, you think about:
- What happens if a VM goes rogue?
- How do I isolate networks so a compromised VM can’t attack others?
- How do I snapshot before risky operations?
- How do I set resource limits so one VM can’t starve the others?
When you run coding agents, you need to think about the same things:
- What happens if generated code goes rogue?
- How do I isolate executions so a malicious script can’t attack my system?
- How do I checkpoint before risky operations? (git commit, basically)
- How do I set limits so one bad script can’t exhaust resources?
The tools are different. The mental model is the same.
If you’ve ever set up VLANs in Proxmox to isolate a pen testing lab, you understand why agent sandboxes need network egress filtering. If you’ve ever had a VM fork bomb take down your host, you understand why agent sandboxes need process limits.
What’s coming
WebAssembly might be interesting. Near-native performance, strong sandboxing, capability-based security. WASI (the system interface) is still maturing, but it could be a middle ground between containers and microVMs.
Confidential computing (AMD SEV, Intel TDX) lets you run VMs where even the hypervisor can’t read guest memory. Privacy-preserving agent execution. Running agents on infrastructure you don’t fully trust. Cryptographic attestation that the sandbox is configured correctly.
Language-level sandboxing like Deno’s permission system. Weaker than OS/VM isolation but might be sufficient for some cases.
What I’d recommend
If you’re building an agent product:
- Use E2B or Modal. They’re built for this.
- Don’t roll your own unless you have specific requirements.
- Budget for maybe $0.001-0.01 per sandbox execution.
If you’re an enterprise with compliance requirements:
- Consider self-hosted Firecracker.
- Add gVisor to existing Kubernetes.
- Log everything.
If you’re like me and just want to experiment:
- Proxmox or UTM locally for playing with VMs.
- Docker + gVisor for quick iteration.
- Try Modal or E2B’s free tiers to see how managed feels.
Either way:
- Assume agent-generated code is malicious.
- Assume sandbox escapes are possible. Defense in depth.
- Monitor for weird resource usage.
- Keep everything updated.
Where this leaves me
I still run Proxmox at home. I still spin up VMs for experiments. But now I understand why that matters beyond just home lab hobbyism.
The isolation problem is the same problem. Whether you’re running a pen testing lab or running code that an LLM wrote, you need to contain the blast radius.
The difference with agents is scale. An agent might execute thousands of code snippets in a conversation. You can’t manually review each one. The sandbox has to be the safety net.
I spent years learning about virtualization because I thought it was interesting. Turns out it was training for something I couldn’t have predicted.
Or maybe I’m just justifying the money I spent on home lab hardware. That’s also possible.
References

Platforms:
- Modal
- E2B
- Fly.io
- Together AI

Technologies:
- Firecracker (NSDI ‘20 paper worth reading)
- gVisor
- Proxmox VE