AI Agent Isolation - Sandbox Architecture for Agentic Systems

The shift to agentic AI changes the threat model of your infrastructure. This article covers the core isolation principles, the architectural controls that matter, and how they are implemented in practice.

The threat model shift

A traditional LLM is stateless and passive. It receives a prompt and returns a response. The attack surface is limited to the output: hallucinations, biased content, and the effects of prompt injection through the input.

An agent is different. It has tools. It reads files, calls APIs, executes code, browses the web, spawns sub-processes. Every capability that makes agents useful also expands the blast radius when something goes wrong, whether through a compromised prompt, a misconfigured tool, or a manipulated input from an external source.

The question is no longer “what does the model say” but “what can the model do and to what.”

Blast radius - the core concept

Blast radius is the scope of damage an agent can cause if compromised or manipulated.

Without containment, a single successful prompt injection, hidden in a malicious email, a crafted file, or a poisoned web page, can pivot into full infrastructure access. The agent reads a document containing hidden instructions, interprets them as legitimate commands, and begins executing actions it was never meant to take.

With proper isolation, that same attack hits a wall. The agent is contained. What it can reach, write, and call is structurally limited, not by the model’s judgment, but by the architecture beneath it.

                   Without sandbox               With sandbox
Production DB      Agent ──► Production DB       Agent ──► [BLOCKED]
Secrets store      Agent ──► Secrets store       Agent ──► [BLOCKED]
Internal network   Agent ──► Internal network    Agent ──► Egress gate ──► allowlist only
Code execution     Agent ──► Arbitrary exec      Agent ──► Scoped tool runtime

The four controls

A properly designed sandbox enforces four distinct controls. These are not configuration options; they are architectural guarantees.

1. Network isolation with an explicit egress allowlist

The agent runtime has no direct route to the internet, production systems, or internal network by default. All outbound traffic is routed through a single egress gate that enforces an explicit allowlist.

Deny-all by default. Only endpoints explicitly listed are reachable. Everything else is blocked and logged.

Example allowlist (NemoClaw / OpenShell policy YAML):

network:
  egress:
    deny: all
    allow:
      - api.anthropic.com
      - pypi.org
      - registry.npmjs.org

This is the control that stops lateral movement and data exfiltration cold. Even if the agent is manipulated into attempting an outbound call to an attacker-controlled server, the egress gate blocks it at the network layer before any data leaves.
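
The deny-by-default decision itself is small enough to sketch. The following is a minimal illustration in Python, not the enforcement mechanism of any real egress gate: the function names are hypothetical, the allowlist simply mirrors the policy example above, and real enforcement belongs at the network layer rather than in agent-side code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist mirroring the policy example above.
EGRESS_ALLOWLIST = frozenset({"api.anthropic.com", "pypi.org", "registry.npmjs.org"})


def is_egress_allowed(url: str, allowlist=EGRESS_ALLOWLIST) -> bool:
    """Return True only if the URL's host exactly matches an allowlisted entry."""
    host = urlsplit(url).hostname  # lowercased; None when no host is present
    if host is None:
        return False  # deny anything that cannot be parsed into a host
    return host.rstrip(".") in allowlist


def gate(url: str, log: list) -> bool:
    """Decide on a request and record the decision, allowed or blocked."""
    allowed = is_egress_allowed(url)
    log.append((url, "ALLOW" if allowed else "BLOCK"))
    return allowed
```

Exact host matching is a deliberate choice in this sketch: a suffix or substring match would let a host such as pypi.org.evil.example slip through.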

2. Filesystem access limited to explicitly granted folders

The agent operates on a mount-point basis. It sees only the folders you explicitly grant access to. The host filesystem, secrets directories, environment files, and system paths are invisible to the agent by default.

This means:

  • No access to ~/.ssh, ~/.aws, .env files, or credential stores
  • No traversal outside the granted scope
  • Write access limited to designated output directories
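
As a rough illustration of the "no traversal outside the granted scope" rule, the check below resolves a requested path and rejects it unless it lands inside a granted root. It is a sketch under stated assumptions: resolve_in_scope is a hypothetical helper, it needs Python 3.9 or later, and an application-level check like this complements mount-point isolation rather than replacing it.

```python
from pathlib import Path


def resolve_in_scope(requested: str, granted_roots: list[Path]) -> Path:
    """Resolve a path and return it only if it falls inside a granted root."""
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison is made on the real location, not the requested spelling.
    candidate = Path(requested).resolve()
    for root in granted_roots:
        if candidate.is_relative_to(root.resolve()):
            return candidate
    raise PermissionError(f"{requested!r} is outside the granted scope")
```

Resolving before comparing is the important step: a request such as workspace/../.env looks in-scope as a string but resolves to a path outside the granted folder.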

3. Tool access scoped to the current task

The agent is provisioned with only the tools required for the current task. A research agent has web search. A document agent has file read/write. Neither has shell access, database write access, or email unless explicitly required.

This is least-privilege applied at the tool level. Granting a full capability set “just in case” is how blast radii expand.

Task          Tools                     HITL
research      web_search, read_file
code review   read_file, run_tests
report gen    read_file, write_file
deployment    deploy_api
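
One way to express task-scoped provisioning is a registry lookup that refuses any tool not granted to the current task. The sketch below reuses the task and tool names from the table above; TASK_TOOLS, tools_for, and call_tool are hypothetical names, not the API of any specific agent framework.

```python
# Task-to-tool grants, taken from the table above.
TASK_TOOLS = {
    "research": {"web_search", "read_file"},
    "code review": {"read_file", "run_tests"},
    "report gen": {"read_file", "write_file"},
    "deployment": {"deploy_api"},
}


def tools_for(task: str) -> frozenset:
    """Return the tools granted to a task; unknown tasks get none."""
    return frozenset(TASK_TOOLS.get(task, ()))


def call_tool(task: str, tool: str, registry: dict, *args, **kwargs):
    """Invoke a tool only if it is granted to the current task."""
    if tool not in tools_for(task):
        raise PermissionError(f"tool {tool!r} is not granted to task {task!r}")
    return registry[tool](*args, **kwargs)
```

An unknown task receives an empty tool set, which keeps the default at deny rather than at "everything".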

4. Process isolation at hypervisor or kernel level

The execution environment is isolated at the OS level, not just in application code.

Two approaches exist:

Hypervisor-level (VM): The agent runs inside a dedicated virtual machine with a custom root filesystem. The VM is ephemeral, booted at task start, wiped on exit. Compromise is contained within the VM boundary. Nothing persists to the host.

Kernel-level: The agent process runs on the host kernel but is constrained by kernel-enforced policies. Syscalls are filtered, filesystem paths are restricted, and network access is namespaced. This carries less overhead than a full VM, with narrower but still enforceable boundaries.

Hypervisor isolation        Kernel-level isolation
Full VM (custom Linux)      seccomp (syscall filter)
Ephemeral rootfs            Landlock (path enforcement)
Mount-point access only     Network namespace
Host is invisible           Policy YAML enforcement
Wiped on session end        Persistent state across sessions
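
The "created at task start, wiped on exit" lifecycle can be illustrated on a small scale with a temporary directory. This sketch shows only the ephemeral lifecycle; it provides no isolation by itself, and ephemeral_workspace is a hypothetical helper rather than part of either tool discussed in this article.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_workspace(prefix: str = "agent-task-"):
    """Yield a scratch directory that is deleted when the task ends."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        # Everything written under tmp is removed on exit, even if the
        # task raises an exception.
        yield Path(tmp)
```

A task would wrap its work in "with ephemeral_workspace() as ws:" and write only under ws, so nothing it produces survives unless it is explicitly copied out.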

The HITL gate - where isolation meets governance

The sandbox limits what the agent can reach. But for actions with real-world consequences, such as database writes, deployments, and access control changes, containment alone is insufficient.

These actions require an explicit human approval step before execution.

This is not a UX choice. It is a security control.

An agent that can autonomously execute irreversible actions at scale is a force multiplier for both productivity and harm. The HITL gate ensures that speed does not come at the cost of auditability and accountability.

How it works

Every decision, approve or reject, is logged immutably. The audit trail is not optional.
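
A minimal sketch of such a gate, assuming a hypothetical set of high-impact action names and an approver callback that stands in for the human reviewer. The hash chain makes later edits to the log detectable; it is a stand-in for a truly immutable store, which would need append-only storage outside the agent's reach.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical action names that require human approval.
HIGH_IMPACT = {"db_write", "deploy", "acl_change"}


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def append(self, record: dict) -> dict:
        """Add a record whose hash covers the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        entry = {"record": record, "prev": prev, "hash": digest}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited record breaks it."""
        prev = "0" * 64
        for entry in self.entries:
            body = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def execute(action: str, run, approver, log: AuditLog):
    """Run low-impact actions directly; gate high-impact ones on approval."""
    if action in HIGH_IMPACT:
        approved = bool(approver(action))
        log.append({"action": action, "decision": "approve" if approved else "reject"})
        if not approved:
            return None  # rejected: the action never runs
    return run()
```

The decision is logged before the action runs, so a rejected or failed action still leaves a record.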

Implementation examples

Anthropic - Claude Cowork

Cowork applies the sandbox model at the desktop level, targeting knowledge workers without requiring terminal access.

Isolation mechanism:

  • macOS: Apple Virtualization Framework (VZVirtualMachine); it’s a full Linux VM
  • Windows: Microsoft Host Compute System; equivalent VM isolation
  • Custom Linux root filesystem, ephemeral per session

Controls in practice:

  • Network access configured via allowlist in app Settings
  • Filesystem access limited to explicitly mounted folders
  • Session ends → VM wiped, no persistent state (except when using projects)
  • Claude surfaces a confirmation prompt before any destructive action (file deletion, significant writes) — the user approves or rejects before execution proceeds

Current limitation:

  • Research preview: prompt injection vulnerability disclosed at launch (white-on-white text in documents); avoid sensitive or regulated data

NVIDIA - NemoClaw

NemoClaw is an open source stack for deploying always-on autonomous agents, built on top of OpenClaw with enterprise-grade security via the OpenShell runtime.

Isolation mechanism:

  • Kernel-level: Landlock (filesystem), seccomp (syscall filter), network namespace
  • Declarative policy YAML — every constraint is explicit and auditable

Controls in practice:

  • Network egress governed by openclaw-sandbox.yaml: deny-all default, explicit allow entries
  • Filesystem constraints enforced at kernel level: agent cannot access paths outside policy
  • Every action intercepted by OpenShell before execution: if not in policy, blocked and surfaced
  • Policy updated by operator from outside the sandbox: agent cannot modify its own constraints

Current limitations:

  • Persistent state by design: always-on agents maintain memory across sessions (this is a feature, not a bug, but it means the “ephemeral filesystem” control does not apply)
  • Early preview: rough edges expected, not production-ready

What is not covered

Two controls that are frequently cited but not fully implemented in either tool today:

Indirect prompt injection detection. When an agent reads untrusted content such as a web page, an email, or a document, that content arrives in the same context window as the agent’s instructions. A malicious actor can embed instructions in that content to manipulate the agent. Detection requires a separate model or classifier that inspects content before it reaches the agent. This is what NeMo Guardrails addresses, but it is a separate layer from the sandbox, not part of Cowork or NemoClaw directly.

Inter-agent trust boundaries. In multi-agent architectures, a compromised sub-agent can return malicious results to an orchestrator that executes them with higher privileges. Sandbox isolation at the individual agent level does not address propagation across agent boundaries. This remains an open problem in production agentic systems.

Key takeaways

The sandbox is not a feature you bolt on after deployment. It is the architectural foundation that makes agentic AI defensible in production.

The four controls (network isolation, filesystem scope, tool access, and process isolation) address the structural risks introduced by agents that can act. The HITL gate addresses the governance risk of agents that can cause irreversible consequences.

No single tool covers every control completely today. Both Cowork and NemoClaw represent serious engineering toward the right architecture. The gap between where they are and where production-grade isolation needs to be is the work that remains.

Moving to agentic AI is the right direction. Building the containment layer first is not optional.

Sources

Claude Cowork

NVIDIA NemoClaw

NeMo Guardrails