There's a growing interest in sandboxing AI agents. The idea is straightforward: if you can't trust what the agent will do, restrict what it can do at the operating system level. Lock down the filesystem. Block network access. Prevent dangerous commands. The agent runs in a cage.

This is a good instinct, and the right direction. But having built and shipped a sandbox system that uses kernel-level enforcement, I want to be honest about what it does and doesn't solve.

How kernel sandboxes work

First, the mechanics. Most people in this space have heard of containers and VMs. Those are coarse isolation boundaries. A kernel sandbox is finer-grained. It restricts what a single process can do without requiring a separate container or virtual machine.

On Linux, the primary mechanism is Landlock. It's a Linux Security Module that has been in the mainline kernel since version 5.13 (2021). Landlock allows unprivileged processes to restrict their own access rights. Once applied, the restrictions cannot be relaxed. A process can drop access rights but never regain them.
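The flow is small enough to sketch directly. The snippet below illustrates the Landlock sequence (create a ruleset, set no_new_privs, restrict self) via raw syscalls from Python. The syscall numbers and the single access flag come from the kernel's Landlock UAPI; everything else about the function is illustrative, not production code:

```python
import ctypes
import os
import sys

# Illustrative sketch only. Assumes Linux >= 5.13; syscall numbers and
# the access flag are from the linux/landlock.h UAPI. On anything else
# the function reports failure instead of crashing.
SYS_LANDLOCK_CREATE_RULESET = 444
SYS_LANDLOCK_RESTRICT_SELF = 446
PR_SET_NO_NEW_PRIVS = 38
LANDLOCK_ACCESS_FS_WRITE_FILE = 1 << 1


class LandlockRulesetAttr(ctypes.Structure):
    _fields_ = [("handled_access_fs", ctypes.c_uint64)]


def deny_all_file_writes():
    """Irreversibly remove file write access for the current process."""
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    # Handle write access but add no allow rules: every write is denied.
    attr = LandlockRulesetAttr(handled_access_fs=LANDLOCK_ACCESS_FS_WRITE_FILE)
    fd = libc.syscall(
        ctypes.c_long(SYS_LANDLOCK_CREATE_RULESET),
        ctypes.byref(attr),
        ctypes.c_size_t(ctypes.sizeof(attr)),
        ctypes.c_uint32(0),
    )
    if fd < 0:
        return False  # kernel too old, or Landlock disabled
    # Required once before restrict_self for unprivileged processes.
    libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    ok = libc.syscall(
        ctypes.c_long(SYS_LANDLOCK_RESTRICT_SELF),
        ctypes.c_int(fd),
        ctypes.c_uint32(0),
    ) == 0
    os.close(fd)
    return ok
```

Handling only LANDLOCK_ACCESS_FS_WRITE_FILE with no allow rules denies every future file write; a real configuration would handle the full access set and add landlock_add_rule entries for the allowed workspace paths.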

On macOS, the equivalent is Seatbelt (sandbox-exec). Apple uses it internally for almost every system service. It's profile-based: you define what's allowed, and everything else is denied.
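A sketch of what such a profile looks like (illustrative only; a working profile needs many more allow rules for dynamic libraries, device paths, and so on, and the exact filters depend on what the process needs):

```scheme
(version 1)
(deny default)                                ; everything not listed is denied
(allow file-read* (subpath "/usr/lib"))       ; shared libraries
(allow file-read* file-write*
       (subpath "/private/tmp/workspace"))    ; the agent's workspace
(allow network-outbound (remote ip "localhost:*"))
```

A profile like this is applied with sandbox-exec -f profile.sb <command>. Apple has deprecated sandbox-exec for third-party use, but it still works, and the same profile language underpins the sandboxes Apple applies to its own services.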

Both share a critical property: they're irreversible. Once a sandbox is applied to a process, it stays until the process exits. The agent can't undo it. Malicious code injected into the process can't undo it. The kernel enforces it. This is different from application-level restrictions, which can be bypassed by the application itself.

Libraries like nono-py make this accessible from Python. Two lines of code and the process is locked down. No container. No VM. Just kernel enforcement on the existing process.

What a sandbox gives you

A properly configured kernel sandbox provides real protection against several classes of agent misbehavior:

Credential theft. An agent that's been prompt-injected into trying to read ~/.aws/credentials or ~/.ssh/id_rsa will fail. The kernel denies the file access. The agent sees a PermissionError. The attack surface shrinks dramatically.

Destructive commands. Blocking rm, dd, chmod, sudo, and similar commands at the kernel level means that even if the agent generates and attempts to execute a destructive command, the OS refuses to run it. This is not an application-level blocklist that can be circumvented with creative shell syntax. It's enforced by the kernel.
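What a blocked command looks like from the agent's side is easy to demonstrate. A kernel exec denial surfaces inside execve as EACCES, which Python raises as PermissionError. The sketch below simulates that with a script whose execute bit is cleared; no sandbox is involved, but the failure has the same shape:

```python
import os
import stat
import subprocess
import tempfile


def run_if_allowed(cmd):
    # When the kernel refuses to exec a binary, the failure surfaces
    # inside execve as EACCES and Python raises PermissionError.
    try:
        return subprocess.run(cmd, capture_output=True, text=True)
    except PermissionError:
        return None  # denied by the kernel, not by application logic


# Simulate a denied binary: a script with no execute bit set.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/sh\necho should never run\n")
    denied = f.name
os.chmod(denied, stat.S_IRUSR | stat.S_IWUSR)  # rw-, no x
```

The point is that the denial happens below the application: no amount of creative shell syntax in cmd changes what execve returns.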

Lateral movement. If network access is restricted to localhost only, a compromised agent cannot phone home, exfiltrate data to an external server, or pivot to other systems on the network. It can only talk to the local TapPass proxy (the governance gateway described below), which means all its model calls still go through the governance pipeline.

Filesystem containment. Restricting write access to a single workspace directory means the agent can't modify application code, overwrite configuration files, or write to system directories. Even if it's tricked into trying.

This is meaningful. Against a class of attacks where the agent is manipulated into performing local system operations, a kernel sandbox is the strongest defense available. Stronger than application-level checks, stronger than Docker (which still allows most filesystem operations within the container), and stronger than any Python-level restriction.

What a sandbox doesn't give you

Here is where I want to be careful, because overpromising on sandboxes would be irresponsible.

It doesn't govern the model conversation

A sandboxed agent can still send any prompt to the model. It can still receive any response. The sandbox restricts what the process can do locally. It says nothing about what the agent says, asks, or reasons about over the network.

If an agent is tricked into including customer PII in a prompt sent to an external model provider, the sandbox won't prevent it. The network call to the model API is allowed (it has to be, or the agent can't function). The data leaves the process. The sandbox is irrelevant to this threat.

It doesn't understand tool semantics

An agent calls tools, and many of those tools are HTTP requests to internal APIs. A sandbox that allows localhost network access allows all of them equally. It can't distinguish between "query the CRM for the customer's order status" and "query the CRM for all customer records and export them." Both are HTTP requests to localhost. Both are allowed by the sandbox.

The sandbox operates at the syscall level. It sees file paths, network addresses, and process executions. It doesn't see API semantics, data content, or business logic. These are fundamentally different abstraction layers.
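The gap between the two abstraction layers can be made concrete. With hypothetical CRM URLs, everything the sandbox can see of the two requests (the address the process connects to) is identical:

```python
from urllib.parse import urlparse

# Illustration with hypothetical CRM URLs: the only part of an HTTP
# tool call visible at the syscall level is the address the process
# connects to. Path, query, and body never reach the sandbox.


def syscall_view(url):
    parsed = urlparse(url)
    return (parsed.hostname, parsed.port or 80)


benign = "http://localhost:8080/crm/orders?id=4472"
exfiltration = "http://localhost:8080/crm/customers/export_all"
```

Distinguishing the two requires inspecting the request itself, which is exactly the policy engine's job.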

It doesn't enforce policy

A sandbox is a fixed perimeter. It's set when the process starts and doesn't change. But governance policy is dynamic. A CISO might decide that an agent should lose access to financial data after business hours. Or that a particular tool should be disabled because of a newly discovered vulnerability. Or that agents in the insurance division need stricter data handling than agents in marketing.

Sandboxes don't adapt to policy changes. They're binary: applied or not. The process needs to restart to get a new sandbox configuration. Runtime policy changes require a different mechanism.

It doesn't produce compliance evidence

A sandbox prevents bad things from happening. It doesn't record what happened. It generates no audit trail, no compliance evidence, no proof that policies were enforced. If an auditor asks "what did the agent do last Tuesday?", the sandbox has no answer. It wasn't logging. It was blocking.

Prevention without evidence is useful for security. It's insufficient for compliance.

Two layers, not one

This is the core argument I want to make: you need two distinct layers, and they do different things.

Layer 1: Kernel sandbox (process level)

Restricts what the OS will allow the agent process to do. Filesystem, network, commands. Static, irreversible, enforced by the kernel. Protection against local system exploitation.

Layer 2: Policy engine (API level)

Inspects what the agent is trying to do with models and tools. Data classification, tool authorization, budget enforcement, anomaly detection. Dynamic, policy-driven, produces audit evidence.

The sandbox catches the agent trying to read ~/.ssh/id_rsa. The policy engine catches the agent trying to send customer health records to GPT-4. Different threats. Different layers. Both necessary.

What makes this work in practice is connecting the two layers so the sandbox configuration is driven by the same policy that governs API-level behavior. The CISO defines a policy. That policy generates both the sandbox rules (which tools, which filesystem paths, which network access) and the API-level rules (which models, which data classifications, which budget limits). One policy, two enforcement layers.
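As a sketch of that fan-out, with an entirely hypothetical policy schema (these field names are illustrative, not TapPass's actual format):

```python
# Hypothetical sketch: one policy document fanning out into the two
# enforcement layers. None of these field names come from TapPass.


def compile_policy(policy):
    """Derive kernel sandbox rules and API-level rules from one policy."""
    sandbox_rules = {
        # Static: compiled once, applied irreversibly at process start.
        "fs_read_write": policy["workspace_paths"],
        "blocked_commands": policy["blocked_commands"],
        "network": ["localhost"] if policy["proxy_only"] else ["*"],
    }
    api_rules = {
        # Dynamic: evaluated by the policy engine on every call.
        "allowed_models": policy["allowed_models"],
        "budget_per_session_usd": policy["budget_per_session_usd"],
        "pii_redaction": policy["pii_redaction"],
    }
    return sandbox_rules, api_rules


claims_policy = {
    "workspace_paths": ["./claims_data"],
    "blocked_commands": ["rm", "sudo", "dd"],
    "proxy_only": True,
    "allowed_models": ["gpt-4"],
    "budget_per_session_usd": 25,
    "pii_redaction": True,
}

sandbox_rules, api_rules = compile_policy(claims_policy)
```

The same source document feeds both layers, so a single policy change tightens the kernel rules and the API rules together.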

How this works concretely

Let me show what this looks like in code, because the abstraction is only useful if it's practical.

from tappass import Agent

# Connect to the governance proxy
agent = Agent(
    "https://app.tappass.ai/v1",
    api_key="tp_claims_agent_7f2a..."
)

# Apply kernel sandbox. Irreversible.
agent.secure(workspace="./claims_data")

# From this point:
# - Filesystem: only ./claims_data (rw) and cwd (ro)
# - Credentials: ~/.aws, ~/.ssh, etc. → blocked
# - Commands: rm, sudo, dd → blocked
# - Network: only localhost (TapPass proxy)
# - API calls: governed by pipeline policy
# - Data: classified in real time, PII redacted
# - Budget: enforced per session
# - Audit: every action logged with hash chain

response = agent.chat("Process claim #4472")

One method call applies the sandbox. The sandbox configuration isn't hardcoded. It's fetched from the TapPass server, which means it's driven by the CISO's pipeline configuration for that agent's identity. Different agents get different sandbox rules. A claims processing agent gets access to the claims directory. A reporting agent gets read-only access to the analytics directory. The agent developer doesn't decide what the sandbox allows. The policy does.

The sandbox also watches for policy changes. If the CISO updates the sandbox configuration on the server, the agent detects the change on the next polling interval and logs a warning. The new configuration takes effect on the next restart. This is a practical compromise: sandbox rules can't change at runtime (the kernel enforces that), but the agent is aware that its rules are stale.
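The staleness check itself is simple: fingerprint the configuration that was applied at startup, compare it against whatever the server currently returns, and log if they differ. A minimal sketch, with a hypothetical fetch callback:

```python
import hashlib
import json
import logging

# Hypothetical sketch of the staleness check. fetch_current stands in
# for whatever call retrieves the server-side sandbox configuration.


def config_fingerprint(cfg):
    canonical = json.dumps(cfg, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def sandbox_config_is_stale(applied_cfg, fetch_current):
    current = fetch_current()
    if config_fingerprint(current) != config_fingerprint(applied_cfg):
        # Kernel rules can't change at runtime, so drift is only
        # logged; the new rules take effect on the next restart.
        logging.warning("sandbox config changed on server; restart to apply")
        return True
    return False
```

Canonicalizing with sort_keys before hashing means a reordered but otherwise identical config is not flagged as drift.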

The policy-driven sandbox

The key insight is that the sandbox rules should not be defined by the developer. They should be defined by the governance policy and derived from the agent's identity and classification.

Consider a policy for a claims-processing agent: read-write access to its claims workspace only, the usual credential paths and destructive commands blocked, network restricted to the local proxy, and a per-session budget on model calls.
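Written out declaratively, such a policy might look like this (an illustrative sketch; the schema and field names are hypothetical, not an actual TapPass format):

```yaml
agent: claims-processor
sandbox:                          # compiled into kernel rules at startup
  filesystem:
    read_write: [./claims_data]
    read_only: [.]
  blocked_commands: [rm, sudo, dd, chmod]
  network: localhost-only
api:                              # enforced per call by the policy engine
  allowed_models: [gpt-4]
  data_classification: redact-pii
  budget_per_session_usd: 25
```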

The developer doesn't need to know these rules. They call agent.secure() and the correct sandbox configuration is applied based on who the agent is and what the CISO has decided. If the CISO tightens the policy, the next deployment gets tighter restrictions. No code change required.

This is what I mean by "policy-driven sandbox." The sandbox is a mechanism. The policy is the decision. Separating the two means you can change governance without changing code, and you can enforce governance on agents whose developers didn't think about security.

Practical limitations to be honest about

A few things that are worth stating plainly:

Sandboxing requires the agent to opt in. The agent process calls secure(). If the developer doesn't include that call, there's no sandbox. You can mandate it through code review, CI/CD checks, or by making it a default in your agent template. But it's not automatic. This is a limitation of process-level sandboxing: the process restricts itself.
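The CI/CD check can be as crude as a textual scan of agent entrypoints for the secure() call. A minimal sketch (the "agent*.py" naming convention is hypothetical, and a regex is obviously no substitute for semantic analysis):

```python
import pathlib
import re

# Hypothetical CI gate: fail the build when an agent entrypoint never
# calls .secure(). Textual scan only; it cannot tell whether the call
# actually runs, just whether it is present at all.
SECURE_CALL = re.compile(r"\.secure\s*\(")


def entrypoints_missing_sandbox(root):
    missing = []
    for path in pathlib.Path(root).rglob("agent*.py"):
        if not SECURE_CALL.search(path.read_text()):
            missing.append(str(path))
    return sorted(missing)
```

A check like this turns "the developer forgot to opt in" from a silent gap into a failed build.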

Not everything runs on Linux or macOS. Landlock requires kernel 5.13+. Seatbelt is macOS only. Windows has no equivalent mechanism at the same granularity. If your agents run on Windows, kernel sandboxing is not currently an option. Application-level restrictions are the fallback, with all their limitations.

Sandboxes can break legitimate functionality. If your sandbox blocks network access to external services and the agent needs to call a third-party API, the agent breaks. Getting the configuration right requires understanding what the agent actually needs. Too permissive and the sandbox is ineffective. Too restrictive and the agent can't do its job. This is the same tradeoff as any least-privilege system, and it requires iteration.

The sandbox doesn't help with the most common AI agent risks: prompt injection, data leakage through model calls, hallucinated tool arguments, excessive autonomy. These are API-level problems. The sandbox protects the local system. The policy engine protects the conversation. Most real incidents will be caught by the policy engine, not the sandbox.

I emphasize this because I don't want anyone to think that applying a kernel sandbox makes their agent "secure." It makes the agent's local environment more secure. The larger governance problem requires the second layer.


Sandboxing AI agents is a good idea. Kernel-level enforcement is the strongest form of it. But a sandbox without policy is just a smaller cage with no rules about what happens inside it. The combination of kernel isolation and policy-driven governance gives you both: hard boundaries at the process level and intelligent, adaptive governance at the API level. One catches the agent trying to read your SSH keys. The other catches the agent trying to email them to someone.

You need both.

See both layers in action

TapPass combines kernel sandboxing with policy-driven API governance. One line of code, two layers of protection.

Book a demo