Lesson 3: The Security Model — Safe Outputs and Sandboxing

You've built hooks and commands that guide an agent. But to trust an agent with real autonomy — running unattended on your codebase — you need something stronger: a security model that makes catastrophic mistakes structurally impossible, not merely discouraged.

gh-aw solves this with defense-in-depth: multiple independent layers where any single layer's failure is contained by the others.

Primary Source

gh-aw Security Architecture: the complete reference for everything in this lesson. Read it afterward if you want the full detail.

The Core Insight: Read-Only Agents

Key principle

The agent job runs with read-only permissions. It cannot write to your repository, create PRs, post comments, or modify any external state. All writes happen in separate jobs that execute only after the agent's output has been validated.

This means even a fully compromised agent — one that's been prompt-injected, confused, or running arbitrary code — cannot directly damage anything. It can only produce an output artifact. What happens to that artifact is decided by trusted code the agent never touches.

The Three Trust Layers

DEFENSE-IN-DEPTH — THREE TRUST LAYERS

LAYER 1: SUBSTRATE

Container isolation + network firewall
Hardware-enforced boundaries. Even arbitrary code can't escape.
Docker containers, iptables network rules, MCP server sandboxing, kernel-enforced memory isolation.

LAYER 2: CONFIGURATION

Schema validation + action pinning + security scanners
Compile-time checks that reject bad configurations before they run.
JSON schema validation, SHA-pinned actions (prevents supply chain attacks), actionlint + zizmor + poutine scanning.

LAYER 3: PLAN

SafeOutputs + threat detection + secret redaction
Runtime behavior constraints. Controls what the agent can do with its channels.
Structured output artifacts, AI-powered threat analysis, credential scrubbing, content sanitization.

Each layer protects against different failure modes. The substrate stops container escapes. Configuration stops misconfigurations and supply chain attacks. The plan stops malicious or confused agent outputs from reaching production.

Safe Outputs: The Permission Separation Pattern

Safe Outputs is the mechanism that enforces the read-only agent principle. Here's exactly how it works:

Agent Job
read-only
Reads code, reasons, produces JSON

Output Artifact
agent_output.json
Structured actions the agent wants to perform

Threat Detection Job
no write access
AI analyzes for secrets, backdoors, policy violations

Safe Output Jobs
scoped write
create_issue · add_comment · create_pull_request

The critical property: the agent requests actions via structured JSON. It says "I want to create a PR with this diff." It does not create the PR. A separate job — one the agent never influences — decides whether to execute that request.
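To make the request-versus-execute split concrete, here is a minimal sketch of what the trusted side might do with the agent's artifact. The artifact shape (an `items` list whose entries carry a `type` field) and the function name `plan_safe_outputs` are assumptions made for illustration, not gh-aw's actual schema or code. The permission map is copied from the permissions table in this section.

```python
import json

# Permissions each safe output job receives (from the lesson's permissions table).
SAFE_OUTPUT_PERMISSIONS = {
    "create_issue": {"issues": "write"},
    "add_comment": {"issues": "write"},
    "create_pull_request": {"contents": "write", "pull-requests": "write"},
    "add_labels": {"issues": "write"},
}


def plan_safe_outputs(artifact_text, enabled):
    """Return (request, permissions) pairs for the actions a workflow allows.

    Hypothetical artifact shape: {"items": [{"type": "create_issue", ...}]}.
    Raises ValueError on anything unexpected, so no write job gets scheduled.
    """
    artifact = json.loads(artifact_text)
    if not isinstance(artifact, dict):
        raise ValueError("artifact must be a JSON object")

    plan = []
    for item in artifact.get("items", []):
        kind = item.get("type") if isinstance(item, dict) else None
        # Reject action types that are unknown or not enabled for this workflow.
        if kind not in SAFE_OUTPUT_PERMISSIONS or kind not in enabled:
            raise ValueError(f"unsupported or disabled action: {kind!r}")
        plan.append((item, SAFE_OUTPUT_PERMISSIONS[kind]))
    return plan
```

The agent only ever writes the JSON text. Everything that maps a request to a permission set lives in code the agent cannot edit, which is the property this section describes.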

Each safe output job gets the minimum permissions needed for its specific action:

Safe Output Job	Permissions Granted
`create_issue`	`issues: write`
`add_comment`	`issues: write`
`create_pull_request`	`contents: write`, `pull-requests: write`
`add_labels`	`issues: write`

Agent Workflow Firewall (AWF)

Even though the agent is read-only, it still runs code. A compromised agent could try to exfiltrate data over the network. The Agent Workflow Firewall prevents this:

Agent Container
Isolated Docker network
No direct internet access

▶

Squid Proxy
Domain allowlist enforced
All HTTP/HTTPS routed here

▶

Internet
Only allowed domains reachable
Everything else dropped

How it works: The agent is containerized. iptables rules redirect all HTTP/HTTPS traffic through a Squid proxy. The proxy enforces a domain allowlist configured in the workflow's network: frontmatter. Requests to unlisted domains are silently dropped.

The firewall container drops its own iptables capabilities before launching the agent — so even arbitrary code execution inside the agent container can't modify the firewall rules.
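The allowlist decision itself can be sketched in a few lines. This is an illustration of the idea only: in AWF the check is performed by the Squid proxy, not by Python, and the rule that subdomains of an allowed domain also pass is an assumption of this sketch. The function name and the example domains are hypothetical.

```python
def is_domain_allowed(host, allowlist):
    """Return True if host is an allowed domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    for domain in allowlist:
        domain = domain.lower().lstrip(".")
        # Require an exact match or a match on a full label boundary,
        # so "evilpypi.org" does not pass for "pypi.org".
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Example values only; a real workflow lists its own domains.
ALLOWED = ["api.github.com", "pypi.org"]
```

Matching on a label boundary matters: a plain suffix comparison would let a look-alike domain through.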

Threat Detection Pipeline

Between the agent's output and the safe output jobs sits a threat detection job. This is a separate AI agent with a security-focused system prompt that analyzes the buffered output for secrets, backdoors, and policy violations.

The detection job has no write permissions and no access to the original agent's runtime. It only sees the artifacts. Its sole output is a pass/fail verdict.

Customizable detection

You can augment the AI detection with additional tools (Semgrep, TruffleHog) and custom prompts. For example, you might add: "Also reject any change to .github/workflows/ files."
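The gate logic can be sketched as follows, using the custom rule quoted above as a deterministic check. This stands in for the real detection job, which is AI-driven. The function names, the `"pass"`/`"fail"` strings, and the idea of passing jobs as callables are all assumptions made for this sketch.

```python
import posixpath


def touches_workflows(changed_paths):
    """Custom rule: flag any change under .github/workflows/."""
    return any(
        posixpath.normpath(path).startswith(".github/workflows/")
        for path in changed_paths
    )


def detection_verdict(changed_paths, extra_checks=()):
    """Return "pass" only if no check flags the output, otherwise "fail"."""
    checks = (touches_workflows, *extra_checks)
    return "fail" if any(check(changed_paths) for check in checks) else "pass"


def run_safe_outputs(verdict, jobs):
    """All-or-nothing gate: run every safe output job on "pass", none otherwise."""
    if verdict != "pass":
        return []
    return [job() for job in jobs]
```

A single failing check blocks every write, which mirrors the hard-gate behavior described in this lesson: there is no partial execution.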

Content Sanitization

Before user-generated content (issue titles, PR bodies, comments) even reaches the agent, gh-aw sanitizes it. This protects the agent from prompt injection via input:

Mechanism	What It Does	Protects Against
@mention neutralization	@user → `@user`	Unintended user notifications
Bot trigger protection	fixes #123 → `fixes #123`	Automatic issue linking
XML/HTML tag conversion	<script> → (script)	Injection via markup
URI filtering	Only HTTPS from trusted domains	Data exfiltration URLs
Content limits	0.5 MB max, 65k lines max	Denial-of-service payloads
Control character removal	ANSI escapes stripped	Terminal manipulation

Sanitization happens at the activation stage boundary — before the agent sees anything. The agent works with clean, normalized text.
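A few of the mechanisms in the table can be approximated with regular expressions. The sketch below is not gh-aw's implementation: the patterns, the keyword list for bot triggers, and the reading of "65k lines" as 65,000 are assumptions, and URI filtering is left out entirely.

```python
import re

MAX_BYTES = 512 * 1024  # 0.5 MB, from the table
MAX_LINES = 65_000      # "65k lines" read as 65,000 (assumption)

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
# Control characters except tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# An @name not preceded by a word character or backtick (skips emails).
MENTION = re.compile(r"(?<![\w`])@([A-Za-z0-9][A-Za-z0-9-]*)")
BOT_TRIGGER = re.compile(
    r"\b((?:fix|fixes|fixed|close|closes|closed|resolve|resolves|resolved) #\d+)",
    re.IGNORECASE,
)
TAG = re.compile(r"<(/?[A-Za-z][^<>]*)>")


def sanitize(text):
    """Apply a subset of the table's sanitization steps to untrusted text."""
    text = ANSI_ESCAPE.sub("", text)       # strip ANSI escape sequences
    text = CONTROL_CHARS.sub("", text)     # drop remaining control characters
    text = TAG.sub(r"(\1)", text)          # <script> becomes (script)
    text = MENTION.sub(r"`@\1`", text)     # @user becomes `@user`
    text = BOT_TRIGGER.sub(r"`\1`", text)  # fixes #123 becomes `fixes #123`
    text = "\n".join(text.split("\n")[:MAX_LINES])
    # Truncate by bytes; ignore a multi-byte character cut in half at the end.
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
```

Because this runs before the agent sees the text, a mention or closing keyword embedded in an issue body arrives as inert, backtick-quoted text rather than something that could trigger a side effect downstream.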

Connection to Your Existing Patterns

If you've used the /comply pattern (or any independent reviewer pattern), you already understand the principle behind threat detection:

Same principle, different scale

Your /comply hook uses an independent reviewer that examines an agent's work product without sharing its context. The threat detection pipeline is the same idea, formalized: a separate agent with a security-focused prompt reviews the output before it's applied. The difference is that gh-aw makes this a structural guarantee — it's not a suggestion, it's a gate that blocks execution.

The pattern scales: local agents use hooks as guardrails (best-effort enforcement via your coding tools), while gh-aw hardens this into kernel-enforced isolation + permission separation + AI threat analysis. Same layered thinking, different trust boundaries.

Security Layers Summary

Layer	Mechanism	What It Stops
Substrate	VM + Docker containers	Memory corruption, host escape
Substrate	AWF network firewall (iptables + Squid)	Data exfiltration, unauthorized API calls
Substrate	MCP server sandboxing	Container escape, unauthorized tool access
Configuration	Schema validation + expression allowlist	Invalid configs, injection via expressions
Configuration	Action SHA pinning	Supply chain attacks, tag hijacking
Configuration	Security scanners (actionlint, zizmor, poutine)	Privilege escalation, misconfigurations
Plan	SafeOutputs (permission separation)	Direct write access abuse
Plan	Threat detection (AI + scanners)	Malicious patches, secret leaks
Plan	Content sanitization	Prompt injection via input
Plan	Secret redaction	Credential leakage in logs/artifacts

Check Your Understanding

1. What permissions does the agent job itself have?

Full write access scoped to the repository
Read-only — it cannot write to any external state
Write access gated by a confirmation prompt
No permissions at all — it can't even read the repo

Correct. The agent job is strictly read-only. It can read the codebase, use read-only MCP tools, and reason — but all writes are buffered as artifacts for separate jobs to execute.

The agent job runs with read-only permissions. It can read the repository and use read-only tools, but it cannot write to any external state. All writes are deferred to separate safe output jobs that execute only after validation.

2. What happens if the threat detection job identifies a problem in the agent's output?

The problematic section is removed and the rest proceeds
The agent is re-run with a corrective prompt
The entire workflow fails — no writes are externalized at all
A human is notified for manual approval

Right. Threat detection is a hard gate: if it fails, the workflow terminates and none of the safe output jobs execute. Zero writes reach the outside world.

Threat detection is all-or-nothing. If it finds threats, the entire workflow terminates — no partial execution, no retry, no writes externalized. This is the safest default: when in doubt, do nothing.

3. What does the Safe Outputs subsystem actually do?

Encrypts the agent's output before storing it
Gives the agent temporary write tokens that expire
Separates agent execution (read-only) from write execution (separate jobs with scoped permissions)
Logs everything the agent does for audit purposes

Exactly. SafeOutputs is permission isolation: the agent produces structured output artifacts, and separate jobs with minimum scoped permissions execute the actual writes. The agent never has write access.

SafeOutputs enforces permission separation. The agent job is read-only and produces structured JSON artifacts describing desired actions. Separate jobs — each with only the permissions needed for their specific action — execute those writes after threat detection passes.

What's Next

You now understand why you can trust an agent with autonomy: because the architecture makes destructive mistakes structurally impossible. In the next lesson, we'll move from theory to practice — writing your first gh-aw workflow that uses safe outputs to create a PR from an agent's work.

Questions? Ask me to explain any security layer in more detail, walk through a specific attack scenario and how it's mitigated, or compare this model to other sandboxing approaches you've seen.