The Security Model — Safe Outputs and Sandboxing

Lesson 3 · Safe Agentic Workflows · ~12 minutes

You've built hooks and commands that guide an agent. But to trust an agent with real autonomy — running unattended on your codebase — you need something stronger: a security model that makes catastrophic mistakes structurally impossible, not merely discouraged.

gh-aw solves this with defense-in-depth: multiple independent layers where any single layer's failure is contained by the others.

Primary Source

gh-aw Security Architecture: the complete reference for everything in this lesson. Read it afterward if you want the full detail.

The Core Insight: Read-Only Agents

Here's the single most important design decision in gh-aw:

Key principle

The agent job runs with read-only permissions. It cannot write to your repository, create PRs, post comments, or modify any external state. All writes happen in separate jobs that execute only after the agent's output has been validated.

This means even a fully compromised agent (one that has been prompt-injected, is confused, or is running arbitrary code) cannot directly damage anything. It can only produce an output artifact. What happens to that artifact is decided by trusted code the agent never touches.

The Three Trust Layers

Defense-in-Depth: Three Trust Layers

Layer 1: Substrate (container isolation + network firewall)
Kernel- and VM-enforced boundaries. Even arbitrary code can't escape.
Mechanisms: Docker containers, iptables network rules, MCP server sandboxing, kernel-enforced memory isolation.

Layer 2: Configuration (schema validation + action pinning + security scanners)
Compile-time checks that reject bad configurations before they run.
Mechanisms: JSON schema validation, SHA-pinned actions (prevents supply chain attacks), actionlint + zizmor + poutine scanning.

Layer 3: Plan (SafeOutputs + threat detection + secret redaction)
Runtime behavior constraints. Controls what the agent can do with its channels.
Mechanisms: structured output artifacts, AI-powered threat analysis, credential scrubbing, content sanitization.

Each layer protects against different failure modes. The substrate stops container escapes. Configuration stops misconfigurations and supply chain attacks. The plan stops malicious or confused agent outputs from reaching production.

Safe Outputs: The Permission Separation Pattern

Safe Outputs is the mechanism that enforces the read-only agent principle. Here's exactly how it works:

1. Agent Job (read-only): reads code, reasons, produces JSON.
2. Output Artifact (agent_output.json, data only): structured actions the agent wants to perform.
3. Threat Detection Job (no write access): AI analyzes for secrets, backdoors, policy violations.
4. Safe Output Jobs (scoped write): create_issue · add_comment · create_pull_request.

The critical property: the agent requests actions via structured JSON. It says "I want to create a PR with this diff." It does not create the PR. A separate job — one the agent never influences — decides whether to execute that request.

Each safe output job gets the minimum permissions needed for its specific action:

Safe Output Job | Permissions Granted
create_issue | issues: write
add_comment | issues: write
create_pull_request | contents: write, pull-requests: write
add_labels | issues: write
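
To make the gate concrete, here is a minimal Python sketch of the decision a trusted job could make before any write happens. It is an illustration, not gh-aw's implementation: the artifact layout (an "items" list whose entries carry a "type" field) and the function name plan_safe_outputs are assumptions made for this example. Only the permission mapping is taken from the table above.

```python
import json

# Minimum permissions per safe output job, copied from the table above.
SAFE_OUTPUT_PERMISSIONS = {
    "create_issue": {"issues": "write"},
    "add_comment": {"issues": "write"},
    "create_pull_request": {"contents": "write", "pull-requests": "write"},
    "add_labels": {"issues": "write"},
}


def plan_safe_outputs(artifact_text, threat_check_passed):
    """Return (action, permissions) pairs for the write jobs allowed to run.

    Hypothetical helper: the {"items": [{"type": ...}]} artifact shape is an
    assumption for illustration, not the documented agent_output.json schema.
    """
    if not threat_check_passed:
        return []  # hard gate: a failed verdict means no writes at all

    plan = []
    for request in json.loads(artifact_text)["items"]:
        kind = request.get("type")
        if kind not in SAFE_OUTPUT_PERMISSIONS:
            # The agent can only request actions that trusted code knows about.
            raise ValueError(f"unsupported action requested: {kind!r}")
        plan.append((kind, SAFE_OUTPUT_PERMISSIONS[kind]))
    return plan
```

The point of the sketch is the direction of control: the agent's JSON is treated as untrusted data, and the mapping from action to permission lives entirely in code the agent cannot edit.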

Agent Workflow Firewall (AWF)

Even though the agent is read-only, it still runs code. A compromised agent could try to exfiltrate data over the network. The Agent Workflow Firewall prevents this:

Agent Container: isolated Docker network, no direct internet access.
Squid Proxy: domain allowlist enforced, all HTTP/HTTPS routed here.
Internet: only allowed domains reachable, everything else dropped.

How it works: The agent is containerized. iptables rules redirect all HTTP/HTTPS traffic through a Squid proxy. The proxy enforces a domain allowlist configured in the workflow's network: frontmatter. Requests to unlisted domains are silently dropped.

You configure allowed domains per workflow:

network:
  firewall: true
  allowed:
    - defaults        # certificates, JSON schema
    - python          # PyPI, Conda
    - node            # npm, npmjs.com
    - "api.example.com"  # your custom API

The firewall container drops its own iptables capabilities before launching the agent — so even arbitrary code execution inside the agent container can't modify the firewall rules.
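
The allowlist decision itself is simple enough to sketch in a few lines of Python. In gh-aw the enforcement is done by Squid and iptables, not by code like this. The domain set below is a hypothetical expansion of the ecosystem names in the config above, and treating subdomains of an allowed domain as allowed is an assumption made for this example.

```python
from urllib.parse import urlsplit

# Hypothetical expansion of the allowlist; the real domain sets behind
# identifiers such as "python" or "node" are defined by gh-aw, not here.
ALLOWED_DOMAINS = {"pypi.org", "registry.npmjs.org", "api.example.com"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # only HTTP/HTTPS traffic is routed through the proxy
    host = (parts.hostname or "").lower()
    # Match on a dot boundary so "notpypi.org" does not pass as "pypi.org".
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

With this sketch, is_allowed("https://pypi.org/simple/requests/") is True, while is_allowed("https://evil.example.net/upload") is False. The check is default-deny: anything not explicitly listed is refused.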

Threat Detection Pipeline

Between the agent's output and the safe output jobs sits a threat detection job. This is a separate AI agent with a security-focused system prompt that analyzes the buffered output for secrets, backdoors, and policy violations.

The detection job has no write permissions and no access to the original agent's runtime. It only sees the artifacts. Its sole output is a pass/fail verdict.

Customizable detection

You can augment the AI detection with additional tools (Semgrep, TruffleHog) and custom prompts. For example, you might add: "Also reject any change to .github/workflows/ files."
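
A rule like the one quoted above can also be expressed as a deterministic check that runs alongside the AI analysis. The Python sketch below is a stand-in written for this lesson, not gh-aw's detection code: the function names and the {"pass": ..., "reasons": [...]} verdict shape are hypothetical, and it assumes the agent's proposed change arrives as a unified diff.

```python
BLOCKED_PREFIX = ".github/workflows/"


def touched_paths(diff_text):
    """Collect file paths named in the ---/+++ headers of a unified diff."""
    paths = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":
                continue  # placeholder used for added or deleted files
            if path.startswith(("a/", "b/")):
                path = path[2:]  # strip git's a/ and b/ prefixes
            paths.add(path)
    return paths


def verdict(diff_text):
    """Fail the check if any touched path lives under .github/workflows/."""
    blocked = sorted(p for p in touched_paths(diff_text) if p.startswith(BLOCKED_PREFIX))
    return {"pass": not blocked, "reasons": [f"modifies {p}" for p in blocked]}
```

The header matching is deliberately loose: a removed content line that itself begins with "-- " would also be read as a path. For a security gate, erring toward extra rejections is the safer side to err on.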

Content Sanitization

Before user-generated content (issue titles, PR bodies, comments) even reaches the agent, gh-aw sanitizes it. This protects the agent from prompt injection via input:

Mechanism | What It Does | Protects Against
@mention neutralization | @user → `@user` | Unintended user notifications
Bot trigger protection | fixes #123 → `fixes #123` | Automatic issue linking
XML/HTML tag conversion | <script> → (script) | Injection via markup
URI filtering | Only HTTPS from trusted domains | Data exfiltration URLs
Content limits | 0.5 MB max, 65k lines max | Denial-of-service payloads
Control character removal | ANSI escapes stripped | Terminal manipulation

Sanitization happens at the activation stage boundary — before the agent sees anything. The agent works with clean, normalized text.
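
Several of the mechanisms in the table can be approximated with a few regular expressions. The Python sketch below covers only @mention neutralization, tag conversion, control character removal, and content limits. The patterns are illustrative guesses rather than gh-aw's actual rules, and reading "0.5 MB" as 512 KiB and "65k" as 65,000 lines is an assumption.

```python
import re

MAX_BYTES = 512 * 1024  # assumption: "0.5 MB" read as 512 KiB
MAX_LINES = 65_000      # assumption: "65k lines" read as 65,000

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")           # CSI sequences such as colors
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps tab, LF, CR
MENTION = re.compile(r"(?<![\w`])@([A-Za-z0-9][A-Za-z0-9-]*)")   # skips emails and quoted mentions
TAG = re.compile(r"<(/?[A-Za-z][^<>]*)>")


def sanitize(text):
    """Illustrative sanitizer: neutralize markup and mentions, then cap size."""
    text = ANSI_ESCAPE.sub("", text)
    text = CONTROL_CHARS.sub("", text)
    text = TAG.sub(r"(\1)", text)        # <script> becomes (script)
    text = MENTION.sub(r"`@\1`", text)   # @user becomes `@user`
    text = "\n".join(text.split("\n")[:MAX_LINES])
    # Truncate by bytes, dropping any multibyte character cut in half.
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
```

For example, sanitize("hi @octocat <script>x</script>") returns "hi `@octocat` (script)x(/script)". Ordering matters in a pipeline like this: escape sequences are stripped first so that they cannot hide markup from the later passes.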

Connection to Your Existing Patterns

If you've used the /comply pattern (or any independent reviewer pattern), you already understand the principle behind threat detection:

Same principle, different scale

Your /comply hook uses an independent reviewer that examines an agent's work product without sharing its context. The threat detection pipeline is the same idea, formalized: a separate agent with a security-focused prompt reviews the output before it's applied. The difference is that gh-aw makes this a structural guarantee: not a suggestion, but a gate that blocks execution.

The pattern scales: local agents use hooks as guardrails (best-effort enforcement via your coding tools), while gh-aw hardens this into kernel-enforced isolation + permission separation + AI threat analysis. Same layered thinking, different trust boundaries.

Security Layers Summary

Layer | Mechanism | What It Stops
Substrate | VM + Docker containers | Memory corruption, host escape
Substrate | AWF network firewall (iptables + Squid) | Data exfiltration, unauthorized API calls
Substrate | MCP server sandboxing | Container escape, unauthorized tool access
Configuration | Schema validation + expression allowlist | Invalid configs, injection via expressions
Configuration | Action SHA pinning | Supply chain attacks, tag hijacking
Configuration | Security scanners (actionlint, zizmor, poutine) | Privilege escalation, misconfigurations
Plan | SafeOutputs (permission separation) | Direct write access abuse
Plan | Threat detection (AI + scanners) | Malicious patches, secret leaks
Plan | Content sanitization | Prompt injection via input
Plan | Secret redaction | Credential leakage in logs/artifacts

Check Your Understanding

1. What permissions does the agent job itself have?
Answer: Read-only. The agent job can read the codebase, use read-only MCP tools, and reason, but it cannot write to any external state. All writes are buffered as artifacts and deferred to separate safe output jobs that execute only after validation.

2. What happens if the threat detection job identifies a problem in the agent's output?
Answer: The workflow terminates. Threat detection is a hard, all-or-nothing gate: if it finds threats, none of the safe output jobs execute, with no partial execution, no retry, and no writes externalized. This is the safest default: when in doubt, do nothing.

3. What does the Safe Outputs subsystem actually do?
Answer: It enforces permission separation. The agent job is read-only and produces structured JSON artifacts describing desired actions. Separate jobs, each with only the permissions needed for its specific action, execute those writes after threat detection passes.

What's Next

You now understand why you can trust an agent with autonomy: because the architecture makes destructive mistakes structurally impossible. In the next lesson, we'll move from theory to practice — writing your first gh-aw workflow that uses safe outputs to create a PR from an agent's work.

Questions? Ask me to explain any security layer in more detail, walk through a specific attack scenario and how it's mitigated, or compare this model to other sandboxing approaches you've seen.