You've built hooks and commands that guide an agent. But to trust an agent with real autonomy — running unattended on your codebase — you need something stronger: a security model that makes catastrophic mistakes structurally impossible, not merely discouraged.
gh-aw solves this with defense-in-depth: multiple independent layers where any single layer's failure is contained by the others.
Further reading: gh-aw Security Architecture, the complete reference for everything in this lesson. Read it afterward if you want the full detail.
Here's the single most important design decision in gh-aw:
The agent job runs with read-only permissions. It cannot write to your repository, create PRs, post comments, or modify any external state. All writes happen in separate jobs that execute only after the agent's output has been validated.
This means even a fully compromised agent — one that's been prompt-injected, confused, or running arbitrary code — cannot directly damage anything. It can only produce an output artifact. What happens to that artifact is decided by trusted code the agent never touches.
The architecture has three layers (substrate, configuration, and plan), and each protects against different failure modes. The substrate layer contains container escapes. The configuration layer catches misconfigurations and supply chain attacks. The plan layer stops malicious or confused agent output from reaching production.
Safe Outputs is the mechanism that enforces the read-only agent principle. Here's exactly how it works:
Flow: agent_output.json → safe output jobs (create_issue · add_comment · create_pull_request)
The critical property: the agent requests actions via structured JSON. It says "I want to create a PR with this diff." It does not create the PR. A separate job — one the agent never influences — decides whether to execute that request.
Each safe output job gets the minimum permissions needed for its specific action:
| Safe Output Job | Permissions Granted |
|---|---|
| create_issue | issues: write |
| add_comment | issues: write |
| create_pull_request | contents: write, pull-requests: write |
| add_labels | issues: write |
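The request-then-execute split and the permission table above can be sketched in a few lines of Python. This is an illustration only, not gh-aw's implementation: the `items` and `type` field names, the `plan_safe_outputs` helper, and the `push_to_main` request are hypothetical, and the permission mapping is simply copied from the table.

```python
import json

# Mapping copied from the table above: each safe output type and the
# permissions its job is granted.
JOB_PERMISSIONS = {
    "create_issue": {"issues": "write"},
    "add_comment": {"issues": "write"},
    "create_pull_request": {"contents": "write", "pull-requests": "write"},
    "add_labels": {"issues": "write"},
}


def plan_safe_outputs(agent_output_json, enabled_types):
    """Split the agent's requests into accepted jobs and rejected requests.

    The agent only produces `agent_output_json`. This function stands in for
    the trusted code that decides what actually gets executed.
    The "items"/"type" layout is an assumed, hypothetical schema.
    """
    requests = json.loads(agent_output_json).get("items", [])
    accepted, rejected = [], []
    for request in requests:
        kind = request.get("type")
        if kind in enabled_types and kind in JOB_PERMISSIONS:
            accepted.append({"request": request, "permissions": JOB_PERMISSIONS[kind]})
        else:
            rejected.append(request)
    return accepted, rejected


agent_output = json.dumps({
    "items": [
        {"type": "create_issue", "title": "Flaky test in CI", "body": "Details..."},
        {"type": "push_to_main", "ref": "main"},  # hypothetical: not a safe output type
    ]
})
accepted, rejected = plan_safe_outputs(agent_output, enabled_types={"create_issue"})
print([job["permissions"] for job in accepted])    # [{'issues': 'write'}]
print([request["type"] for request in rejected])   # ['push_to_main']
```

The point of the sketch is that the agent's output is inert data. A request the workflow has not enabled is dropped, and an accepted request runs with only the permissions listed for its type.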
Even though the agent is read-only, it still runs code. A compromised agent could try to exfiltrate data over the network. The Agent Workflow Firewall prevents this:
iptables rules redirect all HTTP/HTTPS traffic through a Squid proxy. The proxy enforces a domain allowlist configured in the workflow's network: frontmatter. Requests to unlisted domains are silently dropped.
You configure allowed domains per workflow:
network:
  firewall: true
  allowed:
    - defaults             # certificates, JSON schema
    - python               # PyPI, Conda
    - node                 # npm, npmjs.com
    - "api.example.com"    # your custom API
The firewall container drops its own iptables capabilities before launching the agent — so even arbitrary code execution inside the agent container can't modify the firewall rules.
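To make the allowlist idea concrete, here is a minimal Python sketch of the kind of decision the proxy makes for each request. It is not how Squid or gh-aw is implemented. The `ECOSYSTEMS` expansion is a hypothetical, shortened stand-in for the real ecosystem lists, and treating subdomains of an allowed domain as allowed is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical expansion of named ecosystem groups. The real lists are
# maintained by gh-aw and are assumed to be longer than this.
ECOSYSTEMS = {
    "python": {"pypi.org", "files.pythonhosted.org"},
    "node": {"registry.npmjs.org", "npmjs.com"},
}


def expand_allowed(entries):
    """Turn a mix of ecosystem names and literal domains into a set of domains."""
    domains = set()
    for entry in entries:
        domains |= ECOSYSTEMS.get(entry, {entry})
    return domains


def is_allowed(url, allowed_domains):
    """Allow a URL only if its host is an allowed domain or a subdomain of one.

    Subdomain matching is an assumption made for this sketch.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = expand_allowed(["python", "node", "api.example.com"])
print(is_allowed("https://pypi.org/simple/requests/", allowed))      # True
print(is_allowed("https://attacker.example.net/upload", allowed))    # False
print(is_allowed("https://api.example.com.evil.io/", allowed))       # False
```

Matching on the parsed hostname rather than on a substring of the URL is what makes the third request fail: the allowed domain appears in the string, but the host actually belongs to a different domain.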
Between the agent's output and the safe output jobs sits a threat detection job. This is a separate AI agent with a security-focused system prompt that analyzes the buffered output for threats such as malicious patches and secret leaks.
The detection job has no write permissions and no access to the original agent's runtime. It only sees the artifacts. Its sole output is a pass/fail verdict, and a fail verdict blocks the safe output jobs from running.
You can augment the AI detection with additional tools (Semgrep, TruffleHog) and custom prompts. For example, you might add: "Also reject any change to .github/workflows/ files."
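A rule like the workflow-file example can also be expressed as a deterministic check that runs alongside the AI review. The Python sketch below is a hypothetical illustration, not a gh-aw feature: it assumes the agent's change is available as a unified diff, and the `changed_paths` and `verdict` helpers are invented for this example.

```python
from pathlib import PurePosixPath

# The protected directory from the example rule above.
PROTECTED_DIR = PurePosixPath(".github/workflows")


def changed_paths(patch_text):
    """Collect file paths from the '--- a/' and '+++ b/' headers of a unified diff.

    Simplification: a removed content line that happens to start with
    '-- a/' would also match, which errs on the side of rejecting.
    """
    paths = set()
    for line in patch_text.splitlines():
        for prefix in ("--- a/", "+++ b/"):
            if line.startswith(prefix):
                paths.add(line[len(prefix):].strip())
    return paths


def verdict(patch_text):
    """Return 'fail' if the patch touches a protected path, otherwise 'pass'."""
    for path in changed_paths(patch_text):
        if PROTECTED_DIR in PurePosixPath(path).parents:
            return "fail"
    return "pass"


patch = """\
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1 +1 @@
-permissions: read-all
+permissions: write-all
"""
print(verdict(patch))  # fail
```

Like the detection job itself, this check only reads the artifact and returns a verdict. It never applies the patch.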
Before user-generated content (issue titles, PR bodies, comments) even reaches the agent, gh-aw sanitizes it. This protects the agent from prompt injection via input:
| Mechanism | What It Does | Protects Against |
|---|---|---|
| @mention neutralization | @user → `@user` | Unintended user notifications |
| Bot trigger protection | fixes #123 → `fixes #123` | Automatic issue linking |
| XML/HTML tag conversion | <script> → (script) | Injection via markup |
| URI filtering | Only HTTPS from trusted domains | Data exfiltration URLs |
| Content limits | 0.5 MB max, 65k lines max | Denial-of-service payloads |
| Control character removal | ANSI escapes stripped | Terminal manipulation |
Sanitization happens at the activation stage boundary — before the agent sees anything. The agent works with clean, normalized text.
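Several rows of the table above can be approximated with a few regular expressions. The following Python sketch is a rough illustration under stated assumptions, not gh-aw's sanitizer: the patterns, the order of the steps, and the choice to truncate oversized input (rather than reject it) are all assumptions, and URI filtering is left out.

```python
import re

MAX_BYTES = 512 * 1024   # 0.5 MB, from the table above
MAX_LINES = 65_000       # "65k lines", from the table above

# Assumed patterns, chosen for this sketch.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps tab, LF, CR
TAG = re.compile(r"<(/?[A-Za-z][^<>]*)>")
MENTION = re.compile(r"(?<![\w`])@([A-Za-z0-9][A-Za-z0-9-]*)")
BOT_TRIGGER = re.compile(
    r"\b((?:fix|fixes|fixed|close|closes|closed|resolve|resolves|resolved)\s+#\d+)",
    re.IGNORECASE,
)


def sanitize(text):
    """Neutralize untrusted text before it is handed to the agent."""
    text = ANSI_ESCAPE.sub("", text)            # strip ANSI escape sequences
    text = CONTROL_CHARS.sub("", text)          # strip remaining control characters
    text = TAG.sub(r"(\1)", text)               # <script> becomes (script)
    text = MENTION.sub(r"`@\1`", text)          # @user becomes `@user`
    text = BOT_TRIGGER.sub(r"`\1`", text)       # fixes #123 becomes `fixes #123`
    text = "\n".join(text.split("\n")[:MAX_LINES])
    # Truncate by bytes and drop any partial trailing character.
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")


raw = "Hi @octocat, this fixes #123 <script>alert(1)</script>\x1b[31m!"
print(sanitize(raw))
# Hi `@octocat`, this `fixes #123` (script)alert(1)(/script)!
```

The lookbehind in `MENTION` skips text such as email addresses, where the `@` follows a word character, so only standalone mentions are wrapped.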
If you've used the /comply pattern (or any independent reviewer pattern), you already understand the principle behind threat detection:
Your /comply hook uses an independent reviewer that examines an agent's work product without sharing its context. The threat detection pipeline is the same idea, formalized: a separate agent with a security-focused prompt reviews the output before it's applied. The difference is that gh-aw makes this a structural guarantee — it's not a suggestion, it's a gate that blocks execution.
The pattern scales: local agents use hooks as guardrails (best-effort enforcement via your coding tools), while gh-aw hardens this into kernel-enforced isolation + permission separation + AI threat analysis. Same layered thinking, different trust boundaries.
| Layer | Mechanism | What It Stops |
|---|---|---|
| Substrate | VM + Docker containers | Memory corruption, host escape |
| Substrate | AWF network firewall (iptables + Squid) | Data exfiltration, unauthorized API calls |
| Substrate | MCP server sandboxing | Container escape, unauthorized tool access |
| Configuration | Schema validation + expression allowlist | Invalid configs, injection via expressions |
| Configuration | Action SHA pinning | Supply chain attacks, tag hijacking |
| Configuration | Security scanners (actionlint, zizmor, poutine) | Privilege escalation, misconfigurations |
| Plan | Safe Outputs (permission separation) | Direct write access abuse |
| Plan | Threat detection (AI + scanners) | Malicious patches, secret leaks |
| Plan | Content sanitization | Prompt injection via input |
| Plan | Secret redaction | Credential leakage in logs/artifacts |
You now understand why you can trust an agent with autonomy: because the architecture makes destructive mistakes structurally impossible. In the next lesson, we'll move from theory to practice — writing your first gh-aw workflow that uses safe outputs to create a PR from an agent's work.