Cost Controls and Observability

Lesson 8 · Safe Agentic Workflows · ~8 minutes

Scheduled agents are powerful — they work while you sleep. But they also spend while you sleep. Without cost controls, a daily workflow can quietly accumulate hundreds of dollars per month in AI credits and Actions minutes. This lesson gives you the tools to prevent runaway costs and gain visibility into what your agents are actually doing.

The Cost Problem

Every workflow run incurs two costs:

AI Credits (AIC) — token consumption from the language model. More complex prompts, larger context windows, and more capable models all increase this.
Actions minutes — compute time for the runner executing your workflow. This includes time waiting for the model to respond.

⚠️ The Compound Effect

A single run might cost $5, which seems fine. But $5 × 365 days = $1,825/year, and that's one workflow. Most teams end up with 5–15 scheduled workflows. Without explicit budgets, each run is limited only by the default per-run cap, and total spend scales linearly with time and with the number of workflows.
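
The arithmetic behind the compound effect can be sketched in a few lines of Python. The $5 per run figure and the 5 to 15 workflow range come from the text above. The function name, and the simplification that every workflow runs once a day at the same cost, are assumptions made for illustration.

```python
def annual_cost(cost_per_run: float, runs_per_day: float = 1.0, workflows: int = 1) -> float:
    """Projected yearly spend, assuming identical workflows that run every day."""
    return cost_per_run * runs_per_day * 365 * workflows


# Figure from the text: $5 per run, one run per day, one workflow.
print(f"1 workflow: ${annual_cost(5.0):,.0f}/year")

# Hypothetical team sizes at the low and high end of the 5-15 range.
for n in (5, 15):
    print(f"{n} workflows: ${annual_cost(5.0, workflows=n):,.0f}/year")
```

Under these assumptions, five such workflows come to $9,125 a year and fifteen to $27,375.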

The core principle: every scheduled workflow must have an explicit budget. Never rely on defaults for production automation.

Hard Budget: `max-ai-credits`

The max-ai-credits frontmatter field sets a hard ceiling on AI credit consumption per run. If the workflow reaches this limit, it stops immediately — mid-analysis if necessary.

.github/workflows/daily-review.md

---
description: "Daily code review suggestions"

on:
  schedule: daily around 8am on weekdays

engine:
  id: copilot
  model: gpt-4.1-mini

max-ai-credits: 200
max-daily-ai-credits: 800

safe-outputs:
  create-issue:
    title-prefix: "[daily-review] "
    close-older-issues: true
---

Key behaviors of max-ai-credits:

Hard stop: the run terminates when the budget is hit. No partial outputs are published.
Per-run scope: each trigger gets its own budget. A daily workflow with max-ai-credits: 200 can spend up to 200 AIC on each run, so retries and manual triggers each draw a fresh 200.
Stacks with daily cap: use max-daily-ai-credits as a second layer of protection against multiple retries or manual triggers burning through budget.
Default is 1000: if you omit it, each run can spend up to 1000 AIC. Always set it explicitly.
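
How the per-run and daily caps combine can be modeled with a small Python sketch. This is a simplified model of the behavior described above, not the actual enforcement logic. The `Budget` class and its method are hypothetical, and the sketch assumes a run is cut off as soon as it reaches either the per-run ceiling or whatever remains of the daily allowance.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    per_run: int          # plays the role of max-ai-credits
    per_day: int          # plays the role of max-daily-ai-credits
    spent_today: int = 0

    def run(self, wanted: int) -> tuple[int, bool]:
        """Return (credits actually spent, whether the run was cut off)."""
        remaining_today = max(self.per_day - self.spent_today, 0)
        limit = min(self.per_run, remaining_today)
        spent = min(wanted, limit)
        self.spent_today += spent
        return spent, wanted > limit


budget = Budget(per_run=200, per_day=800)

# One normal scheduled run, then five runaway retries that each want 500 credits.
results = [budget.run(150)] + [budget.run(500) for _ in range(5)]
for spent, stopped in results:
    print(spent, "hard stop" if stopped else "ok")
print("total today:", budget.spent_today)
```

In this model the per-run cap limits each retry to 200 AIC, and the daily cap is what stops the sequence at 800 AIC in total.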

Model Selection and Cost Tradeoffs

Not every workflow needs the most capable model. Choosing the right model for the task is the single biggest cost lever you have.

Model	Best For	Relative Cost
`gpt-4.1-mini`	Scanning, summarization, pattern matching, routine monitoring	$ (cheapest)
`claude-haiku-4-5`	Fast classification, simple analysis, structured extraction	$
`gpt-4.1`	Code generation, complex reasoning, multi-step analysis	$$$
`claude-sonnet-4`	Nuanced writing, architectural analysis, code review	$$$
`o3` / `claude-opus-4`	Multi-step reasoning, complex refactoring, research	$$$$$

Rule of thumb: start every scheduled workflow with the cheapest model that produces acceptable output. Upgrade only when you see quality failures in the logs.

Optimization Levers

Beyond model selection, four levers reduce cost per run:

Lever	How	Impact
Tighter prompts	Remove preamble, examples, and instruction the model doesn't need. Be specific about output format.	20–40% fewer input tokens
Fewer output tokens	Ask for concise output. "Summarize in 3 bullets" costs less than "write a detailed report."	30–60% fewer output tokens
Skip-if conditions	Add `skip-if` logic so the workflow doesn't run when there's nothing to analyze (no new commits, no open PRs).	Eliminates entire runs
Scoped file reads	Use `permissions: contents: read` with path filters instead of reading the entire repo.	50–80% less context

skip-if example

---
on:
  schedule: daily around 9am on weekdays

skip-if:
  no-commits-since: 24h
  no-open-prs: true
---

# Only runs if there's actually something to review

Observability: `gh aw logs`

You can't optimize what you can't measure. The gh aw logs command shows exactly what each run consumed:

$ gh aw logs daily-review --last 7

Run ID      Date        Duration  AIC Used  Status
─────────   ──────────  ────────  ────────  ──────
a3f2c1e     Jun 28      42s       148       ✓ success
b7d4e2a     Jun 27      38s       135       ✓ success
c9a1f3b     Jun 26      1m12s     312       ✓ success  ← spike
d2e5a4c     Jun 25      35s       127       ✓ success
e4b6c7d     Jun 24      40s       141       ✓ success
f1a8d9e     Jun 23      0s        0         ⊘ skipped (no-commits)
g3c2e1f     Jun 22      0s        0         ⊘ skipped (weekend)

Key things to look for:

Cost spikes — a run that used 2× normal credits indicates the model encountered something unusual (large file, complex analysis). Investigate with gh aw logs <run-id> --detail.
Consistent skips — if a workflow skips 5 of 7 days, consider changing it to weekly.
Duration outliers — long runs may be hitting rate limits or retrying.
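
The first two checks can be sketched in Python. The run data is hand-copied from the sample output above, and parsing real `gh aw logs` output is not shown. The function names are hypothetical, the 2× threshold comes from the text, and using the median as the baseline is an assumption chosen so that the spike itself does not inflate "normal".

```python
from statistics import median

# (run_id, aic_used, status) copied from the sample output above.
runs = [
    ("a3f2c1e", 148, "success"),
    ("b7d4e2a", 135, "success"),
    ("c9a1f3b", 312, "success"),
    ("d2e5a4c", 127, "success"),
    ("e4b6c7d", 141, "success"),
    ("f1a8d9e", 0, "skipped"),
    ("g3c2e1f", 0, "skipped"),
]


def find_spikes(runs, factor=2.0):
    """Return IDs of executed runs that used more than factor x the median AIC."""
    executed = [(run_id, aic) for run_id, aic, status in runs if status != "skipped"]
    if not executed:
        return []
    baseline = median(aic for _, aic in executed)
    return [run_id for run_id, aic in executed if aic > factor * baseline]


def skip_ratio(runs):
    """Fraction of runs that were skipped (0.0 for an empty history)."""
    if not runs:
        return 0.0
    return sum(status == "skipped" for _, _, status in runs) / len(runs)


print(find_spikes(runs))
print(f"{skip_ratio(runs):.0%} skipped")
```

With the sample data, the median of the executed runs is 141 AIC, so only run c9a1f3b (312 AIC) exceeds the 2× threshold.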

Auditing Failures: `gh aw audit`

When a workflow fails or produces unexpected output, gh aw audit shows the full execution trace:

$ gh aw audit daily-review --run c9a1f3b

Run: c9a1f3b (Jun 26, 1m12s, 312 AIC)
Status: success (but above typical usage)

Timeline:
  00:00  Started. Model: gpt-4.1-mini
  00:02  Read 14 files (src/api/*.ts) — 8,200 tokens
  00:08  Read 6 files (tests/*.ts) — 4,100 tokens     ← unusual
  00:15  Model response: 2,800 tokens (analysis)
  00:42  Model response: 1,200 tokens (issue body)
  01:12  Output: created issue #247

Token breakdown:
  Input:  18,400 tokens (context + prompt)
  Output:  4,000 tokens (responses)
  Total:   312 AIC

Note: Input was 2x normal due to test file reads triggered by
      new test files added in commit a1b2c3d.

The audit trail answers "why did this run cost more?" and "what went wrong?" — essential for debugging scheduled automation that runs without supervision.

OpenTelemetry Integration

For teams running multiple workflows, CLI inspection doesn't scale. Export telemetry to your observability stack for dashboards and alerting:

.github/aw-config.yml

telemetry:
  otlp:
    endpoint: https://otel.yourcompany.com:4317
    headers:
      Authorization: "Bearer ${secrets.OTLP_TOKEN}"
    export:
      - traces      # Full execution spans
      - metrics     # AIC usage, duration, token counts
      - logs        # Model interactions (redacted)

  alerts:
    - name: budget-spike
      condition: aic_used > 2 * avg(aic_used, 7d)
      notify: slack:#agentic-ops

📊 Dashboards

Track AIC spend per workflow, cost trends over time, and model utilization in Grafana, Datadog, or any OTLP-compatible backend.

🚨 Alerting

Get notified about cost spikes, repeated failures, or budget exhaustion before they turn into a surprise on the monthly bill.

🔍 Traces

See exactly which files were read, which model calls were made, and where time was spent — per run.

📈 Trends

Spot workflows whose costs are creeping up as the repo grows, so you can optimize before the bill lands.

Outcomes Measurement

Cost control isn't just about spending less — it's about spending well. A workflow that costs $3/day but whose output is ignored every time is wasting $90/month. Track whether your automation actually helps:

Acceptance rate: how often do humans actually read the workflow's output? If issues get closed unread, the workflow isn't useful.
Action rate: how often does the output lead to a follow-up change, such as a commit or PR?
Time saved: does the automated report replace 20 minutes of manual investigation?
Error catch rate: for monitoring workflows, how often does it surface something a human wouldn't have noticed?
Noop rate: a high noop rate means the workflow runs but finds nothing. Consider reducing frequency.

$ gh aw stats daily-review --last 30d

Runs: 22 (8 skipped)
Avg cost: 145 AIC ($1.45/run)
Monthly spend: ~$32
Acceptance rate: 73% (issues read within 4h)
Action rate: 45% (led to a commit or PR within 24h)
Noop rate: 18%

A workflow with a low action rate isn't necessarily bad — a security scanner that finds nothing 90% of the time is doing its job. But a daily summary that's never read should be made weekly or killed.
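
The figures in the sample output can be reproduced from per-run records with a short Python sketch. The `RunOutcome` type and `summarize` function are hypothetical, the month of data is constructed to match the sample, and the $0.01 per AIC rate is only inferred from "145 AIC ($1.45/run)" rather than taken from any price list.

```python
from dataclasses import dataclass

USD_PER_AIC = 0.01  # inferred from "145 AIC ($1.45/run)"; not an official price


@dataclass
class RunOutcome:
    aic_used: int
    read: bool = False      # a human opened the output
    acted_on: bool = False  # the output led to a commit or PR
    noop: bool = False      # the run found nothing to report


def summarize(outcomes):
    """Aggregate cost and usefulness metrics over a list of executed runs."""
    n = len(outcomes)
    if n == 0:
        return {"runs": 0, "spend_usd": 0.0, "acceptance_rate": 0.0,
                "action_rate": 0.0, "noop_rate": 0.0}
    total_aic = sum(o.aic_used for o in outcomes)
    return {
        "runs": n,
        "spend_usd": round(total_aic * USD_PER_AIC, 2),
        "acceptance_rate": sum(o.read for o in outcomes) / n,
        "action_rate": sum(o.acted_on for o in outcomes) / n,
        "noop_rate": sum(o.noop for o in outcomes) / n,
    }


# Hypothetical month: 22 runs at 145 AIC each, 16 read, 10 acted on, 4 no-ops.
month = (
    [RunOutcome(145, read=True, acted_on=True) for _ in range(10)]
    + [RunOutcome(145, read=True) for _ in range(6)]
    + [RunOutcome(145, noop=True) for _ in range(4)]
    + [RunOutcome(145) for _ in range(2)]
)
print(summarize(month))
```

For this constructed month the spend is $31.90, with rates of roughly 73%, 45%, and 18%, in line with the sample output.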

Concurrency Controls

Prevent duplicate runs from piling up — especially when a workflow is triggered by both schedule and manual dispatch, or when a slow run overlaps with the next scheduled execution:

concurrency configuration

---
concurrency:
  group: daily-review-${branch}
  cancel-in-progress: true
---

Behaviors:

group — runs with the same group name are mutually exclusive. Only one runs at a time.
cancel-in-progress: true — if a new run starts while an old one is still going, the old one is cancelled. This prevents stacking.
Without concurrency — if a workflow takes 5 minutes but triggers every 3 minutes, you'll get overlapping runs that each consume full budget.
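
The overlap scenario in the last item can be simulated in Python. This is a toy model with a hypothetical `simulate` function, not a description of how the scheduler is implemented: runs are triggered at a fixed interval, each needs a fixed duration, and with cancellation enabled a new trigger cancels whatever run is still in progress.

```python
def simulate(duration: int, interval: int, triggers: int, cancel_in_progress: bool):
    """Model runs triggered every `interval` minutes that each need `duration` minutes.

    Returns (max runs active at once, number of runs that finish).
    """
    if triggers <= 0:
        return 0, 0
    starts = [i * interval for i in range(triggers)]
    runs = []
    for i, start in enumerate(starts):
        end = start + duration
        finished = True
        if cancel_in_progress and i + 1 < len(starts) and starts[i + 1] < end:
            # The next trigger arrives first and cancels this run.
            end = starts[i + 1]
            finished = False
        runs.append((start, end, finished))

    # Overlap can only increase when a run starts, so checking start times is enough.
    max_active = max(sum(s <= t < e for s, e, _ in runs) for t in starts)
    return max_active, sum(finished for _, _, finished in runs)


# The example from the text: a 5-minute run triggered every 3 minutes.
print(simulate(duration=5, interval=3, triggers=10, cancel_in_progress=False))
print(simulate(duration=5, interval=3, triggers=10, cancel_in_progress=True))
```

In this model, cancellation keeps at most one run active, but only the final run finishes, because every earlier run is cancelled before it completes. That suggests a workflow that routinely runs longer than its trigger interval also needs a longer interval or a faster run, not only a concurrency group.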

Staged Mode: Preview Before Going Live

When you first deploy a scheduled workflow — or after making significant changes — use staged: true to preview what it would do without actually publishing outputs:

staged mode

---
staged: true   # Runs the full workflow but doesn't publish outputs

safe-outputs:
  create-issue:
    title-prefix: "[daily-review] "
---

In staged mode:

The workflow runs end-to-end (consuming credits — so keep max-ai-credits low)
Outputs are captured but not published — no issues created, no comments posted
You review outputs with gh aw logs <run-id> --staged-output
Once satisfied, remove staged: true to go live

This is your safety net: validate cost, quality, and relevance before committing to daily spend.

Connection to Your Work

As you add scheduled workflows (the background agents from Lesson 5), cost control moves from "nice to have" to essential infrastructure. Here's the progression:

Stage	What to Do
First workflow	Set `max-ai-credits`, use cheapest model, enable `staged: true`
2–5 workflows	Add `max-daily-ai-credits`, check `gh aw logs` weekly, tune budgets down
5+ workflows	Set up OTLP export, create a cost dashboard, add spike alerts
Team-wide adoption	Establish org-level budget policies, review outcomes monthly, kill low-value workflows

✅ Cost Optimization Checklist

Set explicit max-ai-credits on every scheduled workflow (never rely on the 1000 default)
Set max-daily-ai-credits as a safety net against retries and manual triggers
Use the cheapest model that produces acceptable output (gpt-4.1-mini or claude-haiku-4-5 for monitoring)
Add skip-if conditions to avoid running when there's nothing to analyze
Add concurrency config to prevent overlapping runs
Deploy new workflows with staged: true first — review output before going live
Check gh aw logs weekly for cost spikes and unnecessary runs
Track acceptance rate — kill or reduce frequency of workflows nobody reads
Keep prompts tight: specify output format, limit scope, remove unnecessary instructions
Scope file reads with path filters instead of reading the entire repository
Set up OTLP export and alerts once you have 5+ active workflows

Check Your Understanding

What happens when a workflow run hits its max-ai-credits limit?

The run continues but switches to a cheaper model for the remainder
The run pauses and waits for manual approval to continue
The run stops immediately — no partial outputs are published
The run completes but the overage is charged at a penalty rate

Correct — max-ai-credits is a hard stop. When the budget is exhausted, the workflow terminates immediately and no outputs are published. This prevents a runaway analysis from producing incomplete or misleading results.

Not quite. max-ai-credits is a hard ceiling — when reached, the run stops immediately with no partial outputs published. It doesn't downgrade models, pause for approval, or charge a penalty. The workflow simply terminates to protect your budget.

What is the purpose of staged: true in a workflow?

It runs the workflow in a sandboxed environment with no repository access
It runs the full workflow but captures outputs without publishing them, so you can preview before going live
It splits the workflow into stages that require manual approval between each step
It schedules the workflow to run only during off-peak hours to save on compute costs

Right — staged mode runs the complete workflow end-to-end (consuming credits) but holds all outputs without publishing. You review them with gh aw logs <run-id> --staged-output and remove staged: true once you're confident in the quality and cost.

Staged mode runs the full workflow normally — including model calls and file reads — but captures outputs without publishing them (no issues created, no comments posted). It's a preview mechanism so you can validate cost and quality before committing to ongoing automated output.

How would you identify which of your scheduled workflows is the most expensive?

Check the GitHub billing page under "AI Credits" for a workflow-level breakdown
Run gh aw audit --all and sort by error count
Run gh aw logs for each workflow and compare the AIC Used column, or use OTLP dashboards for aggregate views
Check the max-ai-credits value in each workflow file — the highest budget is the most expensive

Correct — gh aw logs shows actual AIC consumed per run, which you can compare across workflows. For teams with many workflows, OTLP dashboards aggregate this automatically. Note that max-ai-credits is the budget cap, not actual spend — a workflow might use far less than its limit.

The max-ai-credits value is just the budget cap — actual spend may be much lower. To find the most expensive workflow, check gh aw logs for actual AIC usage per run, or set up OTLP dashboards that aggregate cost metrics across all workflows. The billing page exists but doesn't break down by individual workflow.

What's Next

You now know how to keep your agentic workflows on a budget and how to see exactly what they're doing. Cost controls and observability aren't optional add-ons — they're the foundation that makes all your other automation sustainable. The next lesson covers testing and validation — how to verify that your workflows produce correct outputs before trusting them in production.

Primary Source

Cost Management Reference — complete documentation on max-ai-credits, max-daily-ai-credits, model pricing, OTLP export configuration, and budget alerting.

