You can't prove Graphify helps without a "before" measurement. This lesson sets up a lightweight evaluation protocol you'll run before building your first graph — so you have real numbers to compare against.
After this lesson, you'll have 2-3 baseline task measurements that you can re-run with Graphify later to produce a concrete before/after comparison.
The published Graphify benchmarks report 6x–49x token reduction on 100–500 file repos. But those numbers measure a specific thing: tokens consumed during code discovery (grep/glob). Your actual workflow might benefit differently — fewer round-trips, higher first-attempt correctness, faster pattern reuse. You need your own numbers.
Pick metrics that matter for your stated goal (less back-and-forth, more predictable outcomes):
| Metric | How to capture | Why it matters |
|---|---|---|
| Round-trips | Count human→agent message pairs until task completes | Direct measure of "back and forth" |
| Token usage | graphify benchmark (after graph exists) or Anthropic dashboard | Cost proxy; also correlates with context bloat |
| First-attempt correctness | Did the agent's first code output pass tests / match patterns? (yes/no) | Measures whether graph context improves precision |
| Wall-clock time | Start to "task done" timestamp | End-to-end productivity signal |
Keep it simple. You don't need all four. Pick round-trips (easiest to count) plus one other. Resist the urge to over-instrument.
Choose 2-3 tasks that are representative of your real work on the DocumentDB project and repeatable (you could give the same prompt to the agent twice and compare). Good candidates:
Run these tasks without Graphify first. That's the whole point — you need the baseline before the intervention.
For each task, record this in a simple markdown file:
# Eval: [Task Name]
## Setup
- Date: YYYY-MM-DD
- Tool: Claude Code / Kiro CLI
- Graphify: off / on
- Project: DocumentDB Perf Tests
## Prompt
[Exact prompt you gave the agent]
## Results
- Round-trips: N
- First-attempt correct: yes/no
- Wall-clock: Xm Ys
- Notes: [anything notable about the session]
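Creating each record from the same template keeps the runs comparable. The following is a minimal sketch, assuming a Bash-compatible shell; the function name `new_eval` and the `EVAL_DIR` variable are hypothetical helpers for illustration and are not part of Graphify. It writes the template above to `baseline-[task-slug].md` with today's date filled in and `Graphify: off`, and leaves the remaining placeholders for you to edit by hand.

```shell
# Hypothetical helper (not part of Graphify): scaffold a baseline eval record.
# EVAL_DIR defaults to the directory used in this lesson.
EVAL_DIR="${EVAL_DIR:-$HOME/Graphify/evals}"

new_eval() {
  local slug="$1"                          # task slug, e.g. add-latency-test
  local file="$EVAL_DIR/baseline-$slug.md"

  mkdir -p "$EVAL_DIR"

  # Refuse to overwrite a record that already exists.
  if [ -e "$file" ]; then
    echo "already exists: $file" >&2
    return 1
  fi

  # Write the lesson's template; only the date is filled in automatically.
  cat > "$file" <<EOF
# Eval: $slug
## Setup
- Date: $(date +%F)
- Tool: Claude Code / Kiro CLI
- Graphify: off
- Project: DocumentDB Perf Tests
## Prompt
[Exact prompt you gave the agent]
## Results
- Round-trips: N
- First-attempt correct: yes/no
- Wall-clock: Xm Ys
- Notes: [anything notable about the session]
EOF

  # Print the path so it can be opened in an editor.
  echo "$file"
}
```

Calling `new_eval add-latency-test` (the slug here is only an example) would create `~/Graphify/evals/baseline-add-latency-test.md`. The overwrite check is a deliberate choice: a baseline that gets silently replaced defeats the purpose of having one.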
Create an eval directory in this workspace:
mkdir -p ~/Graphify/evals
Pick your 2-3 tasks from the list above (or invent your own — they should match real work you'd do this week).
Open a fresh Kiro CLI or Claude Code session on the DocumentDB project — without building a Graphify graph first.
Run each task. Count round-trips. Note if the first output was correct. Record wall-clock time.
Save each result:
~/Graphify/evals/baseline-[task-slug].md
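Wall-clock time is easy to misjudge from memory, so it may help to capture it mechanically. This is a rough sketch, assuming a shell where `date +%s` prints seconds since the epoch (true for GNU and BSD `date`); `format_elapsed` is a hypothetical helper name. It converts elapsed seconds into the `Xm Ys` form used in the results template.

```shell
# Hypothetical helper: format a number of seconds as "Xm Ys".
format_elapsed() {
  local secs="$1"
  printf '%dm %ds\n' "$((secs / 60))" "$((secs % 60))"
}

# Usage: record the start, work through the task with the agent, then
# record the end and print the elapsed time for the Wall-clock field.
start=$(date +%s)
# ... run the task in your agent session here ...
end=$(date +%s)
format_elapsed "$((end - start))"
```

In practice you would run the `start=` line before sending the first prompt and the last two lines once the task is done, in the same terminal so that `start` is still set.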
Once you have baseline numbers, you'll:
You'll also run graphify benchmark after building the graph. It measures the theoretical token reduction (graph size vs. raw file corpus), which is an upper bound on what you can save. Expect the savings you measure on real tasks to be smaller, but to move in the same direction.
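Once a task has both a baseline record and a record from the re-run with Graphify, the comparison itself can be a one-liner per metric. The sketch below assumes both files follow the results template with `N` replaced by an integer; the helper names `round_trips` and `compare_round_trips` are hypothetical, and so is any file name you choose for the "Graphify on" record, since the lesson only specifies the baseline name.

```shell
# Hypothetical helper: extract the integer from the "- Round-trips: N" line.
round_trips() {
  sed -n 's/^- Round-trips: *\([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1
}

# Hypothetical helper: print before/after round-trips and the difference.
# $1 = baseline record, $2 = record from the run with Graphify on.
compare_round_trips() {
  local before after
  before="$(round_trips "$1")"
  after="$(round_trips "$2")"

  # Fail loudly if either file still has the "N" placeholder.
  if [ -z "$before" ] || [ -z "$after" ]; then
    echo "missing Round-trips value in one of the files" >&2
    return 1
  fi

  echo "round-trips: $before -> $after (change: $((after - before)))"
}
```

A negative change means fewer round-trips with Graphify on. The same pattern could be repeated for the other fields, though yes/no and `Xm Ys` values would need their own parsing.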
Based on independent benchmarks for repos in the 100–500 file range:
If your DocumentDB project is 500+ files, you may see higher savings.
Why must the baseline be captured before building the Graphify graph?
Which metric most directly measures "less back and forth with the agent"?
mkdir -p ~/Graphify/evals
~/Graphify/evals/baseline-*.md
Read Graphify Review: The 71x Claude Code Token Savings, specifically the "Where Does the 71x Token Savings Claim Come From?" section, for the methodology behind the published numbers and realistic expectations by codebase size.