Capture Your Baseline

Lesson 2 · Learn Graphify · ~15 minutes

You can't prove Graphify helps without a "before" measurement. This lesson sets up a lightweight evaluation protocol you'll run before building your first graph — so you have real numbers to compare against.

Win

After this lesson, you'll have 2-3 baseline task measurements that you can re-run with Graphify later to produce a concrete before/after comparison.

Why Evaluate

The published Graphify benchmarks report 6x–49x token reduction on 100–500 file repos. But those numbers measure a specific thing: tokens consumed during code discovery (grep/glob). Your actual workflow might benefit differently — fewer round-trips, higher first-attempt correctness, faster pattern reuse. You need your own numbers.

What to Measure

Pick metrics that matter for your stated goal (less back-and-forth, more predictable outcomes):

| Metric | How to capture | Why it matters |
| --- | --- | --- |
| Round-trips | Count human→agent message pairs until task completes | Direct measure of "back and forth" |
| Token usage | `graphify benchmark` (after graph exists) or Anthropic dashboard | Cost proxy; also correlates with context bloat |
| First-attempt correctness | Did the agent's first code output pass tests / match patterns? (yes/no) | Measures whether graph context improves precision |
| Wall-clock time | Start to "task done" timestamp | End-to-end productivity signal |

Note

Keep it simple. You don't need all four. Pick round-trips (easiest to count) plus one other. Resist the urge to over-instrument.
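If you'd rather not count round-trips by hand, here is a minimal sketch of the counting rule. It assumes you can get the session into a list of (role, text) tuples; that transcript shape and the role names "human" and "agent" are hypothetical, not a format exported by Claude Code, Kiro CLI, or Graphify.

```python
from typing import Iterable, Tuple


def count_round_trips(messages: Iterable[Tuple[str, str]]) -> int:
    """Count human->agent message pairs in a session transcript.

    `messages` is a hypothetical list of (role, text) tuples. A round-trip
    is one or more human messages followed by an agent reply; consecutive
    agent messages after that reply are not counted again.
    """
    round_trips = 0
    awaiting_reply = False
    for role, _text in messages:
        if role == "human":
            awaiting_reply = True
        elif role == "agent" and awaiting_reply:
            round_trips += 1
            awaiting_reply = False
    return round_trips


# Made-up session for illustration.
session = [
    ("human", "Write a new test file for operator X following operator Y."),
    ("agent", "Here is a draft..."),
    ("human", "The fixture name is wrong; check the existing tests."),
    ("agent", "Searching..."),
    ("agent", "Updated draft."),
]
print(count_round_trips(session))  # 2
```

Whatever rule you pick, apply the same one to the baseline and the Graphify run so the two counts stay comparable.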

Pick Your Eval Tasks

Choose 2-3 tasks that are representative of your real work on the DocumentDB project and repeatable (you could give the same prompt to the agent twice and compare). Good candidates:

Pattern reuse task — "Write a new test file for operator X following the existing pattern in operator Y." The agent needs to discover the pattern first.
Discovery task — "What assertion helpers are available and when should I use each one?" The agent needs to navigate the framework.
Integration task — "Add a new error code constant and use it in an existing test." Requires finding the right file, understanding conventions, and making a change.

Important

Run these tasks without Graphify first. That's the whole point — you need the baseline before the intervention.

The Eval Record Template

For each task, record this in a simple markdown file:

# Eval: [Task Name]

## Setup
- Date: YYYY-MM-DD
- Tool: Claude Code / Kiro CLI
- Graphify: off / on
- Project: DocumentDB Perf Tests

## Prompt
[Exact prompt you gave the agent]

## Results
- Round-trips: N
- First-attempt correct: yes/no
- Wall-clock: Xm Ys
- Notes: [anything notable about the session]
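
If you prefer to generate the record rather than type it, the template can be filled from a small data structure. This is only a sketch: `EvalRecord` and `render_record` are illustrative names invented here, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One eval run; field names are illustrative, not part of Graphify."""
    task_name: str
    date: str                    # YYYY-MM-DD
    tool: str                    # "Claude Code" or "Kiro CLI"
    graphify: str                # "off" for the baseline, "on" for the re-run
    prompt: str                  # exact prompt given to the agent
    round_trips: int
    first_attempt_correct: bool
    wall_clock_seconds: int
    notes: str = ""
    project: str = "DocumentDB Perf Tests"


def render_record(record: EvalRecord) -> str:
    """Render the record using the same headings as the template above."""
    minutes, seconds = divmod(record.wall_clock_seconds, 60)
    correct = "yes" if record.first_attempt_correct else "no"
    lines = [
        f"# Eval: {record.task_name}",
        "",
        "## Setup",
        f"- Date: {record.date}",
        f"- Tool: {record.tool}",
        f"- Graphify: {record.graphify}",
        f"- Project: {record.project}",
        "",
        "## Prompt",
        record.prompt,
        "",
        "## Results",
        f"- Round-trips: {record.round_trips}",
        f"- First-attempt correct: {correct}",
        f"- Wall-clock: {minutes}m {seconds}s",
        f"- Notes: {record.notes}",
    ]
    return "\n".join(lines) + "\n"


# Example values are made up for illustration.
example = EvalRecord(
    task_name="Discovery task",
    date="2025-01-15",
    tool="Claude Code",
    graphify="off",
    prompt="What assertion helpers are available and when should I use each one?",
    round_trips=5,
    first_attempt_correct=False,
    wall_clock_seconds=750,
    notes="Agent searched the repo several times before answering.",
)
print(render_record(example))
```

Keeping the numbers in structured fields makes the later before/after comparison easier than parsing them back out of markdown.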

Step-by-Step: Run Your Baseline

Create an eval directory in this workspace:

mkdir -p ~/Graphify/evals

Pick your 2-3 tasks from the list above (or invent your own — they should match real work you'd do this week).

Open a fresh Kiro CLI or Claude Code session on the DocumentDB project — without building a Graphify graph first.

Run each task. Count round-trips. Note if the first output was correct. Record wall-clock time.

Save each result:

~/Graphify/evals/baseline-[task-slug].md
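
The save step can be scripted too. The sketch below turns a task name into a slug and writes the record to the `baseline-[task-slug].md` path shown above; `task_slug` and `save_baseline` are hypothetical helpers, and the slug rule (lowercase, hyphens for everything else) is just one reasonable choice.

```python
import re
from pathlib import Path

# Same location as the mkdir step above.
DEFAULT_EVALS_DIR = Path.home() / "Graphify" / "evals"


def task_slug(task_name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", task_name.lower()).strip("-")
    return slug or "task"


def save_baseline(task_name: str, markdown: str,
                  evals_dir: Path = DEFAULT_EVALS_DIR) -> Path:
    """Write one baseline record and return the path it was saved to."""
    evals_dir.mkdir(parents=True, exist_ok=True)
    path = evals_dir / f"baseline-{task_slug(task_name)}.md"
    path.write_text(markdown, encoding="utf-8")
    return path


print(task_slug("Pattern reuse: operator X"))  # pattern-reuse-operator-x
```

Calling `save_baseline("Pattern reuse: operator X", record_text)` would then write `~/Graphify/evals/baseline-pattern-reuse-operator-x.md`, where `record_text` is the filled-in template.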

After the Baseline

Once you have baseline numbers, you'll:

Build your Graphify graph (Lesson 1 steps)
Re-run the same tasks with graph context active
Compare side-by-side

You'll also run `graphify benchmark` after building the graph; it measures the theoretical token reduction (graph size vs. raw file corpus). This gives you the "max possible" savings; your actual savings will be lower but should move in the same direction.
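
The side-by-side comparison itself is simple arithmetic: divide the baseline value by the with-graph value to get an "Nx" factor. Here is a rough sketch, assuming you've copied the numbers from your two eval records into plain dictionaries; the key names and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Union

Metrics = Dict[str, Union[int, bool]]


def improvement_factor(baseline: float, with_graph: float) -> float:
    """Return how many times smaller the with-graph value is (2.0 = 2x)."""
    if with_graph <= 0:
        raise ValueError("with_graph must be positive")
    return baseline / with_graph


def compare(baseline: Metrics, with_graph: Metrics) -> List[str]:
    """Build one comparison line per metric."""
    rows = []
    for metric in ("round_trips", "wall_clock_seconds"):
        before, after = baseline[metric], with_graph[metric]
        factor = improvement_factor(before, after)
        rows.append(f"{metric}: {before} -> {after} ({factor:.1f}x)")
    # Correctness is yes/no, so show the change rather than a ratio.
    yes_no = {True: "yes", False: "no"}
    before_ok = yes_no[bool(baseline["first_attempt_correct"])]
    after_ok = yes_no[bool(with_graph["first_attempt_correct"])]
    rows.append(f"first_attempt_correct: {before_ok} -> {after_ok}")
    return rows


# Illustrative numbers only, not measured results.
baseline = {"round_trips": 8, "wall_clock_seconds": 900,
            "first_attempt_correct": False}
with_graph = {"round_trips": 3, "wall_clock_seconds": 450,
              "first_attempt_correct": True}
for row in compare(baseline, with_graph):
    print(row)
```

With only 2-3 tasks, treat the factors as directional evidence rather than a statistically solid result.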

Expected Results

Based on independent benchmarks for repos in the 100–500 file range:

Token savings: 6x–15x
Round-trip reduction: 2x–4x (agent finds the right files faster, fewer "let me search for that" cycles)
First-attempt correctness: likely improves for pattern-reuse tasks where the agent can see existing conventions in the graph

If your DocumentDB project is 500+ files, you may see higher savings.
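
To turn these ranges into concrete targets for your own tasks, divide your baseline by each end of the range. A small sketch, where `expected_range` is a hypothetical helper and the baseline values (8 round-trips, 120,000 tokens) are invented for the example:

```python
from typing import Tuple


def expected_range(baseline: float, low_factor: float,
                   high_factor: float) -> Tuple[float, float]:
    """Return (best case, worst case) for a metric given a reduction range."""
    return baseline / high_factor, baseline / low_factor


print(expected_range(8, 2, 4))         # round-trips at 2x-4x: (2.0, 4.0)
print(expected_range(120_000, 6, 15))  # tokens at 6x-15x: (8000.0, 20000.0)
```

If your re-run lands well outside these ranges, check that the prompt and setup matched the baseline before drawing conclusions.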

Verify Your Understanding

Why must the baseline be captured before building the Graphify graph?

Building the graph corrupts the original source files irreversibly
You need an uncontaminated comparison point to prove improvement
The graph build process consumes tokens that skew measurements
Graphify disables non-graph search after the initial graph build

Which metric most directly measures "less back and forth with the agent"?

Token usage reported on the Anthropic billing dashboard page
Wall-clock time from the first prompt to the completed task
Round-trips counted as human-to-agent message pairs per task
First-attempt correctness measured as a binary pass or fail

Do This Now

Run `mkdir -p ~/Graphify/evals`
Pick 2-3 eval tasks representative of your DocumentDB work
Run them in a fresh session without Graphify and record results
Save as `~/Graphify/evals/baseline-[task-slug].md`

Primary Source

Read "Graphify Review: The 71x Claude Code Token Savings", specifically the "Where Does the 71x Token Savings Claim Come From?" section, for the methodology behind the published numbers and realistic expectations by codebase size.