The Benchmark Loop

Lesson 4 · Learn Graphify · ~8 minutes

You've built a graph and captured your baseline. Now measure the theoretical maximum savings — and learn how to use benchmarking as a feedback loop for graph quality.

Win

After this lesson, you can run graphify benchmark, interpret the output, and use it to decide whether your graph needs tuning.

What the Benchmark Measures

The benchmark command compares two approaches:

Naive approach: Read every source file into context (total token count of your codebase)
Graph approach: Read graph.json into context (token count of the graph)

The ratio between these is your theoretical token reduction. It represents the maximum possible savings if the agent never needs to read a source file at all.

graphify benchmark

Reading the Output

# Example output:
# Source corpus: 1,247,832 tokens (423 files)
# Graph size:      18,442 tokens
# Reduction: 67.7x
# 
# Largest communities by token weight:
#   Test Framework Assertions: 4,211 tokens (22.8%)
#   MongoDB Connection Pool:   3,102 tokens (16.8%)
#   Parametrize Helpers:       2,889 tokens (15.7%)

Field	What it tells you
Source corpus	What the agent would consume without a graph
Graph size	What the agent consumes when it reads graph.json instead of the source files
Reduction	Theoretical max savings (your ceiling)
Community weights	Where the graph spends its tokens; heavily connected areas cost more
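
To make the fields concrete, here is a minimal sketch that recomputes the example output by hand. It assumes the reduction is simply source tokens divided by graph tokens, and that each community percentage is that community's share of the graph size. Both assumptions reproduce the example figures, but this is a check on the arithmetic, not Graphify's own code.

```python
# Figures copied from the example output above.
source_tokens = 1_247_832
graph_tokens = 18_442
communities = {
    "Test Framework Assertions": 4_211,
    "MongoDB Connection Pool": 3_102,
    "Parametrize Helpers": 2_889,
}

# Reduction = source corpus tokens / graph tokens.
reduction = source_tokens / graph_tokens
print(f"Reduction: {reduction:.1f}x")

# Assumed: a community's percentage is its token weight over the whole graph.
shares = {name: 100 * tokens / graph_tokens for name, tokens in communities.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The printed values match the 67.7x reduction and the 22.8%, 16.8%, and 15.7% community weights shown in the example.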

Theoretical vs. Actual

Your actual savings will be lower than the benchmark number. The agent still reads some source files after consulting the graph — it just reads fewer of them, and the right ones.

Rules of thumb:

Benchmark says 60x+: Expect actual savings of 10x–20x in practice
Benchmark says 10x–30x: Expect actual savings of 3x–8x
Benchmark says <5x: Small codebase. The graph adds minimal value; the agent could just read everything

Ratios that fall between these bands tend to land between the neighboring ranges.

Note

If your benchmark is under 5x, your project might be small enough that Graphify isn't worth the overhead. That's fine — it's a tool for scale.
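
The rules of thumb above can be sketched as a small lookup. The function names are hypothetical and exist only for illustration. The bands are copied from this lesson, and ratios the lesson gives no range for return None rather than a guess.

```python
from typing import Optional, Tuple


def expected_actual_savings(benchmark_ratio: float) -> Optional[Tuple[float, float]]:
    """Return the (low, high) actual-savings range from the rules of thumb.

    Returns None when the lesson states no range for that ratio.
    """
    if benchmark_ratio >= 60:
        return (10.0, 20.0)
    if 10 <= benchmark_ratio <= 30:
        return (3.0, 8.0)
    return None


def worth_graphing(benchmark_ratio: float) -> bool:
    """Under 5x, the lesson suggests the agent could just read everything."""
    return benchmark_ratio >= 5


for ratio in (67.7, 20.0, 4.0):
    print(ratio, expected_actual_savings(ratio), worth_graphing(ratio))
```

Treat the returned ranges as rough expectations to compare against your own measured sessions, not as guarantees.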

Using Benchmarks as a Feedback Loop

Run graphify benchmark after each significant change to your graph config:

After initial build — record your baseline ratio
After excluding files (tests, generated code) — did the ratio improve?
After a major refactor — did the graph grow proportionally or blow up?
After switching clustering backends — did community quality change?

If the ratio drops significantly after a code change, the graph has grown faster than the source and may be bloated. Consider excluding generated files or using --no-cluster for a lean extraction.
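
One way to run this loop is to log the two token counts after each benchmark and flag any run whose ratio falls sharply. The sketch below assumes you copy the numbers from the benchmark output by hand. The 20% threshold, the function name, and the "after refactor" figures are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical threshold: flag a run whose ratio is more than 20% below the previous run.
DROP_THRESHOLD = 0.20


def flag_drops(runs: List[Tuple[str, int, int]], threshold: float = DROP_THRESHOLD) -> List[str]:
    """Return labels of runs whose ratio dropped sharply versus the previous run.

    runs holds (label, source_tokens, graph_tokens) in chronological order.
    """
    flagged = []
    previous_ratio = None
    for label, source_tokens, graph_tokens in runs:
        ratio = source_tokens / graph_tokens
        if previous_ratio is not None and ratio < previous_ratio * (1 - threshold):
            flagged.append(label)
        previous_ratio = ratio
    return flagged


runs = [
    ("initial build", 1_247_832, 18_442),   # figures from the example output
    ("after refactor", 1_300_000, 30_000),  # made-up figures for illustration
]
print(flag_drops(runs))
```

Here the source grew slightly while the graph grew by more than half, so the second run is flagged as a candidate for tuning.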

Excluding Files for Better Ratios

Large codebases often have files that add noise without value:

# Create a .graphifyignore file (same syntax as .gitignore).
# Note: > overwrites any existing .graphifyignore; use >> to append instead.
cat > .graphifyignore <<'EOF'
*.generated.py
docs/
__pycache__/
*.pyc
EOF

# Rebuild the graph
graphify .

Re-run graphify benchmark after excluding — you want a graph that's dense with signal, not padded with generated code.
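
Before rebuilding, it can help to preview which files a set of patterns would drop. The sketch below is a rough approximation of .gitignore-style matching using Python's standard library. It is not Graphify's matcher, it ignores features such as negation (!), anchoring, and **, and the file paths are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import List


def is_ignored(path: str, patterns: List[str]) -> bool:
    """Rough approximation of .gitignore-style matching, for illustration only."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches if any parent directory has that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch(parts[-1], pattern):
            # File pattern: compare against the file name only.
            return True
    return False


patterns = ["*.generated.py", "docs/", "__pycache__/", "*.pyc"]
files = [
    "src/app.py",
    "src/models.generated.py",
    "docs/index.md",
    "src/__pycache__/app.cpython-311.pyc",
]
kept = [f for f in files if not is_ignored(f, patterns)]
print(kept)
```

Only src/app.py survives in this made-up listing. If the preview drops files you care about, adjust the patterns before rebuilding.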

Verify Your Understanding

If graphify benchmark reports a 45x reduction, what actual savings should you expect?

Exactly 45x fewer tokens in every AI session
Roughly 8x–15x in practice, since the agent still reads some files
Zero savings — the benchmark only measures disk space
45x on first query, then diminishing returns each subsequent query

When should you not bother with Graphify?

When your project uses Python instead of TypeScript
When the benchmark ratio is under 5x — the codebase is small enough to read directly
When you're using Claude Code instead of Kiro CLI
When your project has more than 3 contributors

Do This Now

Run graphify benchmark on your project
Record the reduction ratio alongside your Lesson 2 baseline numbers
If the ratio is under 5x, consider whether the project needs a graph at all
Try adding a .graphifyignore and re-benchmarking — see if the ratio improves