LLM Wiki — Agentic AI Landscape

Mon, 01 Jan 0001 00:00:00 +0000

Cross-Source Theme Analysis#

16 sources, 8 tools, 2 standards, 3 methodologies, 1 practitioner account, 2 skill/eval resources. Here are the themes that appear across 3+ sources independently — not because they reference each other, but because they converged on the same ideas.

Note: This analysis was originally written against 11 sources. The 5 newest sources (Paperclip, Spec Kit, BMad Method, Anthropic Eval Guide, Promptfoo) strengthen existing themes — particularly Theme 3 (human-in-the-loop spectrum) and Theme 7 (evaluation). A full refresh is recommended when the wiki reaches 20+ sources.

Mon, 01 Jan 0001 00:00:00 +0000

How to Eval a Skill (Practical Guide)#

Anthropic’s prompt evals measure whether a prompt produces good output. Skill evals are harder because a skill has more surface area: it needs to trigger correctly, execute the right steps, use the right tools, produce the right output, and NOT trigger on the wrong inputs.

This guide maps Anthropic’s eval methodology onto skills, drawing from the wiki’s sources.

The Key Difference: Prompts vs. Skills#

	Prompt Eval	Skill Eval
What you test	Does this prompt produce good output?	Does this skill trigger, execute, and produce correctly?
Input	A prompt + expected output	A prompt + context + expected behavior chain
Failure modes	Bad output	Wrong trigger, wrong steps, wrong tools, bad output, false positive activation
Non-determinism	Output varies	Trigger, routing, tool selection, AND output all vary

A skill eval must test the full chain: routing → activation → execution → output → side effects.

Mon, 01 Jan 0001 00:00:00 +0000

Key Insights: The Agentic AI Landscape (April 2026)#

Synthesized from 16 sources across this wiki. This analysis captures the patterns, tensions, and emerging consensus visible when you look across the entire landscape rather than at any single tool.

1. Five Layers Are Emerging#

The landscape has organized into five distinct layers:

Layer	Representative	Core Bet
Company	paperclip	Orchestrate agents into companies with org charts, budgets, governance.
Methodology	spec-kit, bmad-method	Structure the development process. Specs before code, or adaptive agile workflows.
Infrastructure	scion (GCP)	The agent runtime is the hard problem. Be a hypervisor. Stay agnostic.
Product	kiro (AWS)	Ship an opinionated end-to-end agent. Autonomy and scale matter most.
Tool	claude-code (Anthropic), fabric	Make the individual agent excellent. Let users compose upward.

The methodology layer is new — spec-kit (“specs before code”) and bmad-method (“expert collaboration over autopilot”) represent two competing philosophies for structuring AI-assisted development. paperclip adds the company layer above everything, orchestrating agents into organizations with budgets and governance.

Mon, 01 Jan 0001 00:00:00 +0000

Evidence Map: Supporting the Ten Pillars Framework#

This analysis maps each pillar from ten-pillars-agentic-skill-design against real-world evidence collected across 11 sources in this wiki. Your paper acknowledged “no original controlled study” as a limitation — the wiki now provides post-hoc validation from production implementations.

Pillar 1: Architecture and Structure#

Your claim: Organize content into clearly defined sections — metadata, interfaces, core logic, workflows, configuration.

Supporting evidence:

agent-skills-standard codified this into a formal spec: SKILL.md with YAML frontmatter (name, description, license, compatibility, metadata, allowed-tools) + markdown body + optional scripts/references/assets directories. This is now an open standard at agentskills.io.
claude-code implements it: .claude directory with CLAUDE.md, .claude/rules/, .claude/skills/, .claude/agents/. Hierarchical, scoped (org → project → user → local).
pai takes it further: USER/ vs SYSTEM/ separation. Six layers of customization (identity, preferences, workflows, skills, hooks, memory). Upgrade-safe architecture.
skills-pipeline-sleestk follows the spec exactly: each skill is a directory with SKILL.md + references/ subdirectory.

Strength: Strong. Multiple independent implementations converged on the same structure. The Agent Skills spec formalizes what your paper recommended.

Mon, 01 Jan 0001 00:00:00 +0000

Agent Skills Standard#

An open standard (agentskills.io) for packaging reusable capabilities that AI agents can load dynamically. Initiated by anthropic, designed for cross-tool compatibility.

Specification#

A skill is a directory with a SKILL.md file:

skill-name/
├── SKILL.md # Required: YAML frontmatter + markdown instructions
├── scripts/ # Optional: executable code
├── references/ # Optional: documentation
├── assets/ # Optional: templates, resources

Required Frontmatter#

name: Max 64 chars, lowercase + hyphens, must match directory name
description: Max 1024 chars, what it does and when to use it

Optional Frontmatter#

license, compatibility, metadata, allowed-tools (experimental)

Progressive Disclosure#

The key design principle — skills scale without blowing up context:

Mon, 01 Jan 0001 00:00:00 +0000

Agent State Model#

scion tracks agent state across three dimensions:

Phase (Lifecycle)#

The infrastructure lifecycle of the agent container:

created → provisioning → cloning → starting → running → stopping → stopped (or error)

Activity (Cognitive State)#

What the agent is doing within the running phase:

Sticky activities: completed, blocked, and limits_exceeded persist until explicit restart or stop
blocked: Set by agents themselves when waiting for an expected event (e.g., child agent completing) — prevents false stall detection
offline: Occurs when heartbeat is lost, often due to auth token refresh failure. Fix: stop and restart the agent.

Detail (Freeform Context)#

Freeform text about the current activity — tool name, message, task summary.

Mon, 01 Jan 0001 00:00:00 +0000

Agent#

In scion, an Agent is an isolated process running an LLM + harness loop against a task. It is the fundamental unit of execution.

Properties#

Each agent has:

Identity: Unique name, container ID, and (in hosted mode) a UUID and URL-safe slug
Home directory: Mounted at /home/<user> — contains harness config, credentials, agent-info.json
Workspace: Mounted at /workspace — a dedicated git worktree or cloned repo
Template: The template blueprint that seeded its configuration
Harness: The LLM-specific adapter (Claude, Gemini, OpenCode, Codex)

State Model#

See agent-state-model for the full three-dimensional state tracking system.

Mon, 01 Jan 0001 00:00:00 +0000

Context Management#

Strategies for managing the limited context window available to LLM agents, especially in multi-skill pipelines where different agent personas must share information efficiently.

Why It Matters#

Context windows are finite. Every skill, instruction, tool definition, and conversation turn consumes tokens. Without management, agents hit limits, lose early instructions, or waste budget on irrelevant context.

Progressive Disclosure#

The agent-skills-standard and claude-code both use this pattern:

Metadata (~100 tokens): Loaded at startup for all skills — just name + description
Instructions (<5000 tokens): Loaded only when skill activated
Resources: Loaded only when specifically needed

claude-code extends this: MCP tool definitions are deferred by default (only names consume context until a tool is used).

Mon, 01 Jan 0001 00:00:00 +0000

Frontier Agent#

A term coined by aws for a new class of AI agents with three defining characteristics:

Autonomous: Direct them towards a goal, and they figure out how to achieve it
Massively scalable: Able to perform multiple concurrent tasks and distribute work across agents
Work independently: Operating for hours or days without intervention

Distinction from Regular Agents#

Regular AI agents can plan and execute multi-step tasks with some autonomy. Frontier agents go further — they’re designed for long-running, independent operation at scale, not just responding to individual prompts or short interactive sessions.

Mon, 01 Jan 0001 00:00:00 +0000

Grove#

In scion, a Grove is a project workspace where agents live. It corresponds to a .scion directory on the filesystem.

Scope#

Project-level: Located at the root of a git repository
Global: In the user’s home folder (~/.scion)

Identification#

Every grove has a unique Grove ID:

Git-backed groves: Deterministic UUID v5 derived from namespace + normalized git URL — same repo always maps to same ID regardless of protocol
Hub-native groves: Random UUID v4

Contents#

A grove contains:

Mon, 01 Jan 0001 00:00:00 +0000

Harness#

In scion, a Harness is an adapter that allows a specific LLM tool to run within the Scion orchestration layer. It handles provisioning, configuration, and execution for that tool inside an OCI container.

Purpose#

The harness ensures that generic Scion commands (start, stop, attach, resume) work consistently regardless of the underlying agent software.

Interface#

Each harness implements (Go):

DiscoverAuth() — Locate credentials on the host
GetEnv() — Map credentials to container environment variables
GetCommand() — Build the correct CLI invocation
Provision() — Harness-specific setup during agent creation
PropagateFiles() — Copy config files into agent home
GetVolumes() — Define volume mounts

Supported Harnesses#

Harness	Target Tool	Notes
`gemini`	gemini-cli	Default harness. API key / OAuth / Vertex AI auth
`claude`	claude-code	Anthropic API key / Vertex AI auth. See claude-code for full capabilities.
`opencode`	opencode	Experimental. No hook support
`codex`	codex	Runs `--full-auto` by default
`generic`	Any CLI tool	Fallback adapter

Capability Matrix#

Capability	Gemini	Claude	OpenCode	Codex
Resume	✅	✅	✅	✅
Resume with Prompt	✅	✅	✅	❌
Hooks	✅	✅	❌	❌
OpenTelemetry	✅	✅	❌	✅
System Prompt Override	✅	✅	❌	❌

Extensibility#

New harnesses can be added via the plugin-system (hashicorp/go-plugin over gRPC) without modifying the core codebase.

Mon, 01 Jan 0001 00:00:00 +0000

Hub#

In scion, the Hub is the central control plane of the hosted (distributed) architecture. It coordinates state across multiple users, groves, and runtime-brokers.

Responsibilities#

Identity & Auth: Manages user identities via OAuth, issues tokens for brokers and agents
State Persistence: Stores definitive state of agents, groves, and templates in a central database (SQLite)
Orchestration: Dispatches agent lifecycle commands to appropriate runtime-brokers
Collaboration: Provides shared view via Web Dashboard and Hub API

Communication with Brokers#

Direct HTTP: When broker has a reachable endpoint
Control Channel (WebSocket tunnel): When broker is behind NAT/firewall

Agent Creation Flow (Hosted)#

CLI syncs grove with Hub
POST /api/v1/groves/{groveId}/agents
Hub selects a runtime-broker
Merges environment variables and secrets from all scopes (user → grove → broker)
Resolves template with content hash for broker-side caching
Dispatches to broker
Broker provisions and starts agent
Status reported back via heartbeats

Kiro Powers#

Specialized packages that enhance existing kiro agents with prebuilt expertise for specific development tasks.

Contents#

Curated MCP servers
Steering files
Hooks
Can be dynamically loaded on demand

Purpose#

Focus on providing domain-specific knowledge and best practices. Distinct from the kiro autonomous agent — Powers enhance agents with expertise, while the autonomous agent is the execution engine that works independently.

LLM Wiki Pattern#

A pattern for building personal knowledge bases where the LLM incrementally builds and maintains a persistent, interlinked wiki from raw sources. Proposed by andrej-karpathy. This wiki is a running instance of this pattern.

Core Insight#

Wiki > RAG. RAG rediscovers knowledge from scratch on every query — no accumulation. An LLM-maintained wiki compiles knowledge once and keeps it current. Cross-references, contradictions, and synthesis compound with every source added.

Mon, 01 Jan 0001 00:00:00 +0000

MCP (Model Context Protocol)#

An open protocol that standardizes how applications provide context and tools to LLMs. Enables communication between AI agents and external services.

How Claude Code Uses MCP#

claude-code uses MCP as its primary tool integration mechanism:

Connect to external data sources (Google Drive, Jira, Slack, custom tooling)
Discover and install prebuilt plugins
Create custom plugins
Tool definitions deferred by default — only tool names consume context until Claude uses a specific tool (tool search)

Usage Across the Ecosystem#

claude-code: Primary tool integration. Supports prebuilt and custom plugins.
kiro: kiro-powers contain curated MCP servers as part of expertise packages.
scion: Not directly mentioned, but the plugin-system (hashicorp/go-plugin over gRPC) serves a similar role.

Significance#

MCP is emerging as a shared standard across the coding agent ecosystem. Both Anthropic (Claude Code) and AWS (Kiro) use it, making it a potential interoperability layer between different agent tools. The agent-skills-standard (agentskills.io) is a complementary open standard: skills teach agents how to do things; MCP connects agents to external tools and data.

Mon, 01 Jan 0001 00:00:00 +0000

Multi-Agent Orchestration#

The practice of coordinating multiple LLM-based agents to work on tasks concurrently, with isolation, specialization, and collaboration.

Three Approaches Emerging#

Infrastructure-first: Scion#

scion positions itself as a “hypervisor for agents” — providing the infrastructure layer (containers, isolation, lifecycle management) while treating higher-level concerns as orthogonal. Harness-agnostic. Emphasizes human interaction as imperative.

Product-first: Kiro Autonomous Agent#

kiro’s autonomous agent is an opinionated product — a frontier-agent that handles the full stack from task intake to PR creation. Coordinates specialized sub-agents internally. Emphasizes autonomy and independence.

Mon, 01 Jan 0001 00:00:00 +0000

Plugin System#

scion supports a plugin architecture built on hashicorp/go-plugin for extending system capabilities. Plugins communicate over gRPC.

Plugin Types#

Message Broker Plugins: Custom message delivery backends for agent notifications and structured messaging
Agent Harness Plugins: Custom harness implementations that integrate new LLM tools without modifying the core codebase

Status#

Currently in foundational stage, with reference implementations available for both plugin types.

Prompt Engineering Patterns#

Structured techniques for crafting prompts within agentic skills, drawn from research and the ten-pillars-agentic-skill-design framework.

Chain of Thought (CoT)#

Structure prompts to encourage step-by-step reasoning. Break complex tasks into numbered steps. Wei et al. (2022).

ReAct Pattern#

Integrate reasoning and acting in a loop (Yao et al., 2023):

Thought: [reasoning about what to do]
Action: [specific action to take]
Observation: [result of the action]
... (repeat until task complete)

Self-Reflection (Reflexion)#

Enable agents to learn from mistakes (Shinn & Labash, 2023):

Mon, 01 Jan 0001 00:00:00 +0000

Runtime Broker#

In scion, a Runtime Broker is a compute node that registers with a hub to provide execution capacity for agents.

Responsibilities#

Manages local lifecycle of agents dispatched from the Hub
Handles workspace synchronization
Template hydration (with content hash caching)
Log streaming
Reports status back via heartbeats and agent status updates

Communication#

Connects to the Hub via:

Direct HTTP — when broker has a reachable endpoint
WebSocket Control Channel — when behind NAT/firewall

Examples of Broker Nodes#

A server, laptop, or Kubernetes cluster — any machine that can run containers.

Mon, 01 Jan 0001 00:00:00 +0000

Runtime#

In scion, the Runtime is the infrastructure layer responsible for executing agent containers. Scion abstracts container execution behind a common interface.

Supported Runtimes#

Runtime	Platform	Notes
Docker	Linux / macOS / Windows	Default fallback. Supports remote Docker hosts
Podman	Linux / macOS	Daemonless, rootless alternative
Apple Container	macOS	Native Virtualization Framework, improved performance
Kubernetes	Any (via kubeconfig)	Agents as Pods. Namespace isolation, resource specs, workspace sync via tar snapshots

Runtime Selection#

Resolved by GetRuntime factory function:

Mon, 01 Jan 0001 00:00:00 +0000

Skill Evaluation#

Moving from “it feels better” to “I have proof” when measuring AI agent skill quality.

The Problem#

LLM agents are non-deterministic. Manual testing captures one sample from a distribution. “Vibes-based” evaluation misses regressions, false positives, and edge cases. As Karpathy noted: “The eval is often harder than the task itself.”

Three-Tier Framework#

Tier	Method	Cost	Frequency	Catches
1	Deterministic graders	~$0	Every commit	Command execution, file existence, sequence, format
2	LLM-as-judge	$0.01–0.20/eval	PRs, nightly	Code quality, conventions, readability (rubric-based)
3	Human review	$0.50–5.00/eval	Sparingly	Calibration, edge cases, high-stakes decisions

“The best eval is one that actually gets run.” — Anthropic

Mon, 01 Jan 0001 00:00:00 +0000

Template#

In scion, a Template is a versioned blueprint for creating an agent. It defines the base configuration, system prompt, tools, and initial state.

Contents#

home/ directory tree — copied into the agent’s home
scion-agent.json (or .yaml) — specifies harness type, env vars, volumes, command args, model overrides, container image, resource requirements

Inheritance#

Templates support inheritance via a base field. Scion walks the chain and merges configurations bottom-up (base first, then overrides).

Scopes#

Project-level: .scion/templates/
Global: ~/.scion/templates/
Hosted: Can be scoped as global, grove, or user, with visibility controls (private, grove, public)

Management#

scion templates create
scion templates clone
scion templates list
scion templates show
scion templates update-default

Defaults#

Scion ships default templates for each supported harness: gemini, claude, opencode, codex. Users can create custom templates for specialized roles (e.g., “Security Auditor”, “React Specialist”).

Mon, 01 Jan 0001 00:00:00 +0000

Andrej Karpathy#

AI researcher and educator. Former Director of AI at Tesla, founding member of OpenAI. Known for neural network education (cs231n, Zero to Hero series) and practical AI tooling.

Relevant Work#

llm-wiki-karpathy — The LLM Wiki pattern: using LLMs to incrementally build and maintain personal knowledge bases. The foundational idea behind this wiki.

Anthropic#

AI safety company. Builds the Claude family of models and claude-code.

claude-code — Agentic coding tool (terminal, IDE, web, Slack, GitHub)

AWS#

Amazon Web Services. Cloud computing platform by Amazon.

kiro — Agentic IDE with autonomous agent capabilities
Defines the concept of frontier-agents

BMad Method#

AI-driven agile development framework with scale-adaptive intelligence. “Build More Architect Dreams.” MIT licensed.

“Traditional AI tools do the thinking for you. BMad agents guide you through a structured process to bring out your best thinking.”

Core Features#

12+ specialized agent personas: PM, Architect, Developer, UX, and more
34+ structured workflows: Grounded in agile best practices
Scale-Domain-Adaptive: Adjusts planning depth from bug fixes to enterprise systems
Party Mode: Multiple agent personas collaborate in one session
bmad-help: Context-aware guidance on what’s next
Complete lifecycle: Brainstorming → analysis → architecture → implementation → deployment

Modular Ecosystem#

Module	Purpose
BMM (core)	34+ workflows, 12+ agents
BMad Builder	Create custom agents and workflows
Test Architect (TEA)	Risk-based test strategy
Game Dev Studio	Unity, Unreal, Godot workflows
Creative Intelligence Suite	Innovation, brainstorming, design thinking

In the Ecosystem#

BMad operates at the methodology layer alongside spec-kit, but with a different philosophy:

Mon, 01 Jan 0001 00:00:00 +0000

Claude Code#

An agentic coding tool by anthropic that lives in your terminal, understands your codebase, and helps you code faster through natural language commands.

Architecture#

Claude Code is the “agentic harness” around Claude models. It provides tools, context management, and execution environment that turn a language model into a coding agent.

The Agentic Loop#

Three phases: gather context → take action → verify results, chained together with course-correction. Claude decides what each step requires based on what it learned from the previous step.

Mon, 01 Jan 0001 00:00:00 +0000

Daniel Miessler#

Security researcher and AI practitioner. Creator of fabric and pai.

Relevant Work#

fabric — Open-source framework with 251+ curated AI prompt patterns
pai — Personal AI Infrastructure built on claude-code. Persistent memory, skills, goals, continuous learning.

The two projects are complementary: Fabric = patterns (what to ask AI). PAI = infrastructure (how the AI operates).

Fabric#

An open-source framework (Go, MIT) for augmenting humans using AI. Created by daniel-miessler.

Mission: “human flourishing via AI augmentation”

Core Concept: Patterns#

Patterns are curated, well-structured prompts organized by real-world task. 251+ patterns covering:

Content extraction (YouTube, podcasts, articles)
Writing (essays, social media, documentation)
Analysis (code, claims, debates, incidents, logs)
Creation (art prompts, concept maps, changelogs)
And many more

Each pattern is a directory with a system.md file. Markdown-based, clear instructions, System section focused.

Mon, 01 Jan 0001 00:00:00 +0000

Google Cloud Platform#

Cloud computing platform by Google. Relevant to this wiki as the organization behind scion.

scion — Multi-agent orchestration testbed (hosted on GCP’s GitHub)

Kiro#

Kiro is an agentic IDE by aws for software development. It has three main surfaces:

Kiro IDE#

Interactive, synchronous collaboration on your local machine. Pair programming, suggestions, real-time code iteration. Spec-driven development.

Kiro CLI#

Custom agents as configuration files that customize Kiro’s behavior for specific workflows. Define tool access, permissions, and context. Pre-approve tools, reduce interruptions, optimize for specific tasks. Interactive, runs on your local machine.

Kiro Autonomous Agent#

A frontier-agent that works asynchronously in the background on complex, multi-step development tasks. Key capabilities:

Mon, 01 Jan 0001 00:00:00 +0000

NotebookLM#

A Gemini-powered writing and research tool from Google Labs. Co-founded by steven-johnson. Designed for AI-assisted knowledge work grounded in user-curated sources.

Core Design#

Users upload sources (documents, articles, notes) — the AI only reasons over those sources, not the open web
Two note types: Written Notes (user-authored, editable) and Saved Responses (AI-generated, immutable)
Provenance tracking built in: clear separation between human-written and AI-generated content
Source-integrated reading with AI actions (summarize, explain, find related ideas)

Relationship to the Wiki Landscape#

NotebookLM occupies a different point in the AI knowledge management space than the tools tracked elsewhere in this wiki:

Mon, 01 Jan 0001 00:00:00 +0000

PAI (Personal AI Infrastructure)#

An open-source personalized AI platform by daniel-miessler, built natively on claude-code. Turns Claude Code from a stateless tool into a persistent assistant that knows your goals, preferences, and history.

Mission: “AI should magnify everyone — not just the top 1%.”

Three Levels of AI#

Chatbots: Ask → Answer → Forget
Agentic Platforms: Ask → Use tools → Get result
PAI: Observe → Think → Plan → Execute → Verify → Learn → Improve

The key differentiator is the learn step.

Mon, 01 Jan 0001 00:00:00 +0000

Paperclip#

Open-source orchestration for zero-human companies. Node.js server + React UI. MIT licensed.

“If OpenClaw is an employee, Paperclip is the company.”

What It Does#

Orchestrates teams of AI agents into companies with:

Org charts: Hierarchies, roles, reporting lines
Goal alignment: Every task traces to company mission
Budgets: Monthly per-agent, atomic enforcement
Governance: Approval gates, rollback, audit logs
Heartbeats: Scheduled agent wake cycles
Ticket system: Every conversation traced, every decision explained
Multi-company: One deployment, many isolated companies

Agent Support#

Agent-agnostic: Claude Code, Codex, Cursor, OpenClaw, Bash, HTTP. “If it can receive a heartbeat, it’s hired.”

Mon, 01 Jan 0001 00:00:00 +0000

Promptfoo#

Open-source CLI for LLM evaluation and red teaming. Now part of OpenAI. MIT licensed.

YAML-based test cases, CI/CD integration, model comparison, red teaming. Runs locally — prompts never leave your machine. Powers apps serving 10M+ users.

The closest existing tool to a turnkey skill eval pipeline.

Scion#

Scion is an experimental multi-agent orchestration testbed by google-cloud-platform. It manages concurrent LLM-based agents running in containers across local machines and remote clusters.

What It Is#

A “hypervisor for agents” — infrastructure for running, isolating, and managing LLM agent processes. It is explicitly not a full multi-agent framework. Components like agent memory, chatrooms, and task management are treated as orthogonal concerns.

Architecture#

Scion follows a Manager-Worker pattern:

CLI (scion): Host-side orchestrator managing agent lifecycle and groves
Agents: Isolated containers running LLM software via harness adapters

Two operating modes:

Mon, 01 Jan 0001 00:00:00 +0000

Spec Kit#

Open-source toolkit by GitHub for Spec-Driven Development. CLI (specify) + slash commands across 30+ AI agents. MIT licensed.

“Build high-quality software faster. Focus on product scenarios and predictable outcomes instead of vibe coding.”

Core Workflow#

/speckit.constitution → Establish project principles
/speckit.specify → Define requirements (the "what" and "why")
/speckit.clarify → Structured questioning to reduce rework
/speckit.plan → Technical implementation plan (the "how")
/speckit.tasks → Actionable task breakdown with dependencies
/speckit.implement → Execute all tasks

Optional: /speckit.analyze (consistency check), /speckit.checklist (quality validation)

Mon, 01 Jan 0001 00:00:00 +0000

Steven Johnson#

Editorial Director and Co-Founder of notebooklm. Author of 14 books (latest: The Infernal Machine). Writes how-to guides and thought pieces on using NotebookLM for research and writing workflows.

Role in the Wiki#

Johnson is the primary voice explaining NotebookLM’s design philosophy — particularly the “pin → organize → structure” workflow and the deliberate separation of human-written vs. AI-generated notes for provenance tracking.

strAIght talk: AI Tips for Amazonians (Podcast Notes)#

Original | Raw

Summary#

Notes from the “strAIght talk” podcast — real Amazon employees sharing tactics they actually use with AI at work. Execution-focused, no theory. The core thesis: AI isn’t just a tool, it’s a workflow replacement layer.

Key Takeaways#

One prompt can replace entire workflows: People are collapsing tools (Slack, email, docs, planning) into a single AI interface. Tasks that took hours → minutes. The shift: from tool-driven work to prompt-driven work where AI orchestrates everything.
The “daily prompt” is the real leverage: High performers build a repeatable daily prompt anchored around priorities, constraints, and context. Acts like “a lightweight operating system for your day.”
Context beats clever prompting: Best results come not from smarter prompts but from feeding AI your role, goals, and constraints. Maintaining a “context document” is a recurring technique across episodes.
AI as thinking partner, not just executor: Ask AI questions, let AI ask you questions back. This is where the real leverage happens — better decisions, not just faster output.
Treat prompts like code: Version them, refine them, reuse them. One guest needed ~20 iterations to get a reliable research framework.

Three Core Moves#

Build a personal AI operating loop: Morning (what matters today?) → During (execute + refine) → End (summarize + improve)
Stop using AI like search: Delegate thinking loops to AI, not just tasks
Treat prompts like code: Version, refine, reuse

Connections#

pai: PAI’s TELOS system (MISSION.md, GOALS.md, etc.) is exactly the “context document” technique this podcast describes — persistent context about who you are and what you’re working toward. PAI’s “observe → think → plan → execute → verify → learn → improve” loop mirrors the “personal AI operating loop.”
fabric: Fabric’s Patterns are the “treat prompts like code” idea at scale — 251+ versioned, reusable prompts organized by task. The podcast’s advice to version and refine prompts is what Fabric systematizes.
llm-wiki-pattern: The wiki pattern is a form of the “context document” technique — persistent, compounding context that the AI reads at the start of every interaction.
context-management: The “context beats clever prompting” insight validates the progressive disclosure approach — what matters is getting the right context loaded, not crafting the perfect prompt.
prompt-engineering-patterns: The “daily prompt” template is a concrete instance of the structured prompting patterns (system message with role, constraints, success criteria).

Anthropic: Define Success Criteria and Build Evaluations#

Original | Raw

Summary#

anthropic’s canonical guide to building evaluations for LLM applications. Establishes the methodology that the Caparas article and the broader eval ecosystem builds on. Covers success criteria design, eval types (exact match, cosine similarity, LLM-graded), and the principle that automated volume beats hand-graded quality.

Key Takeaways#

Success criteria must be SMART: Specific, Measurable, Achievable, Relevant. “The model should classify sentiments well” is bad. “F1 score of at least 0.85 on 10,000 diverse tweets” is good.
Common criteria dimensions: Task fidelity, consistency, relevance/coherence, tone/style, privacy preservation, context utilization, latency, price. Most use cases need multidimensional evaluation.
Three eval design principles: (1) Be task-specific — mirror real-world distribution including edge cases. (2) Automate when possible — structure for automated grading. (3) Prioritize volume over quality — more questions with automated grading beats fewer with human grading.
Eval types by complexity:
- Exact match: Binary correct/incorrect. Best for categorical tasks.
- Cosine similarity: Semantic similarity between embeddings. Best for consistency testing.
- LLM-as-judge: Use a model to grade another model’s output against a rubric.
Even “hazy” topics can be quantified: Ethics and safety can be measured — e.g., “less than 0.1% of outputs flagged for toxicity out of 10,000 trials.”
Edge cases matter: Irrelevant input, overly long input, harmful user input, ambiguous cases where even humans disagree.

Connections#

skill-evaluation: This guide provides the foundational methodology. The three-tier framework (deterministic → LLM-judge → human) from evaluating-agent-skills-caparas is a direct application of these principles to skills specifically.
how-to-eval-a-skill: Our practical guide extends Anthropic’s prompt eval methodology to the five surfaces of skill evaluation (routing, tool selection, process, side effects, output quality).
ten-pillars-agentic-skill-design: Pillar 7 (Testing and Validation) is operationalized by this guide’s methodology.

Anthropic Skills Repository & Agent Skills Spec#

Original | Spec | Raw

Summary#

anthropic’s official skills repository for Claude, plus the agent-skills-standard specification. Skills are folders of instructions, scripts, and resources that Claude loads dynamically. The repo contains 17 skills across creative, technical, enterprise, and document categories — including the production document skills that power Claude.ai’s file capabilities.

Key Takeaways#

Agent Skills is an open standard: Defined at agentskills.io, not Anthropic-proprietary. Designed for cross-tool compatibility. claude-code extends it with invocation control, subagent execution, and dynamic context injection.
Progressive disclosure is the core design principle: Metadata (~100 tokens) loaded at startup for all skills. Full instructions (<5000 tokens recommended) loaded only when activated. Resources loaded only when needed. This is how skills scale without blowing up context.
SKILL.md is the only required file: YAML frontmatter (name + description) + markdown instructions. Optional: scripts/, references/, assets/ directories. Keep under 500 lines.
Document skills are production code: The docx, pdf, pptx, xlsx skills power Claude.ai’s document capabilities. Source-available (not open source) — shared as reference for complex skill patterns.
Plugin marketplace for Claude Code: /plugin marketplace add anthropics/skills registers the repo. Skills installable individually.
Partner ecosystem emerging: Notion has published skills for Claude. The allowed-tools field (experimental) hints at future tool-level permission control per skill.

BMad Method: AI-Driven Agile Development#

Original | Docs | Raw

Summary#

bmad-method is an AI-driven agile development framework with scale-adaptive intelligence — it adjusts planning depth from bug fixes to enterprise systems. Uses 12+ specialized agent personas (PM, Architect, Developer, UX, etc.) with structured workflows grounded in agile best practices. Modular ecosystem with official modules for testing, game dev, creative intelligence, and custom agent building. MIT licensed.

Key Takeaways#

Agents as expert collaborators, not autopilots: “Traditional AI tools do the thinking for you, producing average results. BMad agents act as expert collaborators who guide you through a structured process to bring out your best thinking.” This is a fundamentally different philosophy from kiro’s autonomous frontier agents.
Scale-Domain-Adaptive: Automatically adjusts planning depth based on project complexity. A bug fix doesn’t need the same ceremony as an enterprise system. This addresses the “sometimes agents aren’t needed” finding from ten-pillars-agentic-skill-design.
12+ specialized agent personas: PM, Architect, Developer, UX, and more. Each is a domain expert with a specific role. Related to skills-pipeline-sleestk’s persona-driven skills and pai’s agent personalities.
Party Mode: Bring multiple agent personas into one session to collaborate and discuss. A novel approach to multi-agent interaction — not parallel execution (like scion) but collaborative dialogue within a single session.
Modular ecosystem: Core framework (BMM, 34+ workflows) + official modules: BMad Builder (create custom agents), Test Architect (risk-based testing), Game Dev Studio, Creative Intelligence Suite. Extensible like agent-skills-standard but at the methodology level.
bmad-help skill: Invoke anytime for guidance on what’s next. Context-aware — knows your project state and installed modules. This is the “AI as thinking partner” pattern from ai-technique-podcast.
Complete lifecycle: Brainstorming → analysis → architecture → implementation → deployment. Broader than spec-kit’s spec-focused workflow.
Structured workflows over prompts: 34+ workflows grounded in agile best practices. Not just prompt templates — structured processes with phases, checkpoints, and handoffs.

Comparison with Spec Kit#

	spec-kit	BMad Method
Focus	Spec-driven (specs as primary artifact)	Agile-driven (structured workflows as primary process)
Agents	Agent-agnostic (30+ supported)	12+ specialized personas built in
Scaling	Same workflow for all projects	Scale-adaptive (adjusts depth to complexity)
Collaboration	Single agent executes steps	Party Mode (multi-persona dialogue)
Extensions	50+ community extensions	Official modules (testing, game dev, creative)
Philosophy	“Specs before code”	“Expert collaboration over autopilot”

Both operate at the methodology layer. Spec Kit is more prescriptive (seven fixed steps). BMad is more adaptive (adjusts to project scale).

Mon, 01 Jan 0001 00:00:00 +0000

Claude Code Documentation#

Original | GitHub | Raw

Summary#

Comprehensive documentation for claude-code, anthropic’s agentic coding tool. Terminal-native CLI that understands your codebase, edits files, runs commands, and handles git workflows through natural language. Deeply extensible via mcp-protocol, plugins, skills, hooks, and custom subagents.

Key Takeaways#

Agentic loop architecture: Three phases — gather context, take action, verify results — chained together with course-correction. Claude Code is the “agentic harness” that provides tools, context management, and execution environment around the Claude model.
Five tool categories: File operations, search, execution (shell), web, and code intelligence. This is the foundation; everything else extends it.
Dual memory system: CLAUDE.md (human-written instructions, scoped to org/project/user/local) + auto memory (Claude-written learnings, per working tree). Both loaded at session start. Target under 200 lines per CLAUDE.md.
Six permission modes: From default (reads only) to bypassPermissions (everything). auto mode uses a separate classifier model to review actions — a research preview. This is the trust/autonomy dial.
Subagents with isolation: Each subagent gets its own context window, system prompt, tool access, and permissions. Built-in: Explore (Haiku, read-only), Plan (research), General-purpose (all tools). Custom subagents via markdown files. Cannot spawn sub-subagents.
Skills as open standard: Follows Agent Skills (agentskills.io). SKILL.md files with YAML frontmatter. Bundled skills include /batch (parallel changes across codebase in git worktrees), /simplify, /loop, /debug.
25+ hook events: Shell commands, HTTP endpoints, or LLM prompts at lifecycle points. Can block actions (PreToolUse deny), inject context, automate workflows. Extremely granular.
Three execution environments: Local (your machine), Cloud (Anthropic VMs), Remote Control (local machine, browser UI).
Session portability: Move between terminal, desktop, web, mobile, Slack. Same CLAUDE.md and MCP servers everywhere. Resume, fork, parallel sessions via git worktrees.
AGENTS.md compatibility: Claude Code reads CLAUDE.md, not AGENTS.md, but can import AGENTS.md for cross-tool compatibility.

How to Evaluate AI Agent Skills Without Relying on Vibes#

Original | Raw

Author: JP Caparas (building on OpenAI’s “Testing Agent Skills Systematically with Evals”)

Summary#

A practical guide to moving from “it feels better” to “I have proof” when evaluating AI agent skills. Proposes a three-tier evaluation framework (deterministic → LLM-as-judge → human review) with concrete economics, and argues that the industry’s convergence on JSON Schema skill formats makes these principles platform-agnostic.

Mon, 01 Jan 0001 00:00:00 +0000

Fabric GitHub Repository#

Original | Raw

Author: daniel-miessler

Summary#

fabric is an open-source framework (Go, MIT) for augmenting humans using AI. Its core contribution is Patterns — 251+ curated, well-structured prompts organized by real-world task. Also implements composable prompt strategies (CoT, ToT, Reflexion, etc.) as modifiers on top of patterns. Model-agnostic with 30+ provider integrations.

Key Takeaways#

Patterns as the fundamental unit: Fabric’s insight is that AI has a capabilities problem but an integration problem. Patterns solve this by packaging prompts as reusable, discoverable, shareable units organized by task. 251+ patterns covering everything from YouTube extraction to academic paper summarization.
Pattern design principles: Markdown for readability, extremely clear instructions, System section almost exclusively. Each pattern is a directory with a system.md file.
Composable strategies: Nine prompt strategies (CoT, CoD, ToT, AoT, LtM, self-consistent, self-refine, reflexion, standard) applied as modifiers on top of any pattern. Stored as JSON. This separates what to do (pattern) from how to reason (strategy).
CLI-first, Unix philosophy: Pipe input, compose with other tools. echo "input" | fabric --strategy cot -p analyze_code. Shell aliases turn each pattern into a command.
Model-agnostic: 30+ providers. Per-pattern model mapping via environment variables.
Obsidian integration: Save output as dated markdown files — directly relevant to the llm-wiki-pattern.
Community-driven: Open source pattern library. The “wisdom of crowds” approach to prompt curation.

Connections#

Patterns are conceptually similar to skills in the agent-skills-standard — both are curated prompt packages organized by task. But patterns are simpler (just a system.md) while skills support scripts, references, assets, and progressive disclosure.
Fabric’s strategies map directly to the prompt-engineering-patterns described in academic literature and the ten-pillars-agentic-skill-design framework (Pillar 5).
Fabric is model-agnostic like scion is harness-agnostic — both abstract over the underlying AI provider.

Kiro Autonomous Agent Page#

Original | Raw

Summary#

Product page and FAQ for kiro’s autonomous agent feature — a frontier-agent that works independently on development tasks in the background. It takes high-level task descriptions, plans implementation, writes code across multiple repositories, runs tests, and creates pull requests, all asynchronously without requiring an active session.

Key Takeaways#

Asynchronous by design: Unlike the Kiro IDE (interactive pair programming) and Kiro CLI (local custom agents), the autonomous agent runs in isolated sandbox environments in the background. You assign tasks from kiro.dev or GitHub.
Multi-repo coordination: Can plan a change once and create coordinated edits and PRs across multiple repositories — not just one repo at a time.
Learns from feedback: Maintains persistent context across tasks, repos, and PRs. Uses code review feedback to shape future changes. This is a key differentiator from stateless agent approaches.
Never auto-merges: Always creates PRs for human review. Safety-first approach.
Sub-agent coordination: Coordinates specialized sub-agents to complete complex development work.
Team features: Integrates with Jira, Confluence, GitLab, GitHub, Teams, Slack. Handles routine fixes and follow-ups to protect engineer focus time.
“Frontier agent” branding: aws positions this as a new class of agent — autonomous, massively scalable, works independently for hours or days.
Preview status: Rolling out to Pro/Pro+/Power users. Free during preview with weekly limits.

Scope#

Product marketing page with FAQ. No deep technical architecture details — this is the public-facing description of capabilities and positioning.

Mon, 01 Jan 0001 00:00:00 +0000

LLM Wiki (Karpathy)#

Original | Raw

Author: andrej-karpathy

Summary#

The foundational idea file for the LLM Wiki pattern — the methodology this entire wiki is built on. Proposes that instead of RAG (re-deriving knowledge on every query), LLMs should incrementally build and maintain a persistent, interlinked wiki from raw sources. The wiki is a compounding artifact: cross-references already exist, contradictions are flagged, synthesis reflects everything ingested.

Key Ideas#

Wiki > RAG: RAG rediscovers knowledge from scratch every query. A wiki compiles knowledge once and keeps it current. The synthesis compounds.
Three-layer architecture: Raw sources (immutable) → Wiki (LLM-maintained markdown) → Schema (CLAUDE.md/AGENTS.md defining conventions and workflows).
Three operations: Ingest (process source → update 10-15 pages), Query (search index → synthesize answer → optionally file back), Lint (health-check for contradictions, orphans, gaps).
Index + Log: index.md is content-oriented (catalog for navigation), log.md is chronological (append-only timeline). Index-first navigation works at moderate scale without embedding infrastructure.
Human role vs. LLM role: Human curates sources, directs analysis, asks questions, thinks about meaning. LLM does everything else — summarizing, cross-referencing, filing, bookkeeping.
Why wikis fail and this doesn’t: Humans abandon wikis because maintenance burden grows faster than value. LLMs don’t get bored, don’t forget cross-references, can touch 15 files in one pass. Maintenance cost ≈ zero.
Memex lineage: Related to Vannevar Bush’s Memex (1945) — personal knowledge store with associative trails. Bush couldn’t solve who does the maintenance. The LLM handles that.

Meta: This Wiki’s Relationship to This Source#

This wiki is a direct instantiation of Karpathy’s LLM Wiki pattern. Our CLAUDE.md schema implements the three-layer architecture. Our ingest/query/lint workflows follow the operations described. Our index.md and log.md serve exactly the roles specified. The pattern is the blueprint; this wiki is a running instance.

Mon, 01 Jan 0001 00:00:00 +0000

Getting The Most Out Of Notes In NotebookLM#

Original | Raw

A how-to guide by steven-johnson, co-founder and editorial director of notebooklm, published March 2024. Part of a series on using the Gemini-powered writing and research tool from Google Labs.

Key Takeaways#

Two note types with clear provenance: Written Notes (user-authored, editable) and Saved Responses (AI-generated or source quotes, immutable). The immutability of Saved Responses is a deliberate design choice for tracking what the human wrote vs. what the AI wrote.
Pin-then-organize workflow: Have an exploratory conversation with the AI grounded in your sources, pin interesting responses as you go, then revisit and structure later. Brainstorming first, organizing second.
Reading-integrated note-taking: Selecting text in a source triggers AI actions — Add to Note, Summarize to Note, Help Me Understand, Suggest Related Ideas. The AI enhances the reading process, not just the writing process.
5,000-word note query limit: Selected notes exceeding 5,000 words cannot be queried via the chatbox. Workaround: combine all notes into one, then re-add as a source — converting notes into a queryable source.
Cross-notebook portability: The combine-and-copy pattern lets you move note collections between notebooks by pasting combined notes as a new source.

Design Philosophy#

NotebookLM’s notes system embodies a specific stance on AI-assisted knowledge work: the AI is grounded in user-curated sources (not the open web), conversations are exploratory, and the user captures what matters. This is closer to the llm-wiki-pattern’s “human curates, LLM processes” division of labor than to a general-purpose chatbot.

Mon, 01 Jan 0001 00:00:00 +0000

Paperclip: Open-Source Orchestration for Zero-Human Companies#

Original | Raw

Summary#

paperclip is an open-source Node.js server + React UI that orchestrates a team of AI agents to run a business. Not an agent framework — it’s the company layer above agents. Org charts, budgets, governance, goal alignment, and agent coordination. “If OpenClaw is an employee, Paperclip is the company.” MIT licensed.

Key Takeaways#

Company-level orchestration, not agent-level: Paperclip doesn’t build agents — it organizes them into companies with org charts, roles, reporting lines, budgets, and governance. This is a layer above scion (infrastructure), claude-code (tool), and kiro (product).
Bring your own agent: Works with Claude Code, Codex, Cursor, OpenClaw, Bash, HTTP. “If it can receive a heartbeat, it’s hired.” Agent-agnostic like scion, but at the business layer instead of the infrastructure layer.
Goal-aware execution: Every task traces back to the company mission. Tasks carry full goal ancestry so agents see the “why,” not just a title. This is pai’s TELOS concept (goal orientation) applied to multi-agent companies.
Heartbeat-based scheduling: Agents wake on a schedule, check work, and act. Delegation flows up and down the org chart. Similar to claude-code’s scheduled tasks but with organizational hierarchy.
Cost control as first-class: Monthly budgets per agent. Atomic budget enforcement — when they hit the limit, they stop. No runaway costs. Addresses the evaluation economics gap from evaluating-agent-skills-caparas.
Governance with rollback: Approval gates enforced, config changes revisioned, bad changes rolled back. “You’re the board.” This is the human-in-the-loop pattern from ten-pillars-agentic-skill-design (Pillar 6) at the organizational level.
Persistent agent state: Agents resume the same task context across heartbeats instead of restarting from scratch. Addresses the memory/context persistence challenge from context-management.
Runtime skill injection: Agents learn Paperclip workflows and project context at runtime. Related to agent-skills-standard’s progressive disclosure.
Clipmart (coming soon): Download and run entire companies with one click. Pre-built company templates — full org structures, agent configs, and skills. This is the marketplace concept from agent-skills-standard applied to entire organizations.
Multi-company isolation: One deployment, many companies. Complete data isolation. One control plane for a portfolio.

What Paperclip Is NOT#

Not a chatbot (agents have jobs, not chat windows)
Not an agent framework (doesn’t tell you how to build agents)
Not a workflow builder (no drag-and-drop pipelines)
Not a prompt manager (agents bring their own prompts)
Not a single-agent tool (this is for teams of 20+ agents)
Not a code review tool (orchestrates work, not PRs)

Connections#

scion: Both are multi-agent orchestration, but at different layers. Scion = infrastructure (containers, runtimes, harnesses). Paperclip = business (org charts, budgets, goals, governance). Paperclip could theoretically use Scion as its runtime layer.
multi-agent-orchestration: Paperclip introduces a fourth approach — company-level orchestration — beyond the infrastructure (Scion), product (Kiro), and tool (Claude Code) approaches already in the wiki.
pai: Both are goal-oriented. PAI’s TELOS (mission, goals, projects) maps to Paperclip’s company mission → project goals → task hierarchy. But PAI is personal; Paperclip is organizational.
context-management: Persistent agent state across heartbeats + goal ancestry flowing down = a form of hierarchical context management.
agent-skills-standard: Runtime skill injection + Clipmart (company templates) extends the skill concept to organizational blueprints.

Personal AI Infrastructure (PAI)#

Original | Raw

Author: daniel-miessler

Summary#

PAI is a personalized AI platform built natively on claude-code, designed to turn it from a stateless tool into a persistent assistant that knows your goals, preferences, and history. It adds memory, skills, hooks, security, voice, and a “TELOS” goal system on top of Claude Code’s primitives. Open source (MIT), TypeScript/Bash. v4.0.3 as of March 2026.

Key Takeaways#

Three levels of AI: Chatbots (ask→answer→forget) → Agentic platforms (ask→use tools→get result) → PAI (observe→think→plan→execute→verify→learn→improve). The key differentiator is the learn step — continuous feedback capture and self-improvement.
Goal-oriented, not task-oriented: PAI’s primary focus is the human and their goals, not the tech. TELOS system: 10 files capturing who you are (MISSION.md, GOALS.md, PROJECTS.md, BELIEFS.md, MODELS.md, STRATEGIES.md, NARRATIVES.md, LEARNED.md, CHALLENGES.md, IDEAS.md).
Built on Claude Code, not replacing it: “Claude Code is the engine. PAI is everything else that makes it your car.” Uses Claude Code’s hooks, slash commands, MCP servers, context files as building blocks.
16 design principles: User centricity, foundational algorithm (scientific method loop), scaffolding > model, deterministic infrastructure, code before prompts, UNIX philosophy, CLI as interface, skill management, memory system, agent personalities, and more.
Primitives (architecture):
- TELOS: Deep goal understanding via 10 structured files
- User/System separation: USER/ (your stuff, upgrade-safe) vs SYSTEM/ (PAI infra)
- Skill system: Deterministic outcomes first — CODE → CLI → PROMPT → SKILL hierarchy
- Memory system: Three-tier (hot/warm/cold), continuous learning from ratings/sentiment/outcomes
- Hook system: 8 event types for lifecycle automation
- Security system: Policy-based, no need for --dangerously-skip-permissions
- Voice system: ElevenLabs TTS with prosody enhancement
- Notification system: ntfy push, Discord, duration-aware routing
Packs: Standalone, AI-installable capability modules (Research, Security, Thinking, Media, etc.) that work without full PAI installation.
Scale: v4.0.0 has 63 skills, 21 hooks, 180 workflows, 14 agents.
Relationship to Fabric: “Fabric is a collection of AI prompts (patterns) for specific tasks. PAI is infrastructure for how your DA operates. They’re complementary.”

Connections#

llm-wiki-pattern: PAI’s memory system and TELOS are a different instantiation of the same insight — persistent, compounding knowledge. The wiki pattern compiles knowledge from external sources; PAI compiles knowledge about you.
claude-code: PAI is the most ambitious layer built on top of Claude Code’s primitives. It validates that Claude Code’s hooks, skills, and memory are sufficient building blocks for a full personal AI platform.
fabric: Same author. Fabric = patterns (what to ask). PAI = infrastructure (how the AI operates). Complementary. Many PAI users integrate Fabric patterns.
agent-skills-standard: PAI’s skill system follows a CODE → CLI → PROMPT → SKILL hierarchy — more opinionated than the Agent Skills spec, with deterministic outcomes prioritized.
ten-pillars-agentic-skill-design: PAI implements many of the ten pillars: architecture (Pillar 1), scope (Pillar 3), modularity (Pillar 4), tool integration (Pillar 6), testing (Pillar 7), versioning (Pillar 8), and anti-patterns (Pillar 10).
context-management: PAI’s three-tier memory (hot/warm/cold) is a concrete implementation of context management strategies.

Promptfoo: LLM Evals & Red Teaming#

Original | Docs | Raw

Summary#

Open-source CLI and library for evaluating and red-teaming LLM apps. Now part of OpenAI. MIT licensed. The closest thing to a turnkey skill eval tool — YAML-based test cases, CI/CD integration, model comparison, and red teaming. Powers LLM apps serving 10M+ users in production.

Key Takeaways#

Developer-first: Fast, with live reload and caching. Runs 100% locally — prompts never leave your machine.
YAML-based test cases: Define inputs, expected outputs, and grading criteria in YAML. Directly maps to the eval.yaml format proposed in how-to-eval-a-skill.
Multiple grading methods: Exact match, contains, regex, LLM-as-judge, custom functions. Covers both deterministic (Tier 1) and LLM-graded (Tier 2) evaluation.
CI/CD integration: Run evals in GitHub Actions, block merges on failures. This is the “eval on every commit” pattern from evaluating-agent-skills-caparas.
Red teaming: Vulnerability scanning for prompt injection, jailbreaks, and other attacks. Relevant to ten-pillars-agentic-skill-design Pillar 6 (security).
Model comparison: Side-by-side comparison across providers (OpenAI, Anthropic, Azure, Bedrock, Ollama). Useful for the per-pattern model mapping that fabric supports.
Now part of OpenAI: Acquired but remains open source and MIT licensed.

Connections#

skill-evaluation: Promptfoo is the most practical tool for implementing the three-tier eval framework. Its YAML test cases + CI/CD integration + LLM-as-judge support covers Tiers 1 and 2.
how-to-eval-a-skill: The eval.yaml format we proposed could be implemented directly in Promptfoo’s YAML config format.
agent-skills-standard: An evals/ directory in the skill structure could contain Promptfoo config files, making skills self-evaluating.

Scion Documentation#

Original | Raw

Summary#

Scion is an experimental multi-agent orchestration testbed by google-cloud-platform, designed to manage concurrent LLM-based agents running in containers across local machines and remote kubernetes clusters. It acts as a “hypervisor for agents” — not a full multi-agent framework, but an infrastructure layer for running, isolating, and managing agent processes.

Key Takeaways#

Manager-Worker architecture: A host-side CLI (scion) orchestrates agent containers. Each agent runs an LLM + harness loop in isolation.
Harness-agnostic: Supports gemini-cli, claude-code, opencode, and codex through a common adapter interface. New harnesses can be added via a plugin-system.
Two operating modes: Solo mode (local Docker/Podman, zero-config) and Hosted mode (centralized hub dispatching to runtime-brokers).
Strict isolation: Each agent gets its own filesystem, credentials, environment variables, and git workspace. Shadow mounts (tmpfs) prevent cross-agent access.
Git-native workspaces: Local mode uses git worktrees; hosted mode uses git init + fetch. Each agent works on its own branch.
templates as blueprints: Agents are created from templates that define system prompts, tools, images, and configuration. Templates support inheritance.
agent-state-model: Three-dimensional state tracking — Phase (lifecycle), Activity (cognitive state), Detail (freeform context).
Philosophy: Favors isolation over constraints, interaction over autonomy, diversity of models/harnesses, and action over planning. Runs agents in --yolo mode with infrastructure-level guardrails.

Scope#

The documentation covers: architecture, core concepts, all four supported harnesses, configuration system, Hub/Broker distributed architecture, workspace strategies, security model, plugin system, and the Go package structure.

Mon, 01 Jan 0001 00:00:00 +0000

Skills Pipeline (Sleestk)#

Original | Obsidian Skill

Summary#

A collection of Claude skills organized as multi-stage pipelines. Three skill sets: a six-stage YouTube video production pipeline, a four-skill SaaS development stack, and a comprehensive Obsidian power user skill. Demonstrates how skills can be chained — each stage takes the previous stage’s output as input.

Key Takeaways#

Skills as pipelines, not standalone units: The YouTube pipeline chains 6 skills sequentially: Research → Script → SEO → Visual Director → Editor Brief → Thumbnail. Each skill’s output is the next skill’s input. This is a concrete implementation of the multi-skill pipeline pattern described in ten-pillars-agentic-skill-design (Pillar 9: Context Management Recipes).
Follows the Agent Skills format: Each skill is a directory with a SKILL.md file using YAML frontmatter (name, description) — exactly the agent-skills-standard spec. The Obsidian skill also uses references/ subdirectory for progressive disclosure.
Progressive disclosure in practice: The Obsidian skill loads reference files on demand (“Read the relevant reference file(s) before responding”). 8 reference files covering editing, linking, canvas, bases, plugins, publish, vault architecture. Quick-reference syntax is inline; deep details are deferred. This is the agent-skills-standard’s progressive disclosure pattern working in production.
Persona-driven skills: Each skill defines a persona (“seasoned Obsidian knowledge architect”, “expert Next.js developer”). The persona sets tone, output standards, and a core rule. This maps to the agent persona concept in ten-pillars-agentic-skill-design (Pillar 5).
Output format standards as tables: The Obsidian skill defines exact output formats per type (notes → markdown, canvas → JSON, bases → YAML, folder structures → tree + bash script). This is deterministic output specification — related to skill-evaluation’s outcome goals.
Decision logic before responding: The Obsidian skill has explicit decision logic: identify output type → load reference files → produce complete output → handle multi-category requests. This is a lightweight version of the agentic loop.
Test prompts included: 10 validation prompts built into the skill itself. This is inline evaluation — the skill ships with its own test cases, directly implementing skill-evaluation’s “start small with a targeted prompt set” advice.

The Three Skill Sets#

YouTube Pipeline (6 stages)#

research-agent.md → Research Brief
script-agent.md → Production-ready script
seo-agent.md → YouTube metadata package
visual-director.md → Visual Production Brief
editor-brief.md → Complete editing guide
thumbnail-agent.md → Thumbnail Creative Brief

SaaS Stack (4 skills)#

nextjs-developer — Next.js 16.2.1 expert
supabase-js — Full-stack Supabase developer
stripe-developer — Stripe payment integrations
vercel-developer — Deploy and manage on Vercel

Obsidian Power User (1 skill + 8 references)#

Comprehensive Obsidian expert covering vault design, canvas, bases, Dataview, Templater, all core plugins, Publish, Web Clipper, CSS snippets, URI links, and MOC notes.

Mon, 01 Jan 0001 00:00:00 +0000

Spec Kit: Spec-Driven Development Toolkit#

Original | Raw

Summary#

spec-kit is an open-source toolkit by GitHub for Spec-Driven Development (SDD) — a methodology where specifications become executable, directly generating working implementations rather than just guiding them. Provides a CLI (specify) and a set of slash commands that work across 30+ AI agents. MIT licensed.

Key Takeaways#

Specifications as the primary artifact: SDD flips the script — specs define the “what” before the “how.” Code is generated from specs, not the other way around. This is the opposite of “vibe coding.”
Seven-step workflow: Constitution (principles) → Specify (requirements) → Clarify → Plan (tech stack) → Tasks (breakdown) → Implement → Verify. Each step produces a persistent artifact.
Agent-agnostic: Supports 30+ AI agents including claude-code, Codex, Gemini CLI, Cursor, Kiro CLI, Copilot, and many more. Uses slash commands (/speckit.specify) or agent skills ($speckit-specify).
Extension ecosystem: 50+ community extensions covering docs, code review, process orchestration, integrations (Jira, Azure DevOps, Confluence, GitHub Projects), visibility, and security. Categories: docs, code, process, integration, visibility.
Presets for customization: Override templates and commands without changing tooling. Stack multiple presets with priority ordering. Examples: pirate speak, compliance-oriented formats, domain-specific terminology.
Constitution as governance: A constitution.md file establishes project principles that guide all subsequent development — similar to claude-code’s CLAUDE.md but focused on project governance rather than agent behavior.
Feature branching built in: Each feature gets its own branch (001-create-taskify), spec directory, plan, and task breakdown. Git-native workflow.
Clarification before planning: /speckit.clarify runs structured questioning to reduce rework downstream. “Do not treat the first attempt as final.”
Task breakdown with dependency management: Tasks ordered by dependency, parallel execution markers ([P]), file path specifications, TDD structure, checkpoint validation.
Skills integration: Installs as Claude Code skills (.claude/skills/) or Codex skills (.agents/skills/). Follows the agent-skills-standard pattern.

Connections#

agent-skills-standard: Spec Kit installs its commands as skills following the standard. Each /speckit.* command is effectively a skill with a specific workflow.
ten-pillars-agentic-skill-design: SDD operationalizes several pillars — Architecture (Pillar 1: structured artifacts), Documentation (Pillar 2: specs as documentation), Scope (Pillar 3: one feature per branch), Testing (Pillar 7: verify step), Version Control (Pillar 8: feature branching).
claude-code: First-class integration. Skills installed in .claude/skills/. Constitution parallels CLAUDE.md.
multi-agent-orchestration: Agent-agnostic like scion, but at the methodology layer. The workflow is the same regardless of which agent executes it.
context-management: The seven-step workflow is progressive disclosure applied to development — each step produces a focused artifact that feeds the next, rather than dumping everything into one prompt.
skill-evaluation: The /speckit.analyze command (cross-artifact consistency check) and /speckit.checklist (quality validation) are built-in evaluation mechanisms — “unit tests for English.”
paperclip: Both address the “above the agent” layer, but differently. Paperclip = organizational orchestration (org charts, budgets). Spec Kit = methodological orchestration (specs, plans, tasks).
fabric: Spec Kit’s slash commands are like Fabric’s patterns — reusable, named workflows. But Spec Kit’s are sequential (a pipeline) while Fabric’s are independent.

The Ten Pillars of Agentic Skill Design#

Original

Author: Ian Forster (with support of Kiro) Version: 2.0 | November 2024

Summary#

A research paper proposing a comprehensive ten-pillar framework for designing agentic skills files — the modular extensions that encapsulate domain knowledge, workflows, and tool integrations for AI agents. Synthesizes software engineering principles, prompt engineering best practices, and analysis of 4,476+ GitHub repositories to address the lack of standardized design methodologies.

Activity Log

Mon, 01 Jan 0001 00:00:00 +0000

Wiki Log#

[2026-04-07] create | Wiki initialized#

Wiki structure created: CLAUDE.md schema, wiki/index.md, wiki/log.md, directory scaffolding. Ready for first ingest.

[2026-04-07] ingest | Scion Documentation#

Source: https://googlecloudplatform.github.io/scion — multi-agent orchestration testbed by Google Cloud Platform. Created 13 wiki pages:

Source: scion-docs
Entities: scion, google-cloud-platform
Concepts: agent, agent-state-model, grove, harness, hub, template, runtime, runtime-broker, plugin-system, multi-agent-orchestration

[2026-04-07] ingest | Kiro Autonomous Agent#

Source: https://kiro.dev/autonomous-agent/ — AWS’s frontier agent for autonomous development tasks. Created 6 wiki pages, updated 1:

Source: kiro-autonomous-agent
Entities: kiro, aws
Concepts: frontier-agent, kiro-powers
Updated: multi-agent-orchestration — added Kiro as second approach, comparison with Scion

[2026-04-07] ingest | Claude Code Documentation#

Source: https://code.claude.com/docs/en — Anthropic’s agentic coding tool. Created 4 wiki pages, updated 2:

LLM Wiki — Agentic AI Landscape

Cross-Source Theme Analysis#

How to Eval a Skill (Practical Guide)#

The Key Difference: Prompts vs. Skills#

Key Insights: The Agentic AI Landscape (April 2026)#

1. Five Layers Are Emerging#

Evidence Map: Supporting the Ten Pillars Framework#

Pillar 1: Architecture and Structure#

Agent Skills Standard#

Specification#

Required Frontmatter#

Optional Frontmatter#

Progressive Disclosure#

Agent State Model#

Phase (Lifecycle)#

Activity (Cognitive State)#

Detail (Freeform Context)#

Agent#

Properties#

State Model#

Context Management#

Why It Matters#

Progressive Disclosure#

Frontier Agent#

Distinction from Regular Agents#

Grove#

Scope#

Identification#

Contents#

Harness#

Purpose#

Interface#

Supported Harnesses#

Capability Matrix#

Extensibility#

Hub#

Responsibilities#

Communication with Brokers#

Agent Creation Flow (Hosted)#

See Also#

Kiro Powers#

Contents#

Purpose#

See Also#

LLM Wiki Pattern#

Core Insight#

MCP (Model Context Protocol)#

How Claude Code Uses MCP#

Usage Across the Ecosystem#

Significance#

Multi-Agent Orchestration#

Three Approaches Emerging#

Infrastructure-first: Scion#

Product-first: Kiro Autonomous Agent#

Plugin System#

Plugin Types#

Status#

See Also#

Prompt Engineering Patterns#

Chain of Thought (CoT)#

ReAct Pattern#

Self-Reflection (Reflexion)#

Runtime Broker#

Responsibilities#

Communication#

Examples of Broker Nodes#

Runtime#

Supported Runtimes#

Runtime Selection#

Skill Evaluation#

The Problem#

Three-Tier Framework#

Template#

Contents#

Inheritance#

Scopes#

Management#

Defaults#

Andrej Karpathy#

Relevant Work#