Install
Copy this and paste it into Claude Code, Cursor, or any AI assistant:
I want to install the "skill-optimizer" skill in my project. Please run this command in my terminal:

# Install skill into your project
mkdir -p .claude/skills/skill-optimizer && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/skill-optimizer/SKILL.md "https://raw.githubusercontent.com/fastxyz/skill-optimizer/development/skills/skill-optimizer/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
Description
Use when creating, running, debugging, or documenting skill-optimizer workbench evals; working with agent skill cases, suites, graders, traces, Docker workspaces, OpenRouter model matrices, or the skill-optimizer SDK/CLI.
Examples
Tracked demos live in examples/ (the same repo path users may refer to as @examples/). Read these alongside the skill docs when building or debugging evals:

| Path | Why It Matters |
|------|----------------|
| examples/workbench/README.md | Short command walkthrough for demos |
| examples/workbench/pdf/README.md | Explains the PDF demo cases and expected outputs |
| examples/workbench/pdf/suite.yml | Concrete suite using models, setup, env, graders, and append prompt |
| examples/workbench/pdf/references/pdf-skill/SKILL.md | Example skill copied into /work for the agent |
| examples/workbench/pdf/checks/*.mjs | Deterministic grader and setup helper patterns |
| examples/workbench/mcp/suite.yml | Hidden-service MCP calculator example |
| examples/workbench/mcp/mcp/calculator-server.mjs | Example MCP server with add/subtract/multiply/divide tools |

```bash
npx tsx src/cli.ts run-suite examples/workbench/pdf/suite.yml --trials 1
npx tsx src/cli.ts run-suite examples/workbench/mcp/suite.yml --trials 1
```

The PDF demo covers setup, suite models, positive output grading, and trace-based negative grading.
skill-optimizer
skill-optimizer is an eval workbench for agent skills. It runs a model in an isolated Docker /work directory, provides skills/references as normal workspace files, captures an agent trace, and grades deterministic local outcomes. Use this skill as the source of truth for authoring eval suites in this repo. Detailed schema and patterns are in references/workbench.md.
Core Model
• A case is one user-like task plus one or more deterministic graders.
• A suite is a set of cases and OpenRouter models to run as a matrix.
• references are copied into /work before the agent starts; this is where eval skills live.
• The agent phase sees /work only. It cannot see /case, /results, graders, hidden answers, or hidden metadata.
• Cases can define mcpServers; these are exposed through a workbench mcp command during the agent phase.
• Graders run after the agent with /case, /work, and /results mounted.
• trace.jsonl is the debugging source for what the agent saw, said, and did.
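Since trace.jsonl is the debugging source, a first step when inspecting a run is usually to load it into memory. The record schema is not given here (it lives in references/workbench.md), so the sketch below assumes only that the file is JSON Lines, meaning one JSON value per non-empty line. `parseJsonl` and the sample records are hypothetical and are not part of the skill-optimizer SDK.

```typescript
// Hypothetical helper: parse JSON Lines text into an array of values.
// Assumes only that each non-empty line is one standalone JSON value;
// the actual trace record fields are not specified in this document.
function parseJsonl(text: string): unknown[] {
  const records: unknown[] = [];
  text.split(/\r?\n/).forEach((raw, index) => {
    const line = raw.trim();
    if (line === "") {
      return; // tolerate blank lines and a trailing newline
    }
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      // Report the 1-based line number so a malformed record is easy to find.
      throw new Error(`Invalid JSON on line ${index + 1}: ${String(err)}`);
    }
  });
  return records;
}

// Made-up sample content, not a real trace.
const sample = '{"step": 1, "note": "example"}\n\n{"step": 2, "note": "example"}\n';
const parsed = parseJsonl(sample);
console.log(`parsed ${parsed.length} records`);
```

In practice the text would come from reading the trace.jsonl file of a run; failing loudly on a malformed line is a design choice here, on the assumption that a truncated trace is itself worth noticing while debugging.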
Commands
| Goal | Command |
|------|---------|
| Install deps | npm install |
| Build CLI | npm run build |
| Run one case | npx tsx src/cli.ts run-case <case.yml> |
| Run one case across models | npx tsx src/cli.ts run-case <case.yml> --models openrouter/google/gemini-2.5-flash,openrouter/openai/gpt-5.4 |
| Run a suite | npx tsx src/cli.ts run-suite <suite.yml> |
| CLI help | npx tsx src/cli.ts --help |

Rules:
• Use only openrouter/... model refs.
• OPENROUTER_API_KEY is required for real model runs.
• run-suite uses models: from suite.yml; it has no model override flag.
• run-case can use its case model: or --model / --models.
• Docker image default is skill-optimizer-workbench:local.
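The rules above require openrouter/... model refs, and the command table shows --models taking a comma-separated list. A pre-flight check along those lines might look like the sketch below. `parseModelRefs` is a hypothetical helper written for illustration; it is not part of the skill-optimizer SDK or CLI, and the CLI may validate refs differently.

```typescript
// Hypothetical pre-flight helper; not part of the skill-optimizer SDK or CLI.
// Splits a comma-separated --models value and rejects any ref that does not
// use the openrouter/ prefix required by the rules above.
function parseModelRefs(value: string): string[] {
  const refs = value
    .split(",")
    .map((ref) => ref.trim())
    .filter((ref) => ref.length > 0);
  if (refs.length === 0) {
    throw new Error("No model refs given");
  }
  const invalid = refs.filter((ref) => ref.indexOf("openrouter/") !== 0);
  if (invalid.length > 0) {
    throw new Error(`Model refs must start with openrouter/: ${invalid.join(", ")}`);
  }
  return refs;
}

// The two refs from the command table above.
const models = parseModelRefs(
  "openrouter/google/gemini-2.5-flash,openrouter/openai/gpt-5.4"
);
console.log(models.join("\n"));
```

Checking refs before launching a run is cheap compared with starting a Docker workspace, which is the only reason to do it up front.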