27 boosters for "evaluation" — open source, verified from GitHub, ready to install
Corporate Training Designer is an AI agent that helps enterprises design and optimize training programs through needs analysis, instructional design, and effectiveness evaluation. HR leaders, L&D professionals, and training managers use it to create behavior-change-focused curricula and leadership development initiatives.
Test Results Analyzer is an AI agent that transforms raw test data into actionable quality insights through comprehensive metrics analysis and strategic reporting. QA engineers, test managers, and development teams use it to accelerate test result evaluation and drive continuous improvement.
Promptfoo is an LLM evaluation and testing toolkit that helps developers systematically test, benchmark, and validate prompt performance across different models and scenarios. It's essential for teams building LLM applications who need rigorous quality assurance and prompt optimization.
Train object detection, image classification, and SAM/SAM2 segmentation models on managed cloud GPUs. No local GPU setup required—results are automatically saved to the Hugging Face Hub. Use this skill when users want to: Helper scripts use PEP 723 inline dependencies. Run them with :
This skill is for running evaluations against models on the Hugging Face Hub on local hardware. It does not cover: If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the skill and pass it one of the local scripts in this skill.
"name": "huggingface-skills", "description": "Agent Skills for AI/ML tasks including dataset creation, model training, evaluation, and research paper publishing on Hugging Face Hub", "name": "Hugging Face"
This skill automates the process of adding, extracting, and managing evaluation results in Hugging Face model cards, supporting multiple data sources including Artificial Analysis API and custom evaluations with vLLM/lighteval. It's valuable for ML practitioners and model maintainers who need to track and display model performance metrics.
Use this agent when documentation in the `architecture/` directory needs to be updated or created for a specific file after implementing a feature, fix, refactor, or behavior change. Launch one instance of this agent per file that needs updating. This agent maintains the *contents* of architecture documentation files; it does not decide which files exist or how the directory is organized.

Examples:

- Example 1:
  Context: A developer just finished implementing OPA policy evaluation in the sandbox system.
  user: "I just finished implementing the OPA engine in crates/openshell-sandbox/src/opa.rs. Update architecture/sandbox.md to reflect the new policy evaluation flow."
  assistant: "I'll launch the arch-doc-writer agent to update the sandbox architecture documentation with the new OPA policy evaluation details."
  <uses Task tool to launch arch-doc-writer with instructions to update architecture/sandbox.md>

- Example 2:
  Context: A refactor changed how the HTTP CONNECT proxy handles allowlists.
  user: "The proxy allowlist logic was refactored. Please update architecture/proxy.md."
  assistant: "Let me use the arch-doc-writer agent to synchronize the proxy documentation with the refactored allowlist logic."
  <uses Task tool to launch arch-doc-writer with instructions to update architecture/proxy.md>

- Example 3:
  Context: After implementing a new CLI command, the assistant proactively updates docs.
  user: "Add a --rego-policy flag to the CLI."
  assistant: "Here is the implementation of the --rego-policy flag."
  <implementation complete>
  assistant: "Now let me launch the arch-doc-writer agent to update the CLI architecture documentation with the new flag."
  <uses Task tool to launch arch-doc-writer with instructions to update architecture/cli.md>

- Example 4:
  Context: A user wants high-level overview documentation for a non-engineering audience.
  user: "Update architecture/overview.md with a non-engineer-friendly explanation of the sandbox system."
  assistant: "I'll launch the arch-doc-writer agent to create an accessible overview of the sandbox system for non-technical readers."
  <uses Task tool to launch arch-doc-writer with audience=non-engineer directive>

- Example 5:
  Context: Multiple files need updating after a large feature lands.
  user: "I just landed the network namespace isolation feature. Update architecture/sandbox.md and architecture/networking.md."
  assistant: "I'll launch two arch-doc-writer agents, one for each file, to update the documentation in parallel."
  <uses Task tool to launch arch-doc-writer for architecture/sandbox.md>
  <uses Task tool to launch arch-doc-writer for architecture/networking.md>
"name": "research-companion", "description": "Strategic research thinking agents — idea evaluation, project triage, and structured brainstorming inspired by Carlini's research methodology", "name": "Andre Huang",
Acts as a Skill Architect: conducts a smart interview to extract the workflow from the user's head, generates a complete AI Skill, then tests and improves it continuously until it reaches production quality. The user does NOT need to know what a skill is.
"name": "prism-mcp-server", "mcpName": "io.github.dcostenco/prism-mcp", "description": "The Mind Palace for AI Agents — persistent memory (SQLite/Supabase), behavioral learning & IDE rules sync, multimodal VLM image captioning, pluggable LLM providers (OpenAI/Anthropic/Gemini/Ollama), OpenTelemetry
Use AskUserQuestion to ask the buyer: Tell the user the version was updated, then re-read the EVALUATION.md file from the updated directory and proceed with the skill. After the preamble, read the full evaluation methodology:
Brain in the Fish evaluates documents (essays, policies, contracts, clinical reports, surveys) against evaluation criteria using a panel of AI agents. Each agent's mental state exists as OWL ontology. Scoring is grounded in an Evidence Density Scorer (EDS) that makes hallucination mathematically det
"name": "digital-marketing-pro", "description": "Plan, execute, and measure digital marketing across all channels. 25 specialist agents handle strategy, SEO, paid ads, content, email, social, PR, analytics, CRO, and agency operations — with brand voice enforcement, quality evaluation, multilingual s
AgentAsJudge is an agentic evaluation framework that enables AI systems to critically review educational introductions by validating them against specified quality metrics and providing constructive feedback. It benefits educators, instructional designers, and developers building AI-assisted learning platforms who need reliable, fair assessment of educational content.
AgentAsJudge is an agentic evaluation framework that enables AI to systematically assess and compare the quality of multiple-choice questions across educational value, clarity, and answerability. It benefits educators, content creators, and assessment teams looking to automate quality control of exam and quiz questions.
"version": "5.10.0", "description": "Memory → Evaluation → Credential → Access Control for AI agents. Persistent memory with W3C Verifiable Credentials, capability-based access control, drift detection, and FSRS-6 spaced repetition.", "name": "kobie3717",
"name": "open-academic-paper-machine", "description": "Open Academic Paper Machine — Autonomous academic paper production system with idea evaluation gate and paper-vs-code audit. NEW in v6.4: /audit-paper command and audit-engine skill — static audit of a paper's empirical claims (datasets, models,
"name": "cre-skills", "description": "112 institutional-grade CRE skills covering ~97% of commercial real estate workflow steps. Deal screening, underwriting, structuring, due diligence, capital markets, market research, asset management, leasing, investor relations, development, disposition, sourci
ArmBench-LLM is a system prompt framework for evaluating large language models on Armenian language tasks through structured multiple-choice questions. It's designed for developers and AI researchers who need standardized benchmarking tools across popular coding assistants and chat platforms.
ArmBench-LLM is a system prompt for benchmarking large language models using Armenian character-to-numeric matching tasks. It's designed for developers evaluating LLM performance across multiple coding platforms.
This skill enables developers to create cryptographically signed, immutable constitutions for AI tool-use governance in OpenClaw, with Ed25519 signing, GitTruth attestation, and policy evaluation artifacts. It's designed for teams implementing constitutional governance frameworks for AI agents.
A Cursor-specific ruleset that enforces Python development standards using uv for package management and Pydantic v2, designed to ensure consistent tooling practices across AI-assisted coding workflows.
Luna is a specialized UI/UX agent that helps developers design, review, and improve user interfaces through expert guidance on components, accessibility, responsive layouts, and user interaction patterns. It's ideal for developers building React applications who want professional feedback on their UI code and design decisions.