23 boosters for "evaluation" — AI-graded, open source, ready to install
Promptfoo is an LLM evaluation and testing toolkit that enables developers to systematically test, benchmark, and validate LLM prompts and RAG systems. It's essential for teams building production LLM applications who need confidence in prompt quality and model behavior.
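For a sense of what a promptfoo run looks like, here is a minimal sketch that generates a config and invokes the CLI. The layout follows promptfoo's documented `promptfooconfig.yaml` schema, but the provider id, test case, and rubric text are illustrative, not taken from this listing.

```python
# Minimal sketch: write a promptfoo config, then run `promptfoo eval`.
import subprocess
import yaml  # pip install pyyaml

config = {
    "prompts": ["Summarize in one sentence: {{text}}"],
    "providers": ["openai:gpt-4o-mini"],
    "tests": [
        {
            "vars": {"text": "Promptfoo runs each test case against every prompt/provider pair."},
            "assert": [
                {"type": "contains", "value": "test"},
                # AI-graded assertion: a model judges the output against a rubric.
                {"type": "llm-rubric", "value": "Is a single, accurate sentence."},
            ],
        }
    ],
}

with open("promptfooconfig.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

subprocess.run(["promptfoo", "eval", "-c", "promptfooconfig.yaml"], check=True)
```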
This skill automates adding, extracting, and managing evaluation results in Hugging Face model cards, with support for multiple data sources, including the Artificial Analysis API and custom evaluations run with vLLM/lighteval. It's valuable for ML practitioners and model maintainers who need to track and display model performance metrics.
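The underlying artifact such a skill manages is the `model-index` block in a model card's YAML metadata, which `huggingface_hub` can update programmatically. A sketch of that mechanism, with a placeholder repo id and score (the skill's own commands are not documented here):

```python
# Update the `model-index` evaluation metadata of a Hugging Face model card.
from huggingface_hub import metadata_update

metadata = {
    "model-index": [
        {
            "name": "my-org/my-model",  # placeholder repo id
            "results": [
                {
                    "task": {"type": "text-generation"},
                    "dataset": {"name": "GSM8K", "type": "gsm8k"},
                    "metrics": [{"type": "accuracy", "value": 0.82}],  # placeholder score
                }
            ],
        }
    ]
}

metadata_update("my-org/my-model", metadata, overwrite=True)
```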
Automates GitHub pull request analysis by gathering diffs, comments, related issues, and local code context to provide comprehensive reviews. Developers and code reviewers benefit from faster, more thorough PR evaluations.
A quality assessment skill that runs automated validation, AI-powered evaluation, and risk scoring on SpecWeave increments to enforce quality gates. Developers and QA teams benefit from automated pass/fail decisions with detailed reasoning.
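A hypothetical sketch of what a pass/fail quality gate of this shape could look like: combine automated checks, an AI-graded score, and a risk score into one decision with a reason. The thresholds, weights, and `QualityReport` fields are invented for illustration; SpecWeave's actual scoring is not documented here.

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    checks_passed: int   # automated validation results
    checks_total: int
    ai_score: float      # AI-powered evaluation, 0.0-1.0
    risk_score: float    # estimated risk, 0.0 (safe) - 1.0 (risky)

def gate(report: QualityReport, min_ai: float = 0.7, max_risk: float = 0.4) -> tuple[bool, str]:
    if report.checks_passed < report.checks_total:
        return False, f"{report.checks_total - report.checks_passed} validation check(s) failed"
    if report.ai_score < min_ai:
        return False, f"AI evaluation {report.ai_score:.2f} below threshold {min_ai}"
    if report.risk_score > max_risk:
        return False, f"risk score {report.risk_score:.2f} exceeds {max_risk}"
    return True, "all gates passed"

passed, reason = gate(QualityReport(checks_passed=12, checks_total=12, ai_score=0.84, risk_score=0.2))
print(passed, reason)
```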
AgentAsJudge is an agentic evaluation framework that enables AI systems to critically review educational introductions by validating them against specified quality metrics and providing constructive feedback. It benefits educators, instructional designers, and developers building AI-assisted learning platforms who need reliable, fair assessment of educational content.
AgentAsJudge is an agentic evaluation framework that enables AI to systematically assess and compare the quality of multiple-choice questions across educational value, clarity, and answerability. It benefits educators, content creators, and assessment teams looking to automate quality control of exam and quiz questions.
Opendidac Cursor Rules is a specialized prompt that transforms Cursor into a senior full-stack developer assistant optimized for building an educational platform with diverse question types, code execution environments, and real-time evaluation tracking. It benefits educators and developers building sophisticated assessment and training systems.
ArmBench-LLM is a system prompt framework for evaluating large language models on Armenian language tasks through structured multiple-choice questions. It's designed for developers and AI researchers who need standardized benchmarking tools across popular coding assistants and chat platforms.
ArmBench-LLM is a system prompt for benchmarking large language models using Armenian character-to-numeric matching tasks. It's designed for developers evaluating LLM performance across multiple coding platforms.
A system prompt that configures an AI agent to evaluate text by delegating to a tool called `evaluate_review_text`, then summarizing the results and updating graph state. Best suited for developers building evaluation workflows in Claude-based IDEs.
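A plausible shape for that workflow, sketched with LangGraph. `evaluate_review_text` is the tool named in the entry, stubbed here, and the state fields are assumptions:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict):
    text: str
    evaluation: dict
    summary: str

def evaluate_review_text(text: str) -> dict:
    # Stub for the real tool: would return structured evaluation results.
    return {"score": 0.9, "issues": []}

def evaluate_node(state: ReviewState) -> dict:
    return {"evaluation": evaluate_review_text(state["text"])}

def summarize_node(state: ReviewState) -> dict:
    ev = state["evaluation"]
    return {"summary": f"score={ev['score']}, {len(ev['issues'])} issue(s) found"}

builder = StateGraph(ReviewState)
builder.add_node("evaluate", evaluate_node)
builder.add_node("summarize", summarize_node)
builder.set_entry_point("evaluate")
builder.add_edge("evaluate", "summarize")
builder.add_edge("summarize", END)
graph = builder.compile()

print(graph.invoke({"text": "The review text to evaluate."}))
```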
An MCP server that provides structured access to adversarial tactics and cyber attack techniques for security research, penetration testing, and AI safety evaluation. Useful for security professionals, red teamers, and AI safety researchers studying attack vectors.
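A sketch of what such a server's tool surface might look like, using the official Python MCP SDK (FastMCP). The tool name and inline data are hypothetical; a real server would back this with the full technique dataset. The two sample entries use real MITRE ATT&CK ids.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("adversarial-tactics")

TECHNIQUES = {  # tiny placeholder; ATT&CK defines hundreds of techniques
    "T1566": {"name": "Phishing", "tactic": "Initial Access"},
    "T1059": {"name": "Command and Scripting Interpreter", "tactic": "Execution"},
}

@mcp.tool()
def get_technique(technique_id: str) -> dict:
    """Look up an attack technique by its ID (e.g. T1566)."""
    return TECHNIQUES.get(technique_id, {"error": f"unknown technique {technique_id}"})

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```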
PrismBench enables developers to create specialized LLM agents through YAML configuration for comprehensive benchmarking and evaluation of language model capabilities. Teams building AI evaluation systems and ML testing pipelines benefit from its systematic Monte Carlo Tree Search approach and containerized deployment.
PrismBench enables developers to create specialized LLM agents through YAML configuration for systematic evaluation of model capabilities using Monte Carlo Tree Search. Useful for ML engineers, researchers, and teams building production LLM systems who need comprehensive benchmarking and evaluation frameworks.
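Both PrismBench entries center on defining agents in YAML and searching their behavior with MCTS. A hypothetical sketch of that configuration pattern follows; the field names are invented, and PrismBench's actual schema may differ.

```python
from dataclasses import dataclass
import yaml  # pip install pyyaml

AGENT_SPEC = """
name: code-reasoner
model: gpt-4o
role: |
  Solve the given programming challenge step by step.
search:
  strategy: mcts        # Monte Carlo Tree Search over solution paths
  max_rollouts: 64
"""

@dataclass
class AgentConfig:
    name: str
    model: str
    role: str
    search: dict

spec = yaml.safe_load(AGENT_SPEC)
agent = AgentConfig(**spec)
print(agent.name, agent.search["strategy"])
```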
This skill enables developers to create cryptographically signed, immutable constitutions for AI tool-use governance in OpenClaw, with Ed25519 signing, GitTruth attestation, and policy evaluation artifacts. It's designed for teams implementing constitutional governance frameworks for AI agents.
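A minimal sketch of just the signing step: Ed25519 over a constitution document, using the `cryptography` library. Key management, GitTruth attestation, and OpenClaw's artifact format are out of scope here, and the constitution text is a placeholder.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

constitution = b"1. The agent may not invoke tools outside the allowlist.\n"

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

signature = private_key.sign(constitution)  # 64-byte Ed25519 signature

try:
    public_key.verify(signature, constitution)         # passes: document intact
    public_key.verify(signature, constitution + b"x")  # raises: tampered
except InvalidSignature:
    print("tampering detected")
```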
A Chief Technology Officer agent that guides enterprise technology strategy decisions, including investment evaluation, technical vision setting, and architectural planning. Ideal for organizations needing structured CTO-level guidance on technology roadmaps and innovation initiatives.
A Cursor-specific ruleset that enforces Python development standards, using uv for package management and Pydantic v2 for data validation, designed to keep tooling practices consistent across AI-assisted coding workflows.
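The kind of Pydantic v2 idiom such a ruleset would steer toward: `field_validator` and `model_validate` rather than the v1 `validator` and `parse_obj`. The model fields below are illustrative.

```python
# Dependencies would be managed with uv, e.g. `uv add pydantic`.
from pydantic import BaseModel, field_validator

class Settings(BaseModel):
    api_url: str
    timeout_s: int = 30

    @field_validator("api_url")
    @classmethod
    def must_be_https(cls, v: str) -> str:
        if not v.startswith("https://"):
            raise ValueError("api_url must use https")
        return v

settings = Settings.model_validate({"api_url": "https://api.example.com"})
print(settings.model_dump())
```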
A knowledge curation agent that researches, validates, and determines optimal storage methods (URL reference, local excerpt, or embedding) for information sources using parallel web scraping. Ideal for AI engineers building knowledge-intensive agent systems who need intelligent source evaluation and integration.
Spec-Judge evaluates and selects the best versions of requirement, design, and task specification documents against criteria such as completeness, clarity, feasibility, and innovation. It helps development teams streamline spec development workflows by providing systematic evaluation and comparison of document versions.
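A hypothetical sketch of the selection logic: score each document version on weighted criteria and pick the best. The criteria names come from the entry; the weights and scores are invented.

```python
CRITERIA_WEIGHTS = {"completeness": 0.3, "clarity": 0.3, "feasibility": 0.25, "innovation": 0.15}

def overall(scores: dict[str, float]) -> float:
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

versions = {
    "v1": {"completeness": 0.9, "clarity": 0.7, "feasibility": 0.8, "innovation": 0.5},
    "v2": {"completeness": 0.8, "clarity": 0.9, "feasibility": 0.85, "innovation": 0.7},
}

best = max(versions, key=lambda name: overall(versions[name]))
print(best, round(overall(versions[best]), 3))
```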
Luna is a specialized UI/UX agent that helps developers design, review, and improve user interfaces through expert guidance on components, accessibility, responsive layouts, and user interaction patterns. It's ideal for developers building React applications who want professional feedback on their UI code and design decisions.
A clinical triage specialist agent that proactively assesses patient symptoms and urgency using evidence-based protocols to determine appropriate care settings. Benefits healthcare providers, telehealth platforms, and patient-facing health applications.
Scite integrates scientific literature search and evaluation into Claude, allowing developers to ground AI responses in peer-reviewed research with full-text access and trust assessment. Ideal for researchers, academics, and developers building knowledge-intensive applications.
Tara is a Design QA Agent that automates visual regression testing, cross-browser validation, accessibility audits, and responsive design evaluation for design teams. It benefits developers and QA professionals who need systematic, reusable testing across projects.
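One building block of visual regression testing, sketched with Pillow: a pixel diff between a baseline screenshot and a new capture. The file names are placeholders, and a full harness would also handle capture, masking, and tolerance thresholds.

```python
from PIL import Image, ImageChops

baseline = Image.open("baseline.png").convert("RGB")
current = Image.open("current.png").convert("RGB")

diff = ImageChops.difference(baseline, current)
bbox = diff.getbbox()  # None when the images are pixel-identical

if bbox is None:
    print("no visual change")
else:
    print(f"visual change detected in region {bbox}")
```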
Enables cryptographically signed constitutional governance for AI tool use, allowing teams to define immutable policy layers (constitution + signature) beneath mutable identity guidance. Ideal for organizations deploying autonomous agents requiring tamper-evident audit trails and policy enforcement.