15 boosters for "benchmark" — open source, verified from GitHub, ready to install
skill-creator enables users to build, refine, and evaluate AI skills through an iterative workflow with testing and performance benchmarking. Developers and AI engineers benefit from streamlined skill development and optimization.
Performance Benchmarker is a specialized agent that helps developers measure, analyze, and optimize system performance across applications and infrastructure. It's ideal for teams needing data-driven insights into bottlenecks and performance improvements.
Finds the best models for a task by querying official HF benchmark leaderboards, enriching results with model size data, filtering for what fits on the user's device, and returning a comparison table with benchmark scores.
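The enrich, filter, and compare steps in the entry above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: it assumes leaderboard rows have already been fetched, and every name in it (`ModelRow`, `enrich_with_sizes`, `filter_for_device`, `comparison_table`), along with the memory estimate (parameter count times bytes per parameter times a 1.2 overhead factor), is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelRow:
    name: str
    score: float                      # benchmark score, higher is better
    params_b: Optional[float] = None  # parameter count in billions, if known


def enrich_with_sizes(rows: List[ModelRow], sizes: Dict[str, float]) -> List[ModelRow]:
    """Attach a parameter count to each row; unknown models keep None."""
    return [ModelRow(r.name, r.score, sizes.get(r.name)) for r in rows]


def estimated_memory_gb(params_b: float, bytes_per_param: float = 2.0,
                        overhead: float = 1.2) -> float:
    """Rough estimate (an assumption): billions of params * bytes per param * overhead."""
    return params_b * bytes_per_param * overhead


def filter_for_device(rows: List[ModelRow], device_memory_gb: float,
                      bytes_per_param: float = 2.0) -> List[ModelRow]:
    """Keep only models with a known size whose estimate fits in device memory."""
    return [
        r for r in rows
        if r.params_b is not None
        and estimated_memory_gb(r.params_b, bytes_per_param) <= device_memory_gb
    ]


def comparison_table(rows: List[ModelRow]) -> str:
    """Render a plain-text table sorted by score, best first."""
    ranked = sorted(rows, key=lambda r: r.score, reverse=True)
    lines = [f"{'model':<12} {'score':>6} {'params (B)':>10}"]
    for r in ranked:
        size = "n/a" if r.params_b is None else f"{r.params_b:.1f}"
        lines.append(f"{r.name:<12} {r.score:>6.1f} {size:>10}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rows standing in for leaderboard results.
    rows = [ModelRow("model-a", 71.2), ModelRow("model-b", 78.5),
            ModelRow("model-c", 64.0), ModelRow("model-d", 80.1)]
    sizes = {"model-a": 7.0, "model-b": 70.0, "model-c": 3.0}  # model-d unknown
    fitting = filter_for_device(enrich_with_sizes(rows, sizes), device_memory_gb=24.0)
    print(comparison_table(fitting))
```

In this sketch, models with no known size are dropped rather than guessed at, which is one possible design choice; the real skill may handle them differently.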
"name": "criterium", "description": "Skill for using criterium benchmarking library", "skills": ["./skills/criterium"]
Accessible workspace directory: !!<<<<||||workspace_dir||||>>>>!! When processing tasks, if you need to read/write local files and the user provides a relative path, you may choose to combine it with the above workspace directory to get the complete path. If you believe the task is completed, you ca
A system prompt that transforms an AI assistant into a job search agent capable of managing Notion databases, filtering job opportunities, and automating application workflows. Useful for job seekers and recruiters seeking to streamline application tracking.
is an eval workbench for agent skills. It runs a model in an isolated Docker directory, provides skills/references as normal workspace files, captures an agent trace, and grades deterministic local outcomes. Use this skill as the source of truth for authoring eval suites in this repo. Detailed sche
"name": "claude-mem-lite", "description": "Persistent long-term memory for Claude Code via MCP — captures coding decisions, bugfixes, and context across sessions. Hybrid FTS5 + TF-IDF search with episode batching. Single SQLite DB, no external services. A lighter, lower-cost alternative to claude-me
Use this skill to turn any content into professional visualization specs with strategy-consulting clarity: insight-led headlines, disciplined layout, accurate data, restrained design, and explicit implications. It covers board slides first, and generalizes to reports, proposals, training materials,
Testing and quality assurance specialist for Solana programs. Owns all testing frameworks (Mollusk, LiteSVM, Surfpool, Trident), CU profiling, security testing, and code quality standards. Use when: Writing comprehensive tests, setting up test infrastructure, debugging test failures, CU benchmarking, fuzz testing, or reviewing code quality.
"name": "vibe-science", "description": "Scientific research plugin with tracked claim/review/seed lifecycle, citation verification gates, strict integrity, benchmark recording, and retrieval closure.", "name": "Vibe Science Contributors",
ArmBench-LLM is a system prompt for benchmarking large language models using Armenian character-to-numeric matching tasks. It's designed for developers evaluating LLM performance across multiple coding platforms.
ArmBench-LLM is a system prompt framework for evaluating large language models on Armenian language tasks through structured multiple-choice questions. It's designed for developers and AI researchers who need standardized benchmarking tools across popular coding assistants and chat platforms.
Vibe coding benchmark provides Copilot-specific development guidelines emphasizing concise communication, minimal documentation, and secure git/bash practices. Developers using GitHub Copilot benefit from clear guardrails that reduce verbose outputs and prevent security mistakes.