AI Summary
This skill automates adding, extracting, and managing evaluation results in Hugging Face model cards. It supports multiple data sources, including the Artificial Analysis API and custom evaluations run with vLLM/lighteval. It is aimed at ML practitioners and model maintainers who need to track and display model performance metrics.
Install
# Add to your project root as SKILL.md
curl -o SKILL.md "https://raw.githubusercontent.com/huggingface/skills/main/skills/hugging-face-evaluation/SKILL.md"
Description
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
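For orientation, model-index is the block in a model card's metadata that nests a task, a dataset, and a list of metrics under each result. The sketch below builds one such entry as a plain Python dict. The helper name `build_model_index`, the model name, and the score are hypothetical placeholders chosen for illustration; they are not functions or values taken from this skill.

```python
import json


def build_model_index(model_name, task_type, dataset_name, dataset_type,
                      metric_type, metric_value):
    """Return a dict shaped like the `model-index` block of a model card."""
    return {
        "model-index": [
            {
                "name": model_name,
                "results": [
                    {
                        "task": {"type": task_type},
                        "dataset": {"name": dataset_name, "type": dataset_type},
                        "metrics": [
                            {"type": metric_type, "value": metric_value}
                        ],
                    }
                ],
            }
        ]
    }


# Hypothetical example values, for illustration only.
entry = build_model_index(
    model_name="my-org/my-model",
    task_type="text-generation",
    dataset_name="MMLU",
    dataset_type="cais/mmlu",
    metric_type="accuracy",
    metric_value=0.5,
)
print(json.dumps(entry, indent=2))
```

In the card itself this structure is written as YAML rather than JSON, so a YAML library such as pyyaml would typically be used for the final serialization step.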
Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
• Extracting existing evaluation tables from README content
• Importing benchmark scores from Artificial Analysis
• Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)
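To make the first method concrete, here is a minimal sketch of pulling an evaluation table out of README text. It is a simplified stand-in, not the skill's own parser: the function names `parse_markdown_table` and `_split_row` are hypothetical, the sample README and its scores are invented, and escaped pipes inside cells are not handled.

```python
def _split_row(row):
    """Split one pipe-delimited table row into stripped cell strings."""
    return [cell.strip() for cell in row.strip("|").split("|")]


def parse_markdown_table(text):
    """Parse the first contiguous pipe table in `text` into row dicts."""
    block = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            block.append(stripped)
        elif block:
            break  # the first table has ended
    if len(block) < 3:
        return []  # need a header, a separator, and at least one data row
    header = _split_row(block[0])
    records = []
    for row in block[2:]:  # block[1] is the |---|---| separator row
        cells = _split_row(row)
        if len(cells) == len(header):
            records.append(dict(zip(header, cells)))
    return records


# Invented README content, for illustration only.
readme = """
# My model

| Benchmark | Score |
|-----------|-------|
| MMLU      | 71.2  |
| GSM8K     | 55.0  |

Some trailing text.
"""

rows = parse_markdown_table(readme)
scores = {row["Benchmark"]: float(row["Score"]) for row in rows}
print(scores)
```

Each extracted row can then be mapped to a dataset name and metric value before it is written into the card metadata.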
Features
• vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
• lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
• inspect-ai Framework: UK AI Safety Institute's evaluation library
• Standalone or Jobs: Run locally or submit to HF Jobs infrastructure
Usage Instructions
The skill includes Python scripts in scripts/ to perform operations.
Prerequisites
• Preferred: use uv run (the PEP 723 header auto-installs dependencies)
• Or install manually: pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
• Set the HF_TOKEN environment variable to a token with write access
• For Artificial Analysis: set the AA_API_KEY environment variable
• .env is loaded automatically if python-dotenv is installed
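As a rough illustration of these prerequisites, the sketch below shows what a PEP 723 inline metadata header looks like and how a script might check for the two environment variables before doing any work. The dependency list mirrors the manual install line above; the function name `missing_credentials` and its flag are hypothetical and are not taken from the skill's scripts.

```python
# /// script
# dependencies = [
#     "huggingface-hub",
#     "markdown-it-py",
#     "python-dotenv",
#     "pyyaml",
#     "requests",
# ]
# ///
import os


def missing_credentials(env, need_artificial_analysis=False):
    """Return the names of required variables that are unset or empty."""
    required = ["HF_TOKEN"]
    if need_artificial_analysis:
        required.append("AA_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    try:
        # Optional: pick up variables from a local .env file if available.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass
    missing = missing_credentials(os.environ, need_artificial_analysis=True)
    print("missing:", missing)
```

Run through uv run, the commented header lets uv resolve the listed packages before the script starts; run with plain python, the header is ignored as ordinary comments.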
Trust & Transparency
Open Source — Apache-2.0
Source code publicly auditable
Verified Open Source
Hosted on GitHub — publicly auditable
Actively Maintained
Last commit Yesterday
7.5k stars — Strong Community
438 forks