Skill

trl-training

by huggingface

Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "trl-training" skill in my project.

Please run this command in my terminal:
# Install skill into your project
mkdir -p .claude/skills/trl-training && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/trl-training/SKILL.md "https://raw.githubusercontent.com/huggingface/skills/main/skills/trl-training/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
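Before restarting, it can be worth confirming that the download actually produced a non-empty file, since a failed `curl` may leave nothing (or an empty file) behind. The following is a minimal sketch; the helper name `skill_installed` is made up for illustration and is not part of the skill or of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given file exists and is non-empty.
skill_installed() {
  [ -s "$1" ]
}

# Path taken from the install command above.
skill_path=".claude/skills/trl-training/SKILL.md"

if skill_installed "$skill_path"; then
  echo "skill file present: $skill_path"
else
  echo "skill file missing or empty: $skill_path"
fi
```

If the check reports a missing file, re-running the install command is the simplest fix.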

Description

Train and fine-tune transformer language models using TRL (Transformer Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO, and Reward Model training via CLI commands.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

• SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
• DPO (Direct Preference Optimization): Align models using preference data
• GRPO (Group Relative Policy Optimization): Train models by sampling several outputs per prompt and optimizing each one based on its reward relative to the rest of the group
• RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
• Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, so it integrates directly with the rest of the Hugging Face ecosystem.

TRL Training Skill

You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

Full training:

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```

Train with LoRA adapters:

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```
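The batch-size flags in the SFT commands interact: the number of sequences contributing to each optimizer step is the per-device batch size multiplied by the gradient accumulation steps and by the number of training processes. The sketch below works through that arithmetic with the values used above. The process count of 1 is an assumption for a single-GPU run and is not stated in the original commands.

```shell
#!/usr/bin/env bash
# Values copied from the SFT commands above.
per_device_train_batch_size=2
gradient_accumulation_steps=8

# Assumption: one training process (a single GPU). Adjust for multi-GPU runs.
num_processes=1

# Sequences consumed per optimizer step.
effective_batch_size=$(( per_device_train_batch_size * gradient_accumulation_steps * num_processes ))

echo "effective batch size: ${effective_batch_size}"
```

With `--packing` enabled, each of those sequences may contain several shorter training examples packed together, so the number of original examples per step can be higher than this figure.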

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

Full training:

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns
```

Train with LoRA adapters:

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
```
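To make "chosen/rejected pairs" concrete, the sketch below writes a single preference record in JSON Lines form and checks that it carries both sides of the pair. The record content and the temporary file are made up for illustration. The `prompt`/`chosen`/`rejected` field names follow the explicit-prompt preference layout that TRL accepts; the exact layout of `trl-lib/ultrafeedback_binarized` may differ and is not shown in the original.

```shell
#!/usr/bin/env bash
# Hypothetical sample file; the record below is invented for illustration.
data_file="$(mktemp)"

cat > "$data_file" <<'EOF'
{"prompt": "What color is the sky?", "chosen": "It is blue.", "rejected": "It is green."}
EOF

# Count records that lack a "chosen" field followed by a "rejected" field.
# grep exits non-zero when the count is 0, hence the "|| true".
missing=$(grep -cv -e '"chosen":.*"rejected":' "$data_file" || true)

echo "records missing a chosen/rejected pair: ${missing}"
```

A check like this only inspects field names in a fixed order; it is a quick sanity test, not a substitute for loading the dataset with a proper JSON parser.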

Apache-2.0 License

Works With

Claude Code