Install
Copy this and paste it into Claude Code, Cursor, or any AI assistant:
I want to install the "trl-training" skill in my project. Please run this command in my terminal:

# Install skill into your project
mkdir -p .claude/skills/trl-training && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/trl-training/SKILL.md "https://raw.githubusercontent.com/huggingface/skills/main/skills/trl-training/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
Description
Train and fine-tune transformer language models using TRL (Transformer Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO, and reward model training via CLI commands.
Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

• SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
• DPO (Direct Preference Optimization): Align models using preference data
• GRPO (Group Relative Policy Optimization): Sample a group of completions per prompt and optimize the model on each completion's reward relative to the rest of its group
• RLOO (Reinforce Leave One Out): Online RL training with generation-based rewards
• Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, so it integrates directly with the rest of the Hugging Face ecosystem.
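The GRPO and RLOO entries above both score several sampled completions for the same prompt and compare each reward against the others. The following is a minimal, dependency-free sketch of that comparison, not TRL's implementation. The function names, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-4):
    """Group-relative advantages: center each reward on the group mean
    and scale by the group's standard deviation.

    Assumption: population standard deviation and eps=1e-4 are chosen
    for this sketch; a real trainer may normalize differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: compare each reward with the mean of
    the OTHER rewards sampled for the same prompt (needs >= 2 samples).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four completions of a single prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
print(grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

In both cases a completion that beats its siblings gets a positive advantage and a weaker one gets a negative advantage, so no separate value model is needed to provide a baseline.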
TRL Training Skill
You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.
trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.

Full training:

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```

Train with LoRA adapters:

```bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
```
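To make the numeric flags above easier to reason about, here is a small sketch of the arithmetic they imply. The helper names are hypothetical, the 1024 x 1024 matrix size is an illustrative assumption rather than a dimension of Qwen2-0.5B, and the scaling formula is the standard LoRA `alpha / r` rule, which variants may change.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Sequences contributing to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_params_per_matrix(d_in, d_out, lora_r):
    """Trainable parameters added for one adapted weight matrix:
    A has shape (r, d_in) and B has shape (d_out, r)."""
    return lora_r * (d_in + d_out)


# Values taken from the commands above; num_devices=1 is an assumption.
print(effective_batch_size(2, 8))              # 16
print(lora_scaling(lora_alpha=16, lora_r=32))  # 0.5

# Hypothetical 1024 x 1024 projection, for illustration only.
adapter = lora_params_per_matrix(1024, 1024, lora_r=32)
print(adapter, adapter / (1024 * 1024))        # 65536 0.0625
```

On a single device both commands therefore step the optimizer once per 16 sequences. The LoRA command also uses a learning rate ten times larger than the full-training one (2.0e-4 versus 2.0e-5), a common pairing when only the small adapter matrices are trained.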
trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).

Full training:

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns
```

Train with LoRA adapters:

```bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
```
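For readers who want to see what a chosen/rejected pair contributes during training, the sketch below computes the standard sigmoid DPO loss for a single pair in plain Python. It illustrates the published objective and is not TRL's code. The function name, the `beta=0.1` default, and the log-probability values are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Hypothetical values where the policy has moved toward the chosen response.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's, which is why each training record needs both responses for the same prompt.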