Train and Fine-Tune LLMs Using TRL

v20260701
trl-training
This skill covers training and fine-tuning transformer language models with the TRL (Transformer Reinforcement Learning) library. It supports post-training methods including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and Reward Model training. Use it to align and customize foundation models via CLI commands.
TRL Training Skill

When to Use

Use this skill when you need to train or fine-tune transformer language models with TRL (Transformer Reinforcement Learning). It supports SFT, DPO, GRPO, KTO, RLOO, and Reward Model training via CLI commands.

You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

  • SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
  • DPO (Direct Preference Optimization): Align models using preference data
  • GRPO (Group Relative Policy Optimization): Sample several completions per prompt and optimize the policy using each completion's reward relative to the rest of its group
  • RLOO (Reinforce Leave One Out): Online RL training with generation-based rewards
  • Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

Core Commands

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

Full training:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub

Train with LoRA adapters:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
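
As a rough illustration of why the LoRA run above trains far fewer weights than full training, the sketch below counts trainable parameters for a single linear layer. The 896 by 896 layer shape is an illustrative assumption, not a value taken from the source, and the alpha / r scaling follows the standard LoRA formulation; the helper name lora_stats is hypothetical.

```python
def lora_stats(d_in: int, d_out: int, r: int, alpha: int) -> dict:
    """Parameter counts and update scaling for one LoRA-adapted linear layer."""
    full = d_in * d_out          # weights updated by full fine-tuning
    lora = r * (d_in + d_out)    # A is (r x d_in), B is (d_out x r)
    return {
        "full_params": full,
        "lora_params": lora,
        "fraction": lora / full,
        "scaling": alpha / r,    # the low-rank update is multiplied by alpha / r
    }


# Assumed layer shape; r and alpha match --lora_r 32 --lora_alpha 16 above.
stats = lora_stats(d_in=896, d_out=896, r=32, alpha=16)
print(stats)
```

With these values the adapter holds roughly 7% of the layer's weights, and the update is scaled by 0.5.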

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

Full training:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns

Train with LoRA adapters:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
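
Before launching a DPO run, it can help to sanity-check a few rows of the preference data. The sketch below is a hypothetical helper, not part of TRL; it assumes each example is a dict with "chosen" and "rejected" columns, as implied by the chosen/rejected pairs described above.

```python
REQUIRED_COLUMNS = ("chosen", "rejected")  # assumed column names


def check_preference_example(example: dict) -> list[str]:
    """Return a list of problems found in one preference example."""
    problems = []
    for key in REQUIRED_COLUMNS:
        if key not in example:
            problems.append(f"missing column: {key}")
        elif not example[key]:
            problems.append(f"empty column: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and example["chosen"] == example["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


good = {"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"}
bad = {"prompt": "2 + 2 = ?", "chosen": "4"}
print(check_preference_example(good))
print(check_preference_example(bad))
```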

trl grpo - Group Relative Policy Optimization

Train models with rewards computed on their own sampled generations, scored by reward functions or an LLM-as-a-judge.

Basic usage:

trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2.5-0.5B-GRPO \
  --push_to_hub
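
The "group relative" part of GRPO can be sketched in a few lines: rewards for completions sampled from the same prompt are normalized against that group's own mean and spread. This is a simplified illustration, not TRL's implementation; the function name, the use of the population standard deviation, and the eps value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two of them judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed for the baseline.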

trl rloo - Reinforce Leave One Out

Online RL training where the model generates text and receives rewards based on custom criteria.

Basic usage:

trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2.5-0.5B-RLOO \
  --push_to_hub
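
The "leave one out" idea behind RLOO can be illustrated as follows: each sampled completion is compared against the average reward of the other completions for the same prompt. This is a minimal sketch of the baseline computation only, with a hypothetical function name, and is not TRL's implementation.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


advantages = rloo_advantages([1.0, 0.0, 0.0])
print(advantages)
```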

trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.

Full training:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048

Train with LoRA adapters:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
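
A reward model trained on chosen/rejected pairs is commonly fit with a pairwise (Bradley-Terry) objective that pushes the chosen score above the rejected one. The sketch below shows that loss for a single pair; the source does not state which loss trl reward uses, so treat this as an assumption about the usual setup, with a hypothetical function name.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(pairwise_reward_loss(2.0, 0.0))  # chosen ranked higher: small loss
print(pairwise_reward_loss(0.0, 2.0))  # chosen ranked lower: large loss
```

The loss shrinks as the margin between the two scores grows, and equals log 2 when the model cannot tell the pair apart.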

Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.

Example config (sft_config.yaml):

model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio

Launch with config:

trl sft --config sft_config.yaml

Override config values:

trl sft --config sft_config.yaml --learning_rate 1.0e-5
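
The precedence shown in the override command can be modeled as a dictionary merge in which command-line values win over values from the file. This is only a mental model with hypothetical names, not TRL's argument parser.

```python
def resolve_config(file_values: dict, cli_overrides: dict) -> dict:
    """Merge config-file values with CLI overrides; CLI values take precedence."""
    merged = dict(file_values)
    # Flags that were not passed (None) leave the file value untouched.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


file_values = {"learning_rate": 2.0e-5, "num_train_epochs": 1, "use_peft": True}
cli_overrides = {"learning_rate": 1.0e-5, "num_train_epochs": None}
resolved = resolve_config(file_values, cli_overrides)
print(resolved)
```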

Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.

Multi-GPU training:

trl sft \
  --config sft_config.yaml \
  --num_processes 4

Use predefined Accelerate configs:

TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3

trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2

Custom Accelerate config:

# Generate custom config
accelerate config

# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml

Fully Sharded Data Parallel (FSDP):

trl sft --config sft_config.yaml --accelerate_config fsdp2

DeepSpeed ZeRO:

trl sft --config sft_config.yaml --accelerate_config zero3

Troubleshooting

CUDA Out of Memory

  • Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
  • Enable --use_peft for LoRA training
  • Use --gradient_checkpointing to save memory
  • Try a smaller model or a shorter maximum sequence length (for example, a lower --max_length)
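
Trading batch size for gradient accumulation keeps the effective batch size constant while lowering peak memory per step. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_processes: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * num_processes


# Halving the per-device batch and doubling accumulation leaves the product unchanged.
before = effective_batch_size(per_device=2, grad_accum=8)
after = effective_batch_size(per_device=1, grad_accum=16)
print(before, after)
```

Because the product is unchanged, the learning rate usually does not need retuning when applying this fix.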

Dataset Loading Issues

  • Verify dataset exists: check Hugging Face Hub or local path
  • Check dataset format matches expected columns
  • Use --dataset_config for multi-config datasets
  • Inspect dataset: from datasets import load_dataset; ds = load_dataset(name)

Model Loading Issues

  • Verify model exists on Hugging Face Hub
  • Check if gated model requires authentication: hf auth login
  • For local models, provide absolute path
  • Ensure sufficient disk space and memory

Slow Training

  • Enable --packing when the dataset contains many short sequences
  • Use larger --per_device_train_batch_size if memory allows
  • Enable --tf32 for faster computation on Ampere or newer NVIDIA GPUs
  • Use --bf16 on supported hardware
  • Consider multi-GPU training with --num_processes
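
To see why packing helps with short sequences, the sketch below groups example lengths into bins of at most max_length tokens using a simple first-fit-decreasing heuristic. It is an illustration with assumed lengths and a hypothetical function name; the packing strategy TRL actually applies may differ.

```python
def pack_first_fit(lengths: list[int], max_length: int) -> list[list[int]]:
    """Greedily group sequence lengths into bins of at most max_length tokens."""
    bins: list[list[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > max_length:
            raise ValueError(f"sequence of {n} tokens exceeds max_length={max_length}")
        for b in bins:
            if sum(b) + n <= max_length:
                b.append(n)  # fits into an existing packed sequence
                break
        else:
            bins.append([n])  # no bin had room, so start a new one
    return bins


lengths = [100, 300, 200, 50, 350]  # assumed token counts per example
bins = pack_first_fit(lengths, max_length=512)
print(len(lengths), "padded rows vs", len(bins), "packed rows")
```

Without packing, each of the five examples would occupy its own padded row; here they fit into two rows with little padding.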

Generation Issues (GRPO/RLOO)

  • Check prompt format in dataset
  • Adjust --temperature and --top_p for generation
  • Verify that the reward function returns sensible scores on a few hand-written completions
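
One way to verify a reward function is to run it on a handful of hand-written completions before starting a training run. The function below is a hypothetical example written for this check; its name and its signature (a list of completions plus reference answers, returning one float per completion) are assumptions and should be compared against the reward-function interface in the TRL documentation.

```python
import re


def exact_answer_reward(completions: list[str], answers: list[str]) -> list[float]:
    """Hypothetical reward: 1.0 if the last number in a completion equals the reference."""
    rewards = []
    for text, answer in zip(completions, answers):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
        rewards.append(1.0 if numbers and numbers[-1] == answer else 0.0)
    return rewards


samples = ["So the total is 42.", "I think it is 7", "no idea"]
references = ["42", "8", "3"]
rewards = exact_answer_reward(samples, references)
print(rewards)
```

If every sample receives the same reward, group-based methods have no signal to learn from, so a check like this is worth doing early.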

Best Practices

  1. Start with SFT: Always fine-tune base models with SFT before preference alignment
  2. Use LoRA for efficiency: Enable --use_peft for faster training and lower memory
  3. Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
  4. Save checkpoints: TRL automatically saves checkpoints in --output_dir
  5. Test on small datasets first: Verify pipeline works before full training
  6. Use configuration files: Create YAML configs for reproducibility
  7. Leverage Accelerate: Use multi-GPU training for faster iteration

When helping users with TRL:

  • Always check which training method is appropriate for their use case
  • Verify dataset format matches the expected schema
  • Recommend starting with smaller models for testing
  • Suggest LoRA for resource-constrained environments
  • Point to specific documentation sections for advanced features

Limitations

  • Use this skill only when the task clearly matches its upstream product or API scope.
  • Verify commands, API behavior, pricing, quotas, credentials, and deployment effects against current official documentation before making changes.
  • Do not treat generated examples as a substitute for environment-specific tests, security review, or user approval for destructive or costly actions.
Info
Name trl-training
Version v20260701
Size 9.12KB
Updated At 2026-07-02