Train and Fine-Tune LLMs Using TRL

v20260701

trl-training

This skill provides expert capabilities for training and fine-tuning transformer language models using the TRL (Transformer Reinforcement Learning) library. It supports post-training methods including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and Reward Model training. Use this skill to align and customize foundation models via CLI commands.

LLM NLP Deep Learning Fine-Tuning Reinforcement Learning Transformers AI

TRL Training Skill

When to Use

Use this skill when you need to train or fine-tune transformer language models using TRL (Transformer Reinforcement Learning). It supports SFT, DPO, GRPO, KTO, RLOO, and Reward Model training via CLI commands.

You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
DPO (Direct Preference Optimization): Align models using preference data
GRPO (Group Relative Policy Optimization): Sample a group of completions per prompt, score them with reward functions, and optimize the policy using each completion's reward relative to the rest of its group
RLOO (REINFORCE Leave-One-Out): Online RL training with generation-based rewards
Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

Core Commands

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

Full training:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub

Train with LoRA adapters:

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
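
The --lora_r and --lora_alpha flags above control the adapter rank and its scaling. As a rough sketch, assuming the standard LoRA formulation (each adapted weight of shape d_out x d_in gains two low-rank factors, and the update is scaled by lora_alpha / lora_r), the helpers below estimate what those values imply. The function names and the 896 x 896 example layer are illustrative assumptions, not part of TRL.

```python
def lora_scaling(lora_r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_r


def lora_param_count(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one adapted linear layer.

    Factor A has shape (r, d_in) and factor B has shape (d_out, r).
    """
    return lora_r * (d_in + d_out)


# Values taken from the command above: --lora_r 32 --lora_alpha 16
scaling = lora_scaling(32, 16)

# Hypothetical square projection layer, used only for illustration.
adapter_params = lora_param_count(896, 896, 32)
full_params = 896 * 896

print(f"scaling={scaling}")
print(f"adapter params={adapter_params} vs full layer={full_params}")
```

With rank 32 and alpha 16 the update is scaled by 0.5, which is one reason LoRA runs often tolerate a higher learning rate than full fine-tuning.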

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

Full training:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns

Train with LoRA adapters:

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
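
To make the objective behind trl dpo concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single chosen/rejected pair. It assumes the summed log-probabilities of each response under the policy and the frozen reference model are already available; the function name and the beta value of 0.1 are assumptions for illustration, and TRL's trainer supports further loss variants that are not shown.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; check your trainer config
) -> float:
    """Sigmoid DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy now prefers the chosen response more than the reference does.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one's.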

trl grpo - Group Relative Policy Optimization

Train models with rewards computed on their own generations, using reward functions or an LLM-as-a-judge to score each completion.

Basic usage:

trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2.5-0.5B-GRPO \
  --push_to_hub
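
The "group relative" part of GRPO can be sketched in a few lines. The example below assumes the commonly described formulation, in which the rewards for all completions of one prompt are normalized by the group mean and standard deviation. The function name, the epsilon value, and the use of the population standard deviation are assumptions for illustration; TRL's implementation has additional options and may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct completions get positive advantages

# If every completion earns the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second call shows why a reward function that returns the same score for every completion stalls training: all advantages collapse to zero.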

trl rloo - Reinforce Leave One Out

Online RL training where the model generates text and receives rewards based on custom criteria.

Basic usage:

trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2.5-0.5B-RLOO \
  --push_to_hub
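
The "leave one out" idea in RLOO is that each sampled completion is compared against the average reward of the other completions for the same prompt. A minimal sketch under that assumption follows; the function name is hypothetical, and the real trainer applies further steps (for example KL penalties) that are omitted here.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


# Four completions for one prompt with scalar rewards.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline excludes the sample's own reward, it stays unbiased while still reducing variance.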

trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.

Full training:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048

Train with LoRA adapters:

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
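
Reward models trained on chosen/rejected pairs are typically optimized with a pairwise (Bradley-Terry style) loss. The sketch below assumes that formulation and works on two scalar scores; the function name is illustrative, and any extra terms a specific trainer version may add are not shown.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    diff = chosen_score - rejected_score
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def pair_is_correct(chosen_score: float, rejected_score: float) -> bool:
    """Pairwise accuracy counts a pair as correct when chosen outranks rejected."""
    return chosen_score > rejected_score


print(pairwise_reward_loss(0.0, 0.0))  # no separation: loss is log(2)
print(pairwise_reward_loss(2.0, 0.0))  # chosen ranked higher: smaller loss
print(pair_is_correct(2.0, 0.0))
```

Only the difference between the two scores matters, so the absolute scale of a reward model's outputs is not meaningful on its own.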

Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.

Example config (sft_config.yaml):

model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio

Launch with config:

trl sft --config sft_config.yaml

Override config values:

trl sft --config sft_config.yaml --learning_rate 1.0e-5
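
The override behavior can be pictured as a simple merge in which command-line values win over values from the file. This is only a conceptual sketch of that precedence, assuming it matches the override example above; resolve_config is a hypothetical helper, not TRL's actual argument parser.

```python
def resolve_config(file_values: dict, cli_overrides: dict) -> dict:
    """Merge config-file values with CLI overrides; CLI values take precedence."""
    resolved = dict(file_values)
    # Skip options that were not passed on the command line (None).
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved


file_values = {"learning_rate": 2.0e-5, "num_train_epochs": 1, "use_peft": True}
cli_overrides = {"learning_rate": 1.0e-5, "num_train_epochs": None}

print(resolve_config(file_values, cli_overrides))
```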

Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.

Multi-GPU training:

trl sft \
  --config sft_config.yaml \
  --num_processes 4

Use predefined Accelerate configs:

TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3

trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2

Custom Accelerate config:

# Generate custom config
accelerate config

# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml

Fully Sharded Data Parallel (FSDP):

trl sft --config sft_config.yaml --accelerate_config fsdp2

DeepSpeed ZeRO:

trl sft --config sft_config.yaml --accelerate_config zero3
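
When choosing between the zero1, zero2, and zero3 profiles, it helps to estimate how much model-state memory each stage leaves on a single GPU. The sketch below follows the usual ZeRO accounting for mixed-precision Adam (about 16 bytes per parameter: 2 for half-precision weights, 2 for gradients, 12 for optimizer states). It ignores activations, buffers, and fragmentation, and the function name is an assumption for illustration.

```python
def zero_model_state_gib(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GiB) for weights, gradients, and optimizer states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards parameters
    return num_params * (weights + grads + optim) / 2**30


params = 500_000_000  # roughly a 0.5B-parameter model
for stage in (0, 1, 2, 3):
    print(stage, round(zero_model_state_gib(params, num_gpus=4, stage=stage), 2))
```

Activation memory is often the larger share for long sequences, so treat these numbers as a lower bound rather than a full budget.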

Troubleshooting

CUDA Out of Memory

Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
Enable --use_peft for LoRA training
Use --gradient_checkpointing to save memory
Try a smaller model, or truncate sequences to a shorter maximum length (for example, a lower --max_length)
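
The first tip works because the number of examples per optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of processes. A small sketch of that arithmetic follows; the helper names are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_processes: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_processes


def rebalance_grad_accum(per_device: int, grad_accum: int, new_per_device: int) -> int:
    """Gradient accumulation steps that keep the same per-process batch."""
    total = per_device * grad_accum
    if total % new_per_device != 0:
        raise ValueError("new per-device batch size must divide the current total")
    return total // new_per_device


# Settings from the SFT example: 2 per device, 8 accumulation steps.
print(effective_batch_size(2, 8))

# Halve the per-device batch to save memory; accumulate twice as long.
print(rebalance_grad_accum(2, 8, new_per_device=1))
```

Keeping the product constant preserves the optimization behavior while lowering peak memory, at the cost of more forward and backward passes per step.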

Dataset Loading Issues

Verify dataset exists: check Hugging Face Hub or local path
Check dataset format matches expected columns
Use --dataset_config for multi-config datasets
Inspect the dataset in Python: from datasets import load_dataset; ds = load_dataset("<dataset_name>"); print(ds)
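
A quick way to catch format mismatches before launching a job is to compare the dataset's column names against what the training method expects. The mapping below is an assumption based on TRL's commonly documented dataset formats (for example "messages" or "text" for SFT, "chosen" and "rejected" for preference data, "prompt" for online methods); confirm it against the documentation for your installed version. The helper name is hypothetical.

```python
# Assumed column requirements per method; each entry lists acceptable column sets.
EXPECTED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"chosen", "rejected"}],
    "reward": [{"chosen", "rejected"}],
    "grpo": [{"prompt"}],
    "rloo": [{"prompt"}],
}


def has_expected_columns(method: str, column_names: list[str]) -> bool:
    """Return True if the columns satisfy at least one accepted layout."""
    columns = set(column_names)
    return any(required <= columns for required in EXPECTED_COLUMNS[method])


print(has_expected_columns("dpo", ["prompt", "chosen", "rejected"]))
print(has_expected_columns("sft", ["question", "answer"]))
```

With the datasets library, the column list would typically come from something like ds["train"].column_names.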

Model Loading Issues

Verify model exists on Hugging Face Hub
Check if gated model requires authentication: hf auth login
For local models, provide absolute path
Ensure sufficient disk space and memory

Slow Training

Enable --packing so that short sequences are packed together instead of padded
Use larger --per_device_train_batch_size if memory allows
Enable --tf32 for faster matrix math on Ampere or newer NVIDIA GPUs
Use --bf16 on supported hardware
Consider multi-GPU training with --num_processes
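
To see why packing helps with short sequences, the sketch below groups tokenized examples into fixed-size blocks using a first-fit-decreasing heuristic. This illustrates the general idea only; TRL's own packing strategy may differ, and the function name and the 1024-token block size are assumptions.

```python
def pack_lengths(lengths: list[int], max_length: int) -> list[list[int]]:
    """Group example indices into bins whose total length fits in max_length."""
    bins: list[dict] = []
    # Place the longest examples first (first-fit decreasing).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_length:
            raise ValueError(f"example {i} is longer than max_length")
        for b in bins:
            if b["free"] >= n:
                b["items"].append(i)
                b["free"] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append({"free": max_length - n, "items": [i]})
    return [b["items"] for b in bins]


token_lengths = [600, 300, 200, 100, 400]
packed = pack_lengths(token_lengths, max_length=1024)
print(packed)
print(f"{len(token_lengths)} padded rows become {len(packed)} packed rows")
```

Fewer rows with less padding means more useful tokens per step, which is where the speedup comes from.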

Generation Issues (GRPO/RLOO)

Check prompt format in dataset
Adjust --temperature and --top_p for generation
Verify that the reward function returns sensible scores on a few hand-checked completions
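
When tuning --temperature and --top_p, it helps to recall what each does to the next-token distribution. The sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution. It is a simplified illustration with hypothetical function names, not the generation code used by TRL or Transformers.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized


logits = [2.0, 1.0, 0.5, -1.0]
print(apply_temperature(logits, 1.0))
print(apply_temperature(logits, 0.5))  # lower temperature sharpens the distribution
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

For GRPO and RLOO, sampling that is too sharp yields near-identical completions within a group, which leaves little reward variation to learn from.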

Additional Resources

Documentation: https://huggingface.co/docs/trl
GitHub: https://github.com/huggingface/trl
Examples: https://github.com/huggingface/trl/tree/main/examples

Best Practices

Start with SFT: Fine-tune base (non-instruct) models with SFT before preference alignment; the DPO examples above start from an already instruction-tuned checkpoint
Use LoRA for efficiency: Enable --use_peft for faster training and lower memory
Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
Save checkpoints: TRL writes checkpoints to --output_dir; control how often with --save_strategy and --save_steps
Test on small datasets first: Verify pipeline works before full training
Use configuration files: Create YAML configs for reproducibility
Leverage Accelerate: Use multi-GPU training for faster iteration

When helping users with TRL:

Always check which training method is appropriate for their use case
Verify dataset format matches the expected schema
Recommend starting with smaller models for testing
Suggest LoRA for resource-constrained environments
Point to specific documentation sections for advanced features

Limitations

Use this skill only when the task clearly matches its upstream product or API scope.
Verify commands, API behavior, pricing, quotas, credentials, and deployment effects against current official documentation before making changes.
Do not treat generated examples as a substitute for environment-specific tests, security review, or user approval for destructive or costly actions.

Info

Category Artificial Intelligence

Name trl-training

Version v20260701

Size 9.12KB

Source sickn33/antigravity-awesome-skills

Updated At 2026-07-02