torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It speeds up RL research by letting you focus on algorithms while the library handles distributed training, inference, and weight synchronization.
Choose torchforge when you need:
Consider alternatives when:
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - Episode, Group dataclasses                            │
│ - Service interfaces (async/await)                      │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── Trainer (TorchTitan FSDP)                           │
│ ├── Generator (vLLM inference)                          │
│ ├── Reference Model (frozen KL baseline)                │
│ └── Reward Actors (compute rewards)                     │
└─────────────────────────────────────────────────────────┘
# Create environment
conda create -n forge python=3.12
conda activate forge
# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh
# Verify
python -c "import torch, forge, vllm; print('OK')"
# ROCm: use the ROCm install script instead
./scripts/install_rocm.sh
# Run the SFT example
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
# Run the GRPO example
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
Use this workflow for training reasoning models with group-relative advantages.
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"
dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true
training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4
grpo:
  n_samples: 8  # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1  # KL penalty coefficient
  temperature: 0.7
services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
# rewards.py
# Reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward
import re

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract answer from response (braces escaped so they read as literals)
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
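Exact string comparison, as in `CustomMathReward`, scores `72.0` or `1,000` as wrong when the target is `72` or `1000`. A minimal sketch of a more forgiving variant follows. `NumericMathReward` and `_to_number` are hypothetical names, not part of `forge.data.rewards`, and the `(prompt, response, target)` call signature is simply copied from the example above.

```python
import re
from typing import Optional


def _to_number(text: str) -> Optional[float]:
    """Parse a numeric answer, ignoring thousands separators and a dollar sign."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


class NumericMathReward:
    """Hypothetical reward: 1.0 if the last boxed answer matches the target."""

    def __init__(self, tol: float = 1e-6):
        self.tol = tol

    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Use the last \boxed{...} so earlier scratch work is ignored.
        matches = re.findall(r"\\boxed\{([^}]+)\}", response)
        if not matches:
            return 0.0
        answer = matches[-1].strip()
        a, t = _to_number(answer), _to_number(target)
        if a is not None and t is not None:
            return 1.0 if abs(a - t) <= self.tol else 0.0
        # Fall back to exact string comparison for non-numeric answers.
        return 1.0 if answer == target.strip() else 0.0
```

Taking the last boxed expression is a design choice: it rewards the model's final answer rather than an intermediate one, at the cost of accepting responses that box several guesses.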
python -m apps.grpo.main --config config/grpo_math.yaml
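The "group-relative advantages" this workflow relies on can be sketched in a few lines. Each prompt gets `n_samples` responses, and each response's reward is standardized against the other responses to the same prompt. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the GRPO app may compute this differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, n_samples = 8 responses, binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Here the two correct responses get an advantage of about 1.73 and the six incorrect ones about -0.58. A group in which every response earns the same reward yields all-zero advantages and therefore no policy-gradient signal.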
Use this workflow to implement new RL algorithms.
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Importance ratio, taken here against the reference policy
        log_ratio = logprobs - ref_logprobs
        ratio = torch.exp(log_ratio)
        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
        # KL penalty: non-negative estimator exp(-x) + x - 1 of KL(policy || ref),
        # with x = log_ratio. Using (ref_logprobs - logprobs) directly has the wrong sign.
        kl = torch.exp(-log_ratio) + log_ratio - 1
        # Apply mask and average over unmasked tokens (guard against an all-zero mask)
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum().clamp(min=1)
        return loss
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
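The GRPO config earlier in this document sets separate `clip_low: 0.2` and `clip_high: 0.28` values, while `CustomLoss` takes a single symmetric `clip_range`. A minimal sketch of the asymmetric form is shown here. The function name and the `old_logprobs` argument are hypothetical, and the source does not show how the GRPO app itself consumes these two keys.

```python
import torch


def clipped_surrogate(
    logprobs: torch.Tensor,
    old_logprobs: torch.Tensor,
    advantages: torch.Tensor,
    clip_low: float = 0.2,
    clip_high: float = 0.28,
) -> torch.Tensor:
    """Per-token clipped policy-gradient loss with separate lower and upper bounds."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Pessimistic (min) objective, negated so that lower is better.
    return -torch.min(ratio * advantages, clipped * advantages)


# Toy check: ratios 0.5, 1.0, and 1.5 with a positive advantage of 1.
old = torch.zeros(3)
new = torch.log(torch.tensor([0.5, 1.0, 1.5]))
loss = clipped_surrogate(new, old, advantages=torch.ones(3))
```

With a positive advantage, only the upper bound binds: the ratio of 1.5 is capped at 1.28, while the ratio of 0.5 passes through unclipped because the `min` already picks the smaller term. A larger `clip_high` therefore lets the policy raise the probability of good tokens further per update.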
Use this workflow for scaling to multiple GPUs or nodes.
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"
parallelism:
  tensor_parallel_degree: 2  # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2
services:
  generator:
    procs: 2  # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
# 8 GPU setup
python -m apps.grpo.main \
    --config config/distributed.yaml \
    --trainer.procs 4 \
    --generator.procs 4
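Before submitting a job, it can help to check that the `services` block fits the hardware. The sketch below sums `procs * num_replicas` over GPU-backed services. It assumes one GPU per process, which the configs here imply but do not state, and `gpus_required` is a hypothetical helper rather than a torchforge function.

```python
def gpus_required(services: dict[str, dict]) -> int:
    """Sum procs * num_replicas over services that request GPUs.

    Assumes one GPU per process (an assumption, not a documented guarantee).
    """
    total = 0
    for spec in services.values():
        if spec.get("with_gpus", False):
            total += spec.get("procs", 1) * spec.get("num_replicas", 1)
    return total


services = {
    "generator": {"procs": 2, "num_replicas": 1, "with_gpus": True},
    "trainer": {"procs": 2, "num_replicas": 1, "with_gpus": True},
}
print(gpus_required(services))  # 4
```

With the `--trainer.procs 4 --generator.procs 4` overrides, the same function returns 8, which matches the 8 GPU setup above.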
torchforge uses dictionary-based batches for training:
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]
# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]
# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
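The schema above lists tensor types rather than values. As a rough illustration, the sketch below builds one `targets` entry from two variable-length responses. The tensor shapes, the pad value of 0, the `[batch, 1]` layout for advantages, and the `pad_to_tensor` helper are all assumptions; the mask convention (1 for real tokens, 0 for padding) follows how `padding_mask` multiplies the loss elsewhere in this document.

```python
import torch


def pad_to_tensor(rows: list[list[float]], pad_value: float, dtype: torch.dtype) -> torch.Tensor:
    """Right-pad variable-length rows into a [batch, max_len] tensor."""
    max_len = max(len(r) for r in rows)
    out = torch.full((len(rows), max_len), pad_value, dtype=dtype)
    for i, row in enumerate(rows):
        out[i, : len(row)] = torch.tensor(row, dtype=dtype)
    return out


# Two sampled responses of different lengths (toy values).
response_ids = [[11, 12, 13], [21, 22]]
ref_logprobs = [[-0.5, -1.0, -0.2], [-0.7, -0.3]]
advantages = [1.7, -0.6]  # one scalar per response

response = pad_to_tensor(response_ids, pad_value=0, dtype=torch.long)
targets = [{
    "response": response,
    "ref_logprobs": pad_to_tensor(ref_logprobs, pad_value=0.0, dtype=torch.float32),
    # [batch, 1] so the scalar advantage broadcasts over token positions.
    "advantages": torch.tensor(advantages).unsqueeze(-1),
    # 1.0 marks real tokens, 0.0 marks padding.
    "padding_mask": pad_to_tensor([[1.0] * len(r) for r in response_ids], 0.0, torch.float32),
}]
```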
Generated output from vLLM:
from dataclasses import dataclass

@dataclass
class Completion:
    text: str              # Generated text
    token_ids: list[int]   # Token IDs
    logprobs: list[float]  # Log probabilities
    metadata: dict         # Custom metadata
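One common use of these fields is to turn per-token log probabilities into a sequence-level score. The sketch below uses a stand-in `Completion` with the same four fields so that it runs without torchforge installed. It assumes `logprobs` holds exactly one entry per generated token, which the source does not confirm, and `sequence_logprob` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Completion:  # stand-in mirroring the fields listed above
    text: str
    token_ids: list[int]
    logprobs: list[float]
    metadata: dict = field(default_factory=dict)


def sequence_logprob(c: Completion) -> float:
    """Log-probability of the whole completion: the sum of its token log-probs."""
    if len(c.token_ids) != len(c.logprobs):
        raise ValueError("token_ids and logprobs must have the same length")
    return sum(c.logprobs)


c = Completion(text="42", token_ids=[19, 7], logprobs=[-0.25, -0.5])
total = sequence_logprob(c)  # -0.75
```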
Loss functions are in the forge.losses module:
from forge.losses import SimpleGRPOLoss, ReinforceLoss
# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)
# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
from forge.losses.reinforce_loss import ReinforceLoss
# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
Symptoms: "Insufficient GPU resources" error
Solutions:
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
# Remove ref_model (uses generator weights)
Or use CPU for reference model:
ref_model:
  with_gpus: false
Symptoms: CUDA OOM in vLLM
Solutions:
# Reduce batch size
grpo:
  n_samples: 4  # Reduce from 8
# Or reduce sequence length
training:
  seq_len: 2048
Symptoms: Long pauses between training and generation
Solutions:
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
# Or reduce sync frequency
training:
  sync_interval: 10  # Sync every 10 steps
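The effect of `sync_interval` can be pictured as a simple step gate. The sketch below shows which training steps would trigger a weight push under that reading. `should_sync` is a hypothetical helper, and how torchforge schedules syncs internally is not shown in the source.

```python
def should_sync(step: int, sync_interval: int) -> bool:
    """True on steps where trainer weights would be pushed to the generator."""
    return sync_interval > 0 and step % sync_interval == 0


sync_steps = [s for s in range(1, 31) if should_sync(s, sync_interval=10)]
print(sync_steps)  # [10, 20, 30]
```

The trade-off is staleness: between syncs the generator samples from weights that are up to `sync_interval - 1` steps old, so the training data becomes more off-policy as the interval grows.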
Symptoms: Entropy drops to zero, reward stops improving
Solutions:
# Increase KL penalty
grpo:
  beta: 0.2  # Increase from 0.1
# Or add entropy bonus
training:
  entropy_coef: 0.01
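To tell whether entropy is actually collapsing, it helps to log the entropy of the policy's next-token distribution during training. The sketch below computes it for a single position from raw logits in plain Python so that it runs without a GPU stack. `token_entropy` is a hypothetical helper; in a real run the same quantity would be computed on tensors and averaged over unmasked tokens.

```python
import math


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries, whose contribution is 0 in the limit.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = token_entropy([0.0, 0.0, 0.0, 0.0])  # equals log(4)
peaked = token_entropy([20.0, 0.0, 0.0, 0.0])  # close to 0
```

A value drifting toward zero across steps is the symptom described above: the policy puts nearly all mass on one token, so sampled groups stop differing and the advantages vanish.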