Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL
TL;DR, because you have models to train and we respect that:
- Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B model in bf16, that is 14 GB. For a frontier 1T model checkpoint, it is on the order of a terabyte. Per step.
- It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical, and at least 98% even in the worst case. The actual delta is tiny.
- We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to between 20 and 35 MB.
- The cherry on top: we ran a fully disaggregated training run where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.
Async RL just got a lot cheaper. Read on.
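To make the idea in the bullets concrete, here is a minimal sketch of a sparse delta over bf16 bit patterns, in plain Python. It is not the TRL implementation: the real PR works on tensors and writes a sparse safetensors file. The function names (`encode_delta`, `apply_delta`, `dense_bytes`, `sparse_bytes`) and the 4-byte index size are assumptions made up for this illustration.

```python
# Hypothetical sketch: weights are modeled as lists of 16-bit integers,
# each standing in for the raw bit pattern of one bf16 value.

def encode_delta(old, new):
    """Return (indices, values) for positions whose bit pattern changed."""
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return indices, values


def apply_delta(old, indices, values):
    """Rebuild the new weights from the old ones plus the sparse delta."""
    out = list(old)
    for i, v in zip(indices, values):
        out[i] = v
    return out


def dense_bytes(n_params, bytes_per_param=2):
    """Size of a full bf16 weight transfer (2 bytes per parameter)."""
    return n_params * bytes_per_param


def sparse_bytes(n_changed, index_bytes=4, value_bytes=2):
    """Rough delta size, ASSUMING one 4-byte index per changed value.

    The on-disk layout of the real sparse file may differ.
    """
    return n_changed * (index_bytes + value_bytes)


# Toy example: 1000 weights, 3 of which change between two steps.
old = [0x3F80] * 1000            # 0x3F80 is the bf16 bit pattern of 1.0
new = list(old)
for i in (3, 500, 999):
    new[i] ^= 1                  # flip the lowest mantissa bit

indices, values = encode_delta(old, new)
restored = apply_delta(old, indices, values)

print(len(indices), "of", len(old), "elements changed")
print("round trip ok:", restored == new)
print("dense 7B payload in GB:", dense_bytes(7_000_000_000) / 1e9)
```

The receiver only needs the previous weights plus the `(indices, values)` pair to reconstruct the new weights exactly, which is why the payload scales with the number of changed elements rather than with model size.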