Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

TL;DR, because you have models to train and we respect that:

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.

It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98% in the worst case). The actual delta is tiny.

We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to 20 to 35 MB.

The cherry on top: we ran a full disaggregated training where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.

Async RL just got a lot cheaper. Read on.

Two ways to

To finish reading, please visit source site