vLLM V0 to V1: Correctness Before Corrections in RL

Rafael Pardinas, Ehsan Kamalloo

PipelineRL uses vLLM as the inference engine for rollout generation. The
inference engine samples tokens and returns token logprobs; the trainer uses
those logprobs to compute policy ratios, KL, clip rate, entropy, and reward.
Any discrepancy in how those logprobs are computed can change the training
dynamics. This is the train-inference mismatch we needed to eliminate during
the vLLM V0 to V1 migration.
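To make the dependence concrete, here is a minimal PyTorch sketch (not PipelineRL's actual code) of how sampled-token logprobs typically feed the policy ratio, clip rate, and an approximate KL; entropy needs the full next-token distribution, not just the sampled token's logprob, so it is omitted here. All names (`ppo_stats`, `trainer_logprobs`, `rollout_logprobs`, `advantages`, `clip_eps`) are illustrative assumptions.

```python
import torch

def ppo_stats(trainer_logprobs: torch.Tensor,
              rollout_logprobs: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2):
    """Per-token PPO statistics built from two sets of logprobs.

    trainer_logprobs: logprobs the trainer recomputes for the sampled tokens
    rollout_logprobs: logprobs the inference engine returned at sampling time
    """
    # Policy ratio pi_theta / pi_old. Any train-inference logprob
    # discrepancy shifts this ratio directly, even with unchanged weights.
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)

    # Clipped surrogate objective (PPO-style).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.minimum(unclipped, clipped).mean()

    # Fraction of tokens where the clip is active.
    clip_rate = ((ratio < 1.0 - clip_eps) | (ratio > 1.0 + clip_eps)).float().mean()

    # k3 estimator of KL(pi_old || pi_theta) from samples drawn from pi_old.
    log_ratio = trainer_logprobs - rollout_logprobs
    approx_kl = (torch.exp(log_ratio) - log_ratio - 1.0).mean()

    return loss, clip_rate, approx_kl
```

With identical weights and a consistent backend, `ratio` should be 1 and `approx_kl` near 0 at the first optimization step; a systematic offset there is the signature of the mismatch described above.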

TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things:
processed rollout logprobs, V1-specific runtime defaults, the inflight
weight-update path, and the fp32 lm_head used for the final projection. We
fixed the backend behavior before changing the RL objective.
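On the last of those four items, the following is a minimal sketch, assuming a bf16 model whose final projection is a plain linear `lm_head`, of what computing logits in fp32 can look like; the actual hook PipelineRL installs is not shown in this excerpt, and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def fp32_logits(hidden: torch.Tensor, lm_head_weight: torch.Tensor) -> torch.Tensor:
    # Upcast both operands so the matmul and the resulting logits are fp32;
    # bf16 logits can round differently on the trainer and inference sides,
    # which then shows up as a logprob mismatch.
    return F.linear(hidden.float(), lm_head_weight.float())

# Tiny usage example with dummy bf16 tensors (batch=1, seq=8, d_model=64, vocab=1000).
hidden = torch.randn(1, 8, 64, dtype=torch.bfloat16)
weight = torch.randn(1000, 64, dtype=torch.bfloat16)
logprobs = torch.log_softmax(fp32_logits(hidden, weight), dim=-1)
```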

The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Figure 1
shows the final result. The red run is the initial V1 attempt, and the green
run is the final V1 run after the four fixes.
