Author List: Ziyang Zhang*, Xinheng Ding*, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu†
*Co-First Authors. †Corresponding Author.
Date: 2025/11/25
<aside> 💡
TL;DR In modern RL training pipelines, discrepancies between the rollout engine (e.g., vLLM and SGLang) and the training engine (e.g., FSDP) create a well-known training–inference mismatch. These inconsistencies implicitly convert on-policy RL into off-policy learning and may even trigger catastrophic training instability. Recent advances have attempted to mitigate this by enforcing operator parity between training and inference components and by introducing batch-invariant kernels.
Yet, a critical factor has been largely overlooked: numerical deviations induced by mismatched tensor-parallel (TP) sizes. This mismatch is prevalent and inherently unavoidable in RL pipelines. To address this, we propose Tree-Based Invariant Kernels (TBIK), which ensure bit-wise identical results across different TP sizes and enable genuine on-policy RL at scale.
</aside>

Figure 1. RL training results. After applying TBIK, the KL divergence between the rollout engine and the training engine drops to zero, and we observe faster entropy reduction and higher rewards than with either the vanilla baseline or batch-invariant kernels alone.
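For readers unfamiliar with how the KL term in Figure 1 is measured, the sketch below shows one common way to estimate it per token from the log-probabilities the two engines assign to the same sampled completions. The function name and tensor layout are illustrative assumptions, not our exact implementation.

```python
import torch

def rollout_train_kl(rollout_logprobs: torch.Tensor,
                     train_logprobs: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi_rollout || pi_train) over sampled tokens.

    Each input holds the log-probability that one engine assigns to the *same*
    sampled token at each position; `mask` marks valid (non-padding) tokens.
    """
    log_ratio = rollout_logprobs - train_logprobs  # zero only if the engines agree
    return (log_ratio * mask).sum() / mask.sum()
```

When the two engines are bit-wise identical, `log_ratio` is exactly zero at every position, which is what the flat KL curve in Figure 1 reflects.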
In our LLM evaluation reproducibility report, we found that changing the batch size or the number of GPUs can substantially alter a model's reasoning trace, because the order of floating-point arithmetic differs across configurations. The training–inference mismatch has also attracted increasing attention in the RL community. In summary, several common factors lead to nondeterminism in LLM serving systems and RL training pipelines:
- Batch-size-dependent kernel configurations: serving engines select different tile sizes (e.g., BLOCK_SIZE for MatMul) to optimize performance. This alters the order of floating-point computations, resulting in numerical divergence.
- Tensor parallelism, which requires All-Reduce operations within Row Parallel layers, also affecting the computation order.
- Fused operators such as FusedRMSNorm adopted to optimize inference speed, which differ numerically from the training-side implementations.
- Atomic operations (e.g., atomicAdd) and NCCL All-Reduce, which introduce inherent randomness because the accumulation order depends on dynamic GPU thread scheduling.

Recent work has made great strides in solving LLM nondeterminism. Notably, Thinking Machines Lab introduced Batch-Invariant Operations (BIO) in their blog post, which ensure deterministic outputs across different batch sizes. Specifically, by parallelizing computation strictly along the batch dimension and fixing the block size used for kernel computation, they ensure that the arithmetic for a specific request remains independent of the batch size. Following this, both vLLM and SGLang have integrated BIO into their frameworks. In addition, vLLM and SGLang further "address" the training–inference mismatch in RL by strictly enforcing the use of identical operators across the training and rollout stages.
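The common thread behind these factors is that floating-point addition is not associative: splitting the same reduction differently changes the low-order bits. The toy sketch below (plain PyTorch on CPU, purely illustrative and not taken from any of the frameworks above) shows the effect with a single sum:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)

# Mathematically identical sums, accumulated in different orders.
full    = x.sum()                                   # one global reduction
split_2 = sum(chunk.sum() for chunk in x.chunk(2))  # 2 partial sums, then combine
split_8 = sum(chunk.sum() for chunk in x.chunk(8))  # 8 partial sums, then combine

# The low-order bits usually differ, because float addition is not associative.
# Different BLOCK_SIZEs, fused kernels, or All-Reduce trees shuffle the
# accumulation order of "the same" computation in exactly this way.
print(full.item(), split_2.item(), split_8.item())

# Batch-invariant kernels pin the split along the batch dimension so a given
# request's reduction order never changes; the split induced by the
# tensor-parallel size is a separate axis that this alone does not control.
```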
But here is the catch: while BIO fixes the batch dimension, it overlooks another major source of variance in production pipelines, Tensor Parallel (TP) size mismatch.

Figure 2. Different frameworks and tensor-parallel settings lead to noticeable probability discrepancies for the same model, making it difficult to achieve stable and truly on-policy reinforcement learning.
In real-world settings, TP size is not fixed: it depends on the available hardware and on performance optimization goals. For instance, the same model may run with TP=2 on one machine but with TP=8 on another to improve inference speed when more GPUs are available. This creates a "TP size mismatch" across deployment environments.
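To make this concrete, the sketch below emulates the partial-sum structure of a Row Parallel linear layer under two TP sizes on a single device (plain PyTorch; the shapes and helper name are arbitrary assumptions). The math is identical, but the reduction tree is not:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 4096)
w = torch.randn(4096, 4096)

def row_parallel_linear(x, w, tp_size):
    """Emulate the math of a Row Parallel linear layer: each "rank" owns a
    slice of the reduction (input) dimension, computes a partial matmul, and
    the partial outputs are summed -- the role NCCL All-Reduce plays on GPUs."""
    x_shards = x.chunk(tp_size, dim=-1)
    w_shards = w.chunk(tp_size, dim=0)
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    return torch.stack(partials).sum(dim=0)  # accumulation order depends on tp_size

y_tp2 = row_parallel_linear(x, w, tp_size=2)
y_tp8 = row_parallel_linear(x, w, tp_size=8)

print(torch.equal(y_tp2, y_tp8))           # usually False: not bit-wise identical
print((y_tp2 - y_tp8).abs().max().item())  # a small but nonzero per-layer gap,
                                           # which compounds across many layers
```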
It is important to note that this mismatch is prevalent and inherently unavoidable in Reinforcement Learning (RL) pipelines, where the rollout and training stages adopt different frameworks with completely different workloads: