Author List: Ziyang Zhang*, Xinheng Ding*, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu†
*Co-First Authors. †Corresponding Author.
Date: 2025/11/25
<aside> 💡
TL;DR In modern RL training pipelines, discrepancies between the rollout engine (e.g., vLLM and SGLang) and the training engine (e.g., FSDP) create a well-known training–inference mismatch. These inconsistencies implicitly convert on-policy RL into off-policy learning and may even trigger catastrophic training instability. Recent advances have attempted to mitigate this by enforcing operator parity between training and inference components and by introducing batch-invariant kernels.
Yet, a critical factor has been largely overlooked: numerical deviations induced by mismatched tensor-parallel (TP) sizes. This mismatch is prevalent and inherently unavoidable in RL pipelines. To address this, we propose Tree-Based Invariant Kernels (TBIK), which ensure bit-wise identical results across different TP sizes and enable genuine on-policy RL at scale.
</aside>

Figure 1. RL training results. After applying TBIK, the KL divergence between the rollout engine and the training engine drops to zero, and we observe faster entropy reduction and higher rewards than with either the vanilla baseline or batch-invariant kernels alone.
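For readers unfamiliar with how the KL term in Figure 1 is measured, the sketch below shows one common way to estimate it per token from the log-probabilities the two engines assign to the same sampled completions. The function name and tensor layout are illustrative assumptions, not our exact implementation.

```python
import torch

def rollout_train_kl(rollout_logprobs: torch.Tensor,
                     train_logprobs: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi_rollout || pi_train) over sampled tokens.

    Each input holds the log-probability that one engine assigns to the *same*
    sampled token at each position; `mask` marks valid (non-padding) tokens.
    """
    log_ratio = rollout_logprobs - train_logprobs  # zero only if the engines agree
    return (log_ratio * mask).sum() / mask.sum()
```

When the two engines are bit-wise identical, `log_ratio` is exactly zero at every position, which is what the flat KL curve in Figure 1 reflects.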
In our LLM evaluation reproducibility report, we found that changing the batch size or the number of GPUs can substantially alter a model's reasoning trace, because the order of floating-point arithmetic differs across configurations. The training–inference mismatch has also attracted increasing attention in the RL community. In summary, several common factors lead to nondeterminism in LLM serving systems and RL training pipelines:
- Batch-size-dependent kernel configurations: serving engines select different tile sizes (e.g., BLOCK_SIZE for MatMul) to optimize performance. This alters the order of floating-point computations, resulting in numerical divergence.
- Tensor parallelism, which requires All-Reduce operations within Row Parallel layers, also affecting the computation order.
- Fused operators such as FusedRMSNorm adopted to optimize inference speed, which differ numerically from the training-side implementations.
- Atomic operations (e.g., atomicAdd) and NCCL All-Reduce, which introduce inherent randomness because the accumulation order depends on dynamic GPU thread scheduling.

Recent work has made great strides in solving LLM nondeterminism. Notably, Thinking Machines Lab introduced Batch-Invariant Operations (BIO) in their blog post, which ensure deterministic outputs across different batch sizes. Specifically, by parallelizing computation strictly along the batch dimension and fixing the block size used for kernel computation, they ensure that the arithmetic for a specific request remains independent of the batch size. Following this, both vLLM and SGLang have integrated BIO into their frameworks. In addition, vLLM and SGLang further "address" the training–inference mismatch in RL by strictly enforcing the use of identical operators across the training and rollout stages.
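The common thread behind these factors is that floating-point addition is not associative: splitting the same reduction differently changes the low-order bits. The toy sketch below (plain PyTorch on CPU, purely illustrative and not taken from any of the frameworks above) shows the effect with a single sum:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)

# Mathematically identical sums, accumulated in different orders.
full    = x.sum()                                   # one global reduction
split_2 = sum(chunk.sum() for chunk in x.chunk(2))  # 2 partial sums, then combine
split_8 = sum(chunk.sum() for chunk in x.chunk(8))  # 8 partial sums, then combine

# The low-order bits usually differ, because float addition is not associative.
# Different BLOCK_SIZEs, fused kernels, or All-Reduce trees shuffle the
# accumulation order of "the same" computation in exactly this way.
print(full.item(), split_2.item(), split_8.item())

# Batch-invariant kernels pin the split along the batch dimension so a given
# request's reduction order never changes; the split induced by the
# tensor-parallel size is a separate axis that this alone does not control.
```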
But here is the catch: while BIO fixes the batch dimension, it overlooks another major source of variance in production pipelines, Tensor Parallel (TP) size mismatch.

Figure 2. Different frameworks and tensor-parallel settings lead to noticeable probability discrepancies for the same model, making it difficult to achieve stable and truly on-policy reinforcement learning.
In real-world settings, TP size is not fixed: it depends on the available hardware and on performance optimization goals. For instance, the same model may run with TP=2 on one machine but with TP=8 on another to improve inference speed when more GPUs are available. This creates a "TP size mismatch" across deployment environments.
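To make this concrete, the sketch below emulates the partial-sum structure of a Row Parallel linear layer under two TP sizes on a single device (plain PyTorch; the shapes and helper name are arbitrary assumptions). The math is identical, but the reduction tree is not:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 4096)
w = torch.randn(4096, 4096)

def row_parallel_linear(x, w, tp_size):
    """Emulate the math of a Row Parallel linear layer: each "rank" owns a
    slice of the reduction (input) dimension, computes a partial matmul, and
    the partial outputs are summed -- the role NCCL All-Reduce plays on GPUs."""
    x_shards = x.chunk(tp_size, dim=-1)
    w_shards = w.chunk(tp_size, dim=0)
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    return torch.stack(partials).sum(dim=0)  # accumulation order depends on tp_size

y_tp2 = row_parallel_linear(x, w, tp_size=2)
y_tp8 = row_parallel_linear(x, w, tp_size=8)

print(torch.equal(y_tp2, y_tp8))           # usually False: not bit-wise identical
print((y_tp2 - y_tp8).abs().max().item())  # a small but nonzero per-layer gap,
                                           # which compounds across many layers
```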
It is important to note that this mismatch is prevalent and inherently unavoidable in Reinforcement Learning (RL) pipelines, where the rollout and training stages adopt different frameworks with completely different workloads: