Using reinforcement learning to endow large language models (LLMs) with reasoning abilities is indeed effective, but it comes at a high cost.
Such models generate a long chain of thought (LongCoT) before answering questions; moreover, increasing the number of “thinking tokens” can enhance the model’s capabilities. As with any reinforcement learning problem, there is an environment that determines how trajectories are generated.
For reasoning LLMs, this environment is quite simple and often overlooked: the state is formed by concatenating the prompt with the reasoning tokens generated so far, while the action is the next token sampled from the policy (i.e., the reasoning LLM).
This design seems elegant, but it leaves the state size unbounded: the state keeps growing as the thinking process lengthens. For attention-based policies, this means the total computational cost grows quadratically with the thinking length.
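To make the growth concrete, here is a minimal sketch of the LongCoT environment as described above. The class and function names (`LongCoTState`, `rollout_state_sizes`) and the integer token IDs are illustrative assumptions, not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LongCoTState:
    """Hypothetical LongCoT state: the prompt plus every reasoning token so far."""
    prompt: Tuple[int, ...]
    reasoning: Tuple[int, ...] = ()

    def step(self, action: int) -> "LongCoTState":
        # The transition only appends the sampled token; nothing is ever dropped.
        return LongCoTState(self.prompt, self.reasoning + (action,))

    def __len__(self) -> int:
        return len(self.prompt) + len(self.reasoning)


def rollout_state_sizes(prompt: Sequence[int], actions: Sequence[int]) -> List[int]:
    """Return the state size before the first action and after each action."""
    state = LongCoTState(tuple(prompt))
    sizes = [len(state)]
    for action in actions:
        state = state.step(action)
        sizes.append(len(state))
    return sizes


sizes = rollout_state_sizes(prompt=[101, 102, 103], actions=[7, 8, 9, 10])
print(sizes)  # [3, 4, 5, 6, 7]
```

Each action enlarges the state by one token, so the context the policy must read at step t has length proportional to t.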
To reduce the computational cost of long thinking in reasoning LLMs, many methods have been proposed, including using objective functions with length regularization, pruning, or early stopping methods.
Recently, a joint research team from multiple institutions, including Mila and Microsoft Research, took a different approach and asked a different question: what if the environment never caused a quadratic increase in computational cost in the first place?
They proposed a new paradigm in which the policy conducts reasoning based on a fixed-size state. They named such a policy the Markovian Thinker.

Paper title: The Markovian Thinker
Paper link: https://arxiv.org/abs/2510.06557v1
Model link: https://huggingface.co/collections/McGill-NLP/the-markovian-thinker-68debd2919c4ae47f50706cd
Code repository: https://github.com/McGill-NLP/the-markovian-thinker
Amirhossein Kazemnejad, one of the three co-first authors of this study, said on X that the effectiveness of Delethink opens up room for innovation in reinforcement learning environments for thinking. Additionally, the extent and effectiveness of Markovian thinking suggest that reasoning LLMs could be built differently, perhaps using non-quadratic architectures.

The Markovian Thinker
The core idea of the Markovian Thinker is to restructure the reinforcement learning formulation so that the effective state size read by the policy is bounded regardless of the total thinking length. The direct impact is profound: a longer thinking process requires only compute that is linear in the thinking length and memory that is constant, thus decoupling the two questions of “how long the model thinks” and “how much context it must process.”
They instantiated this idea through the Delethink paradigm. It is a reinforcement learning environment that guides Markovian behavior by organizing the reasoning process into a series of fixed-size chunks.

Delethink redefines the reinforcement learning environment for thinking as a chunked, Markovian process: generation proceeds in fixed-size chunks. At the boundary of each chunk, the environment resets the context to a new prompt that includes the original query and a short carryover from the previous chunk.
This forces the policy to learn to advance its thinking across chunks by maintaining a textual state, thus creating a “Markovian Thinker.”
In contrast, the LongCoT environment concatenates tokens without limit, so its state (and the model context) keeps growing as the trajectory lengthens.
The pseudocode of Algorithm 1 shows the training process for a single query.
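Algorithm 1 itself is not reproduced here. As a rough illustration of the chunked generation described above, the following sketch covers only the rollout, not the policy update. The function name `delethink_rollout`, the parameters `chunk_size`, `carry_size`, and `max_chunks`, the `EOS` value, and the toy policy are all hypothetical choices for this example, not the paper's implementation.

```python
from typing import Callable, List, Tuple

EOS = -1  # hypothetical end-of-sequence token id


def delethink_rollout(
    query: List[int],
    generate: Callable[[List[int], int], List[int]],
    chunk_size: int,
    carry_size: int,
    max_chunks: int,
) -> Tuple[List[int], int]:
    """Generate a reasoning trace chunk by chunk with a bounded context.

    `generate(context, max_new_tokens)` stands in for the policy (assumed API).
    Returns the full trace and the largest context length ever seen.
    """
    if not 0 < carry_size < chunk_size:
        raise ValueError("need 0 < carry_size < chunk_size")
    trace: List[int] = []
    carry: List[int] = []
    peak_context = 0
    for _ in range(max_chunks):
        # Context reset: only the query and a short carryover survive.
        context = query + carry
        budget = chunk_size - len(carry)
        new_tokens = generate(context, budget)
        trace.extend(new_tokens)
        peak_context = max(peak_context, len(context) + len(new_tokens))
        if not new_tokens or new_tokens[-1] == EOS:
            break
        # The tail of this chunk becomes the textual state for the next one.
        carry = new_tokens[-carry_size:]
    return trace, peak_context


def toy_policy(context: List[int], max_new_tokens: int) -> List[int]:
    """Toy stand-in for an LLM: keeps counting upward and stops after 20."""
    last = context[-1]
    out: List[int] = []
    while len(out) < max_new_tokens:
        last += 1
        if last > 20:
            out.append(EOS)
            break
        out.append(last)
    return out


trace, peak = delethink_rollout([0], toy_policy, chunk_size=8, carry_size=2, max_chunks=10)
print(len(trace), peak)  # 21 9
```

In this toy run the trace reaches 21 tokens, yet the policy never sees more than 9 tokens at once (the 1-token query plus one 8-token chunk), because the toy policy can resume counting from the carryover alone.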

For more details, please refer to the original paper. In short, with this design, both the generation stage and the backpropagation stage for updating the policy scale linearly with thinking length in Delethink, whereas in LongCoT they scale quadratically. The following figure shows how the FLOPs, memory, backpropagation time, and generation time of LongCoT and Delethink change when the thinking length increases from n tokens to nS tokens.
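The linear-versus-quadratic claim can be checked with a simplified cost model. The sketch below counts only how many context positions each newly generated token attends to during decoding. It ignores prefill and all non-attention FLOPs, so it is an illustrative assumption rather than the paper's accounting, and the function and parameter names are hypothetical.

```python
def longcot_attention_cost(prompt_len: int, think_len: int) -> int:
    """Positions attended over a LongCoT rollout: the context grows every step."""
    return sum(prompt_len + t for t in range(think_len))


def delethink_attention_cost(
    prompt_len: int, think_len: int, chunk_size: int, carry_size: int
) -> int:
    """Positions attended over a chunked rollout with a context reset per chunk."""
    if not 0 <= carry_size < chunk_size:
        raise ValueError("need 0 <= carry_size < chunk_size")
    cost = 0
    produced = 0
    carry = 0  # the first chunk starts from the query alone
    while produced < think_len:
        budget = min(chunk_size - carry, think_len - produced)
        # Within a chunk the context restarts at prompt + carryover.
        cost += sum(prompt_len + carry + t for t in range(budget))
        produced += budget
        carry = carry_size
    return cost


for think_len in (1000, 2000):
    print(
        think_len,
        longcot_attention_cost(0, think_len),
        delethink_attention_cost(0, think_len, chunk_size=100, carry_size=50),
    )
```

Under this model, doubling the thinking length from 1000 to 2000 tokens roughly quadruples the LongCoT count but only roughly doubles the chunked count, which matches the quadratic and linear scaling described above.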

Remarkable Results
The team's experiments show that Delethink is highly effective. Even when reasoning in 8K-token chunks, the DeepSeek R1-Distill 1.5B model trained with Delethink can think for up to 24K tokens. Under the same 24K thinking budget, its performance on mathematical benchmarks matches and even exceeds that of LongCoT-RL.


In terms of test-time scaling, Delethink continues to improve after the performance of LongCoT-RL saturates, bringing additional gains.

Furthermore, they trained the R1-Distill 1.5B model with Delethink to think up to 96K tokens; with only a few additional training steps, it achieved an accuracy of 49% on AIME’24, and the average length of its problem-solving process was 36K tokens.

The benefit of linear compute is substantial: based on experimental data, they estimated that for an average thinking length of 94K tokens, LongCoT-RL training requires 27 H100-months, while Delethink requires only 7 H100-months.

Why Does It Work?
To explore why Delethink training is effective, they also analyzed the model’s performance during the reinforcement learning initialization stage.
They observed that the R1-Distill series models (1.5B to 14B) can sample Markovian trajectories zero-shot, without any additional training or prompting, and even recover most of the performance of standard LongCoT.

This strong initialization (i.e., a large number of in-distribution positive samples that conform to the expected behavior) provides a favorable starting point for reinforcement learning.
They further studied reasoning models with up to 120B parameters in the Delethink environment. For example, GPT-OSS 120B (Agarwal et al., 2025) shows robust Markovian thinking abilities in multiple domains, such as Ph.D.-level questions, programming tasks, math competitions, and crossword puzzles.
These results together indicate that Delethink is compatible with state-of-the-art models and can scale with them.

Conclusion
The success of Markovian thinking indicates that decoupling the thinking length from the context size can, in principle, enable the next generation of reasoning models to think for millions of tokens. It highlights that the reinforcement learning environment, often regarded as fixed, is actually a powerful lever for driving progress.
This also suggests that sequence architectures with non-quadratic complexity may particularly benefit reasoning models, as the thinking process can be effectively transformed into a Markovian one.
This article is from the WeChat official account “Almost Human” (ID: almosthuman2014), author: Panda. Republished by 36Kr with permission.
 
				