Recursive Reasoning: AI's Next Scaling Law Is Not Bigger, But Deeper

🤖 AI Quick Take

When pre-training yields diminishing returns, AI’s next order-of-magnitude breakthrough will come from self-iteration at inference time—not from brute-force parameter scaling.

📋 Article Metadata

Published: 2026-05-02
Type: posts
Tags: AI, recursive-reasoning, test-time-compute, scaling-law, latent-reasoning, ARC-AGI-2

Core Thesis: When the marginal returns of pre-training diminish, AI’s next order-of-magnitude breakthrough will come from “self-iteration during inference,” not from “brute-force parameter scaling.”

I. The Biggest Breakthrough Is Not From Bigger Models

On the YC Podcast, investor Peter Steinberger said something that made the room go quiet:

“The real breakthrough isn’t making models bigger, it’s making them think longer at test time.”

In other words: The real game-changer is not building larger models, but making models think longer and deeper during inference.

The impact of this statement lies in its direct challenge to the most deeply held belief in the AI industry over the past three years—the Scaling Law. We have grown accustomed to this narrative: stack more parameters, feed more data, burn more GPUs, and the model will naturally become smarter. The leap from GPT-3 to GPT-4 seemed to prove this point.

But signals in 2025 are increasingly clear: the marginal returns of pre-training are diminishing. The same computational investment now yields a flattening curve of capability gains. While the industry is still debating “when the next trillion-parameter model will arrive,” a new curve has already begun to rise—Test-Time Compute Scaling, or recursive reasoning.

If stacking parameters is not the answer, then what is?

The answer is: Let the model invoke itself during inference, thinking iteratively like a human.

II. Recursive Reasoning: Not an Improvement on CoT, But a Paradigm Shift

To understand recursive reasoning, one must first see what it is not.

Chain of Thought (CoT) was the first breakthrough. It allowed models to “speak out” their reasoning process, like writing down steps when solving a math problem. But CoT has a fundamental limitation: it is linear, single-pass, and irreversible. The model writes from left to right; once an intermediate step goes wrong, the entire reasoning may collapse.

Recursive reasoning takes an entirely different path.

In February 2025, a paper titled Scaling up test-time compute with latent reasoning: A recurrent depth approach (arXiv:2502.05171) offered a key insight: Truly efficient reasoning happens in the model’s hidden state space, not in token space.

What does this mean?

Imagine two painters. The first painter (CoT) must paint stroke by stroke on the canvas; each stroke must be visible and readable. If a stroke is wrong, they can only continue painting or use more strokes to cover it up. The second painter (latent reasoning) first constructs the complete image in their mind—adjusting composition, modifying light and shadow, trying different color palettes—all this “thinking” happens in an invisible mental space. Only when the image is fully mature in their mind do they put brush to canvas.

Latent reasoning is AI’s “mental composition.” The model iterates repeatedly in hidden state space, self-corrects, explores multiple reasoning paths in parallel, and ultimately outputs only the optimal result as readable tokens. This is not an upgraded version of CoT; it is a paradigm shift from “speaking thought” to “silent thought.”
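The loop described above can be sketched in a few lines. This is a toy illustration only, not the architecture from arXiv:2502.05171: the "hidden state" is a single float, and `core_block`, `latent_reason`, and `answer_error` are hypothetical names invented here. The point it shows is structural: one shared block is applied repeatedly to an internal state, nothing is emitted along the way, and the answer improves as the iteration count grows.

```python
def core_block(hidden: float, x: float) -> float:
    """One shared refinement step; a toy stand-in for a reused network block."""
    # Newton's update for sqrt(x); assumes x > 0 and hidden > 0.
    return 0.5 * (hidden + x / hidden)


def latent_reason(x: float, num_iterations: int) -> float:
    """Iterate silently on the hidden state, then emit a single answer."""
    hidden = 1.0  # arbitrary initial hidden state
    for _ in range(num_iterations):
        hidden = core_block(hidden, x)  # same block, applied again
    return hidden  # only the final state is "decoded" into output


def answer_error(x: float, num_iterations: int) -> float:
    """Distance between the emitted answer and the true square root."""
    return abs(latent_reason(x, num_iterations) - x ** 0.5)


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        print(depth, answer_error(2.0, depth))
```

The parameter count (one block) never changes; only `num_iterations`, the compute spent at inference time, does.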

III. Hard Validation: Record-Breaking Results on ARC-AGI-2

No matter how elegant the concept, it needs hard validation. In 2025, recursive reasoning achieved breakthrough progress on one of AI’s most rigorous benchmarks—ARC-AGI-2.

ARC-AGI-2, initiated by Keras author François Chollet, is considered the “gold standard” for testing AI abstract reasoning. It does not test stored knowledge or pattern memorization, but rather the ability to grasp abstract rules from minimal examples and apply them flexibly: the core of human intelligence and the Achilles’ heel of traditional large models.

The solver developed by the Poetiq AI team (poetiq-ai/poetiq-arc-agi-solver) achieved record-breaking results on this benchmark. Their method was not to train a larger model to “remember” more patterns, but to dynamically search for optimal reasoning paths at test time—allowing the model to recursively try different strategies, evaluate intermediate results, backtrack, and re-explore when facing each specific problem.
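To make "try a strategy, evaluate it, backtrack, re-explore" concrete, here is a minimal sketch on a toy puzzle. It is not the Poetiq solver's method, which this article does not detail; the primitives, the `search` function, and the example task are all assumptions made for illustration. The search composes small transformations depth-first, checks each candidate program against the given examples, and backtracks when the depth budget runs out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Grid = Tuple[int, ...]
Primitive = Tuple[str, Callable[[Grid], Grid]]
Example = Tuple[Grid, Grid]

# Hypothetical building blocks the solver may compose.
PRIMITIVES: List[Primitive] = [
    ("reverse", lambda g: tuple(reversed(g))),
    ("increment", lambda g: tuple(v + 1 for v in g)),
    ("double", lambda g: tuple(v * 2 for v in g)),
]


def run(program: Sequence[Primitive], grid: Grid) -> Grid:
    """Apply each primitive in order."""
    for _, fn in program:
        grid = fn(grid)
    return grid


def fits(program: Sequence[Primitive], examples: Sequence[Example]) -> bool:
    """Evaluate a candidate: does it reproduce every example output?"""
    return all(run(program, inp) == out for inp, out in examples)


def search(
    examples: Sequence[Example],
    max_depth: int,
    program: Tuple[Primitive, ...] = (),
) -> Optional[List[str]]:
    """Depth-first search with backtracking, bounded by max_depth."""
    if fits(program, examples):
        return [name for name, _ in program]
    if len(program) == max_depth:
        return None  # dead end: caller backtracks and tries the next primitive
    for primitive in PRIMITIVES:
        found = search(examples, max_depth, program + (primitive,))
        if found is not None:
            return found
    return None


EXAMPLES: List[Example] = [((1, 2, 3), (8, 6, 4)), ((0, 5), (12, 2))]

if __name__ == "__main__":
    print(search(EXAMPLES, max_depth=2))  # budget too small for this task
    print(search(EXAMPLES, max_depth=3))
```

Raising `max_depth` is the test-time compute knob here: the same fixed set of primitives solves the task only once the search is allowed to go deep enough.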

Meanwhile, a paper published by the DeepSeek team in April 2025, Inference-Time Scaling for Generalist Reward Modeling (arXiv:2504.02495), supported this trend from another angle. The authors showed that even generalist reward models can significantly improve performance by dynamically allocating more computational resources at test time. This suggests recursive reasoning is not a trick for a specific task, but a generalizable paradigm for capability expansion.
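One simple way to spend more compute per judgment is to sample a noisy judge several times and take a majority vote. The sketch below simulates that idea only; it is not the mechanism from arXiv:2504.02495, and the judge, its 70% accuracy, and every function name are assumptions chosen for illustration.

```python
import random


def noisy_judgment(truth: bool, accuracy: float, rng: random.Random) -> bool:
    """One sampled verdict: correct with probability `accuracy`."""
    return truth if rng.random() < accuracy else not truth


def voted_judgment(
    truth: bool, accuracy: float, num_samples: int, rng: random.Random
) -> bool:
    """Sample the judge num_samples times and return the majority verdict."""
    yes_votes = sum(
        noisy_judgment(truth, accuracy, rng) for _ in range(num_samples)
    )
    return yes_votes * 2 > num_samples  # use an odd num_samples to avoid ties


def estimate_accuracy(
    num_samples: int, accuracy: float = 0.7, trials: int = 2000, seed: int = 0
) -> float:
    """Fraction of simulated trials where the voted verdict is correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        truth = rng.random() < 0.5  # hidden ground-truth label
        if voted_judgment(truth, accuracy, num_samples, rng) == truth:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    for k in (1, 5, 15):
        print(k, estimate_accuracy(k))
```

The judge itself never changes between runs; only the number of samples per decision does, which is the sense in which a reward model can scale at inference time.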

Two independent lines of evidence point to the same conclusion: Test-time compute scaling has already proven its value on the most rigorous benchmarks.

There is a contrast worth savoring: On the ARC-AGI-2 leaderboard, some extremely small specialized models defeated general-purpose large models thousands of times larger through compute scaling—that is, by investing more inference rounds and search depth at test time. This is not “brute force”; it is “clever computation triumphs over brute force.” It reveals a counterintuitive fact: on tasks requiring abstract reasoning, computational investment during inference may be more decisive than the model’s parameter count itself.

IV. From Data Centers to Edge Devices: The Diffusion Path of Recursive Reasoning

Whether a technology trend is truly established depends on whether it can diffuse from the lab to real-world scenarios. Recursive reasoning is showing surprising diffusion speed.

On edge devices, the recursive micro-network project stockeh/mlx-trm is a landmark. Built on Apple’s MLX framework, it implements recursive depth unfolding of Transformers on Apple Silicon. In principle, this means your MacBook, iPad, or even iPhone can run “deliberative” AI through local test-time compute scaling rather than through cloud-hosted large models.

Agent scenarios are another early landing ground. The DeepRecall engine (kothapavan1998/deeprecall) is specifically designed for AI Agents with a “deep recall” mechanism: when an Agent faces a complex task, it can recursively invoke itself for sub-problem decomposition, reflect on intermediate results, and dynamically adjust strategy. This is no longer a single “input-output” interaction, but a thinking loop capable of self-dialogue and self-correction.
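The recursive self-invocation pattern can be sketched as a function that calls itself on sub-problems under a depth budget. This is not the DeepRecall API, which is not shown in this article; `Task`, `solve`, and the example plan are hypothetical, and the decomposition is given up front rather than produced by a model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    description: str
    subtasks: List["Task"] = field(default_factory=list)


def solve(task: Task, max_depth: int, depth: int = 0) -> str:
    """Solve a task by recursively solving its sub-problems."""
    if not task.subtasks:
        return f"done({task.description})"  # atomic task: answer directly
    if depth >= max_depth:
        # Recursion budget exhausted: report instead of decomposing further.
        return f"skipped({task.description})"
    # Self-invocation: the same routine handles every sub-problem.
    results = [solve(sub, max_depth, depth + 1) for sub in task.subtasks]
    return f"{task.description}[" + ", ".join(results) + "]"


PLAN = Task(
    "ship feature",
    [Task("write code", [Task("design"), Task("implement")]), Task("test")],
)

if __name__ == "__main__":
    print(solve(PLAN, max_depth=1))
    print(solve(PLAN, max_depth=2))
```

The explicit `max_depth` matters in practice: without a budget, a self-calling agent has no guarantee of terminating or of bounded cost.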

Even more intriguing is the Sakana AI survival simulator. In this project, recursively evolving AI Agents exhibit true emergent behavior in complex environments—they do not act according to preset rules, but autonomously learn complex strategies through test-time simulation and trial-and-error. Two Minute Papers put it well when introducing this project: these Agents are “not programmed to solve problems, but empowered to discover solutions themselves.”

V. When AI Learns to “Think in Its Sleep”

The boundaries of recursive reasoning are still expanding rapidly.

In April 2025, a paper titled Sleep-Time Compute: Beyond Inference Scaling at Test-Time (arXiv:2504.13171) proposed a radical concept: sleep-time compute. The core idea is to let the model pre-compute possible reasoning paths and cache results during “idle” periods, thereby achieving instant response during actual inference.
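The pre-compute-and-cache idea can be sketched as follows. This is a deliberately simplified illustration, not the method from arXiv:2504.13171: the class name, the two supported queries, and the call counter are all assumptions, and "reasoning" is replaced by a trivial computation over a known context.

```python
from typing import Dict, Iterable, List


class SleepTimeAssistant:
    """Toy assistant that can pre-compute answers while idle."""

    def __init__(self, context: List[int]) -> None:
        self.context = context
        self.cache: Dict[str, int] = {}
        self.expensive_calls = 0  # counts how often slow reasoning runs

    def _reason(self, query: str) -> int:
        """Stand-in for slow, expensive inference over the context."""
        self.expensive_calls += 1
        if query == "sum":
            return sum(self.context)
        if query == "max":
            return max(self.context)
        raise ValueError(f"unsupported query: {query}")

    def sleep(self, anticipated: Iterable[str]) -> None:
        """Idle-time phase: reason about likely queries and cache the results."""
        for query in anticipated:
            self.cache[query] = self._reason(query)

    def answer(self, query: str) -> int:
        """Serving phase: reuse cached work, fall back to live reasoning."""
        if query in self.cache:
            return self.cache[query]
        return self._reason(query)


if __name__ == "__main__":
    assistant = SleepTimeAssistant([3, 1, 4])
    assistant.sleep(["sum"])
    print(assistant.answer("sum"), assistant.expensive_calls)
    print(assistant.answer("max"), assistant.expensive_calls)
```

The benefit depends entirely on how well the idle phase anticipates real queries; a miss costs the same as having no cache at all.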

This sounds like science fiction, but the logic is clear. Humans consolidate memories and organize thoughts while sleeping; why can’t AI do similar “pre-thinking” during “idle” time? As the boundary between training and inference begins to dissolve, we may need to redefine “thinking” itself—it is no longer a one-time computational process, but a continuous, layered, dynamic system where pre-computation and real-time inference intertwine.

This also has profound implications for the reinforcement learning post-training paradigm. If reward models themselves can improve judgment accuracy through test-time compute scaling, then the entire RLHF (Reinforcement Learning from Human Feedback) pipeline could be reshaped—not by training a static model that “better understands human preferences,” but by having the model invest more computational resources to “understand” context during each judgment.

VI. Conclusion: The Scaling Law Is Not Dead, It Has Just Changed Tracks

Returning to the opening question: Is recursive reasoning replacing parameter scale as the new Scaling Law?

My judgment is: Not replacing, but relaying.

The pre-training Scaling Law is not dead—it has completed its historical mission, pushing AI from “unusable” to “usable.” But the baton for the next leg has already been passed to test-time compute scaling.

Three signals are already clear:

Competition breakthroughs: Record-breaking results on ARC-AGI-2 prove that recursive reasoning can solve problems that traditional methods cannot touch.
Industrial validation: DeepSeek’s reward model scaling proves this is not an isolated case, but a generalizable paradigm.
Edge deployment: From MLX micro-networks to DeepRecall Agents, recursive reasoning is moving out of data centers and into real products.

Of course, recursive reasoning is not a universal key. Its benefits come with real engineering constraints. Time-To-First-Token (TTFT) increases significantly, because the model must complete multiple iterations in hidden state space before emitting the first token. Inference cost rises as well, since each recursive unfolding is real computational overhead. The applicable scope of recursive reasoning therefore has natural boundaries. Returns are highest on structured reasoning tasks such as mathematical proofs, code debugging, and logic puzzles, where multi-round iteration can effectively correct intermediate errors. On generative tasks such as open-domain creative writing and casual conversation, returns are relatively limited, because users are usually unwilling to wait longer for a slight quality improvement.
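A back-of-the-envelope model makes the latency trade-off tangible. The sketch assumes a purely linear cost: a fixed prefill cost plus a constant cost per latent iteration. Both the model and the millisecond figures are hypothetical, not measurements of any real system.

```python
def time_to_first_token_ms(
    prefill_ms: float, per_iteration_ms: float, num_iterations: int
) -> float:
    """Assumed linear model: fixed prefill cost plus one cost per iteration."""
    return prefill_ms + per_iteration_ms * num_iterations


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the shape of the curve.
    for depth in (1, 8, 32):
        print(depth, time_to_first_token_ms(50.0, 20.0, depth))
```

Under these assumed numbers, going from 1 to 32 iterations moves the first token from 70 ms to 690 ms, which is easy to justify for a proof and hard to justify for small talk.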

Finally, I want to leave you with a question—not an answer, but an open inquiry:

When AI can think recursively without limit, when “thinking” is no longer constrained by the time boundaries of a single forward pass, does the definition of “thinking” itself need to be rewritten?

Human thinking is constrained by biological time, energy, and attention. AI thinking may be breaking through these limits. This is not a question of whether AI will surpass humans—it is a question of what the essence of intelligence is when “thinking” becomes a computational resource that can be arbitrarily expanded.

And this question may be worth pondering more than any technical breakthrough.

References

Steinberger, P. (2025). YC Podcast interview. Core quote: “The real breakthrough isn’t making models bigger, it’s making them think longer at test time.”
Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv:2502.05171. https://arxiv.org/abs/2502.05171
DeepSeek. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495. https://arxiv.org/abs/2504.02495
Sleep-Time Compute: Beyond Inference Scaling at Test-Time. arXiv:2504.13171. https://arxiv.org/abs/2504.13171
Poetiq AI. poetiq-arc-agi-solver. https://github.com/poetiq-ai/poetiq-arc-agi-solver
stockeh. mlx-trm. https://github.com/stockeh/mlx-trm
kothapavan1998. deeprecall. https://github.com/kothapavan1998/deeprecall
Two Minute Papers. Sakana AI’s Survival Simulator Is Brilliant. https://www.youtube.com/watch?v=QzZ4VwDHAT4
Chollet, F. ARC-AGI-2. https://github.com/arcprize/ARC-AGI-2
