If you’ve been using LLMs, you’ve probably noticed all the buzz around reasoning models. These are the models that “think” before they speak, or at least give the impression of it. But how did we get here, and why the hype?

Historical Background

Long before reasoning models became a product category, we already had Chain-of-Thought prompting: the simple but powerful trick of asking models to “think step by step” (Wei et al., 2023). It worked surprisingly well, especially for math, logic puzzles, and multi-step problems. At the time, this felt less like a new capability and more like learning how to talk to the model in a way that exposed what it already knew.

With the release of models like DeepSeek-R1 in January 2025 (DeepSeek-AI, 2025), reasoning stopped being a prompting hack and became a first-class feature. Suddenly, benchmarks jumped, particularly in math proofs, program synthesis, adversarial reasoning, and uncertainty-heavy domains. The narrative was clear: if you let the model think longer, performance improves. This idea spread fast, both in research and in marketing, because the gains were real and measurable.

If you’ve used ChatGPT or similar systems, you’ll have noticed that activating “Thinking” doesn’t produce an immediate answer. Instead, the model appears to reason first, often summarized in short intermediate notes. The full reasoning trace, however, remains hidden. This is intentional: intermediate thoughts can be messy, unsafe, or misaligned, and revealing them can expose proprietary prompting strategies. Keeping traces private also keeps responses concise and avoids unnecessary latency. As a result, reasoning is heavily advertised and studied, but largely opaque to users in practice.

This opacity feeds the hype. “Reasoning models” sound like a qualitative leap, as if models suddenly acquired a new cognitive faculty. But beyond benchmarks and marketing, it’s often unclear what reasoning actually means at the language level, or how different these models really are from instruction-tuned ones. The tension between real performance gains and fuzzy conceptual understanding is what motivated this post. The goal isn’t to dismiss reasoning models, but to demystify them and explain why they might be less magical (and less universally useful) than you think.

What is reasoning?

At a high level, “reasoning” in modern LLMs is best understood as a shift in where computation happens. In traditional models, providers tried to boost performance by scaling training compute, data or parameters. Once deployed, inference is relatively cheap and fast. Reasoning models, in contrast, deliberately spend more compute at inference time. They generate longer intermediate sequences, explore alternatives, and delay committing to an answer. This is often called inference-time scaling: better answers by thinking longer, not by having learned fundamentally more during training.

Crucially, this does not automatically imply that reasoning models have learned better or more structured internal representations of the world. Reasoning-focused training pipelines don’t fundamentally alter the internal representations of the base model; what changes most visibly is the structure and language patterns the model uses during inference before providing a final answer. Reasoning is primarily expressed as an output-level strategy, not a guarantee of deeper internal understanding. The model is still a next-token predictor, just one that has been trained and instructed to produce intermediate text before committing to a final answer.

If you look at reasoning traces, what you mostly see is not formal logic but reflective language. Phrases like “Hmm”, “But what if…”, “Let’s consider another case” or “Let’s double-check that” appear frequently. These patterns are not accidental: during training, the model sees exactly this kind of epistemic, self-reflective language in contexts of deliberation and problem solving. In effect, it learns to guide itself through the output space by externalizing uncertainty, postponing decisions, and trying multiple thought paths in sequence.

This is why reasoning often looks like search rather than insight. The model is not executing symbolic logic or manipulating explicit representations; it is sampling and revising continuations in text space. Externalizing uncertainty helps avoid early mistakes, and delaying commitment allows the model to consider alternatives before settling on an answer. These are useful heuristics, but they are learned behavioral patterns, not new cognitive machinery.

It’s also important to be clear about what reasoning is not. Reasoning models do not inherently use tools, perform symbolic evaluations, or run explicit algorithms unless those capabilities are separately integrated. The “thinking” happens entirely within the language model’s token generation process. What changes is how much text is generated before an answer, and how that text is structured, not the fundamental nature of the model itself.

How Reasoning Models are trained

Reasoning models are trained to produce a specific response structure: first, an internal reasoning trace often enclosed by special control tokens (e.g., <think>…</think>), followed by a final answer presented without uncertainty or intermediate steps, based on the results of the reasoning process.

Control tokens have a practical purpose: they mark where reasoning begins and ends, making traces easy to parse during training and easy to hide or summarize at inference time.

User Prompt
If x equals 5, y is half of x, and z is the sum of x and y, what is z? Only provide the final answer.
Assistant Response
<think> Okay, let's see. The problem says if x equals 5, y is half of x, and z is the sum of x and y, what is z? Alright, let's break this down step by step. First, they tell us x is 5. So x = 5. Then y is half of x. Half of x would be x divided by 2, right? So y = x/2. Since x is 5, y should be 5 divided by 2. Let me calculate that. 5 divided by 2 is 2.5. So y is 2.5. Now, z is the sum of x and y. Sum means addition, so z = x + y. We know x is 5 and y is 2.5, so adding those together: 5 + 2.5. Let's do that. 5 plus 2 is 7, plus 0.5 is 7.5. So z should be 7.5. Let me double-check to make sure I didn't make a mistake. x is 5, y is half of x, so 5/2 is 2.5. Then z is x + y, which is 5 + 2.5 = 7.5. Yeah, that seems right. I don't think I messed up any steps here. Maybe write it out numerically to confirm. x = 5, y = 5/2 = 2.5, z = 5 + 2.5 = 7.5. Yep, that checks out. So the answer should be 7.5. </think>
Final Answer: 7.5
Sample reasoning trace (346 tokens) of Qwen/Qwen3-30B-A3B-Thinking-2507 for a simple arithmetic question.
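Because the boundaries are explicit, hiding or summarizing the trace becomes a simple post-processing step. As a rough sketch (assuming the <think>…</think> convention and nothing model-specific), a serving layer might split a raw completion like the one above as follows:

```python
import re

def present_response(raw_response: str, preview_chars: int = 80) -> dict:
    """Split a raw completion into the hidden trace and the visible answer.

    Assumes the model wraps its reasoning in <think>...</think> (the exact
    control tokens vary by model family). The trace stays server-side; only
    a short preview is surfaced, similar to the summarized "thinking" notes
    shown in chat UIs.
    """
    match = re.search(r"<think>(.*?)</think>", raw_response, flags=re.DOTALL)
    if match is None:  # no reasoning block at all
        return {"answer": raw_response.strip(), "trace_preview": None}
    trace = match.group(1).strip()
    answer = raw_response[match.end():].strip()
    return {"answer": answer, "trace_preview": trace[:preview_chars] + "…"}

raw = "<think>x = 5, y = 5/2 = 2.5, so z = 5 + 2.5 = 7.5. Double-checked.</think>\nFinal Answer: 7.5"
print(present_response(raw)["answer"])  # -> "Final Answer: 7.5"
```

Chat interfaces that show short “thinking” notes are essentially doing a fancier version of this: keep the full trace private, surface only a preview or summary alongside the answer.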

Importantly, most reasoning models are not architecturally different from standard instruction-tuned LLMs. They are typically further fine-tuned variants built on the same transformer backbone and trained with the same next-token prediction objective. What changes is not the architecture, but how the model is fine-tuned to use language during inference.

Supervised fine-tuning (SFT) is the central mechanism by which reasoning models acquire their characteristic behavior. During SFT, the model is exposed to examples that enforce a desired output structure (often using control tokens) along with reasoning traces written in reflective language (e.g., the OpenThoughts3 dataset). These traces exhibit the patterns commonly associated with “thinking,” such as decomposing problems, revisiting earlier steps, and expressing intermediate uncertainty. Through this process, the model learns not only to produce correct answers, but to follow a specific linguistic trajectory while doing so (Wei et al., 2023; Nye et al., 2021).
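To make this concrete, here is a minimal sketch of what a single SFT example of this kind could look like. The field names and chat markers are illustrative assumptions, not the schema of OpenThoughts3 or any particular pipeline; the point is simply that the reasoning trace is part of the training target, so the model learns to emit it token by token like any other text.

```python
# A single, illustrative SFT example (field names and chat markers are
# assumptions, not the format of any specific dataset). The target completion
# contains both the reasoning trace and the final answer, so the model learns
# to produce the <think>...</think> structure itself.
example = {
    "prompt": "If x equals 5, y is half of x, and z is the sum of x and y, what is z?",
    "completion": (
        "<think>\n"
        "x = 5. y is half of x, so y = 2.5. "
        "z = x + y = 5 + 2.5 = 7.5. Let me double-check: 5 + 2.5 is 7.5. Yes.\n"
        "</think>\n"
        "Final Answer: 7.5"
    ),
}

def to_training_text(ex: dict) -> str:
    """Render the example as one sequence for ordinary next-token prediction;
    the chat template used here is a placeholder."""
    return f"<|user|>\n{ex['prompt']}\n<|assistant|>\n{ex['completion']}"

print(to_training_text(example))
```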

In some training pipelines, SFT may be followed by an additional reinforcement learning phase based on verifiable rewards (RLVR). Unlike reinforcement learning from human feedback (RLHF), which optimizes for human preference judgments, RLVR ties rewards to objectively checkable outcomes such as correct final answers, passing test cases, or valid proofs (Cobbe et al., 2021; Lightman et al., 2023). When applied, this step reinforces reasoning behaviors that reliably lead to correct outcomes, implicitly favoring strategies like exploring multiple solution paths or double-checking results before committing to an answer.
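A toy example of what “verifiable” means here: for math-style tasks, the reward can be as blunt as an exact-match check on the extracted final answer. The sketch below is an illustration under assumptions of my own (the answer-extraction heuristic and the small formatting bonus), not any provider's actual reward function.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Toy reward: 1.0 if the final answer matches the known-correct result,
    plus a small bonus for a well-formed <think>...</think> block.
    The extraction heuristic and shaping are illustrative assumptions."""
    answer_part = response.split("</think>")[-1]           # text after the trace
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_part)  # last number = answer
    correct = bool(numbers) and numbers[-1] == ground_truth
    well_formed = "<think>" in response and "</think>" in response
    return (1.0 if correct else 0.0) + (0.1 if well_formed else 0.0)

sample = "<think>x = 5, y = 2.5, so z = 7.5.</think>\nFinal Answer: 7.5"
print(verifiable_reward(sample, "7.5"))  # -> 1.1
```

Because the reward only looks at checkable outcomes, any behavior inside the trace that raises the chance of a correct, well-formed answer gets reinforced, whether or not the intermediate reasoning is itself sound.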

The result is a model whose learned language patterns push it to think longer and more carefully. Nothing fundamental has changed internally: reasoning is a learned behavior, acquired through fine-tuning and reinforcement, that guides the model to explore, delay commitment, and verbalize uncertainty. It is not a new cognitive mechanism.

Think of it like a preschooler who is capable of simple addition (3+3=6, 7+1=8) but sometimes gets something wrong (4+5=8). Encouraging the preschooler to count on their fingers before answering will improve accuracy. This doesn’t mean the kid suddenly understands arithmetic at a deeper level; they just take more careful steps to reach the answer, which also makes answering take significantly longer.

Pitfalls of Reasoning Models

Despite their strengths, reasoning models come with a set of trade-offs that are easy to overlook when focusing only on benchmark gains.

The most obvious cost is token inflation. Reasoning models generate significantly more text per answer, which directly increases latency and inference cost. This is often acceptable for hard problems, but wasteful for simple ones. Using reasoning models to solve trivial tasks is slower, more expensive, and rarely beneficial. How pronounced this is obviously depends on the model, and some handle it much better: the gpt-oss family, for example, yielded short reasoning traces for trivial questions in my benchmarking. The opposite extreme can be seen below.

User Prompt
How much is 2+2? Respond with the correct answer token A or B, where A is 9 and B is 4. Respond only with A or B, nothing else.
Assistant Response
<think> Okay, so I need to figure out how much 2 plus 2 is. Hmm, let me start by recalling what addition means. Addition is combining two numbers to get a total. So, if I have 2 apples and then get 2 more, I should count them all together. Let me visualize this: imagine I have two oranges on the table and someone gives me two more. Now, how many oranges do I have in total? Let me count them one by one. The first orange is 1, the second is 2, then adding two more, so the third is 3, and the fourth is 4. That makes sense. Alternatively, I can use my fingers to count. I'll hold up two fingers on one hand and two on the other. When I put them together, I have four fingers up, which is four. I remember learning that 2 plus 2 equals 4. It's a basic addition fact that most people learn early on. I also remember seeing it in math problems and exercises, so it's pretty ingrained. To double-check, maybe I can use a number line. Starting at 2, if I move forward two more steps, I land on 4. That confirms it again. Is there any chance I could have made a mistake here? I don't think so because 2 plus 2 is such a straightforward calculation. If I were to use objects like blocks or coins, each set of two would add up to four when combined. Another way to think about it is multiplication, but that's not necessary here. Since it's addition, I don't need that approach. But if I consider doubling a number, doubling 2 would give me 4, which is the same result. So, after considering different methods and visualizations, I'm confident that 2 plus 2 equals 4. That corresponds to token B, as specified in the question. I don't see any reason to doubt this answer given the multiple ways I've confirmed it. </think>
B
DeepSeek-R1-Distill-Qwen-32B reasoning through a simple addition problem. The reasoning trace spans 405 tokens and includes adding apples, then oranges, then literally counting on fingers.
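A rough back-of-the-envelope calculation makes the overhead tangible. The per-token price and decoding speed below are placeholders picked for illustration, not real quotes:

```python
# Rough cost/latency overhead of a reasoning trace versus a direct answer.
# Price and decoding speed are placeholder assumptions, not real quotes.
PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # USD, placeholder
TOKENS_PER_SECOND = 50              # decoding speed, placeholder

def reasoning_overhead(trace_tokens: int, answer_tokens: int) -> dict:
    """Estimate what the hidden trace adds on top of the visible answer."""
    return {
        "extra_cost_usd": trace_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS,
        "extra_latency_s": trace_tokens / TOKENS_PER_SECOND,
        "trace_share": trace_tokens / (trace_tokens + answer_tokens),
    }

# The 405-token trace above, for an answer that is a single token ("B"):
print(reasoning_overhead(trace_tokens=405, answer_tokens=1))
# -> {'extra_cost_usd': 0.00081, 'extra_latency_s': 8.1, 'trace_share': 0.997...}
```

Even in this toy setting, the hidden trace accounts for essentially all of the cost and latency of the request.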

A more subtle issue is reward hacking. During fine-tuning, models learn that sounding reflective is often rewarded. Phrases like “Let’s double-check” or “This seems tricky” become stylistic markers of good behavior, even when they add no real value. In practice, this can lead to overcritical or overthinking behavior: questioning obvious facts, revisiting already-correct steps, or spiraling into unnecessary doubt. In extreme cases, this results in self-induced confusion, looping reasoning traces, or outright failure to converge on an answer.

Longer reasoning also amplifies exposure bias. During training, each reasoning step is conditioned on a clean, human-written or curated prefix. During inference, the model conditions on its own generated thoughts. A small early mistake can propagate through the entire chain of thought, becoming increasingly entrenched as the model elaborates on it.

There are also practical limitations in current state-of-the-art inference frameworks. Reasoning cannot reliably be capped at a given number of tokens, so if generation is cut off by token limits or max-length constraints, you may end up with half a thought and no final answer. This also affects structured decoding, which was introduced to make model outputs reliably parseable, for example by constraining the final output to a JSON schema. That reliability is compromised in reasoning models because the unpredictable length of reasoning traces can interrupt or truncate outputs.
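In practice, this means wrapping reasoning models in defensive parsing. A minimal sketch, assuming the <think>…</think> convention and a JSON-formatted final answer (both of which vary across models), might look like this:

```python
import json

def parse_structured_answer(raw_response: str) -> dict | None:
    """Recover a JSON answer from a reasoning model's raw output, if possible.

    Returns None when the trace was truncated before a final answer appeared,
    or when the answer is not valid JSON, so the caller can retry or fall back.
    Assumes the <think>...</think> convention; adapt to your model's tokens.
    """
    if "<think>" in raw_response and "</think>" not in raw_response:
        # Generation hit a length limit mid-thought: there is no final answer.
        return None
    answer_part = raw_response.split("</think>")[-1].strip()
    try:
        return json.loads(answer_part)
    except json.JSONDecodeError:
        return None

print(parse_structured_answer("<think>still thinking about the schema"))  # None
print(parse_structured_answer('<think>done</think>\n{"z": 7.5}'))         # {'z': 7.5}
```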

Of course, providers are aware of these issues and are working to address them. For example, from what I have seen, the gpt-oss model family was able to keep reasoning traces relatively short on trivial tasks. While I am a huge fan of Mistral and their Ministral models, their reasoning models are the opposite extreme: in my benchmarks they have been severely inconsistent in respecting control tokens and concluding their reasoning, often falling into repetitive reasoning loops that sometimes end in infinite generations. They have also failed to solve tasks that their instruction-tuned counterparts consistently solved, simply by being overcritical.

User Prompt
Below you will see a multiple choice question with multiple answers options. Your task is to select the answer choice you believe is correct by responding with the corresponding label: A, B, C, or D.
Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of p in S_5.
A) 8 B) 2 C) 24 D) 120
Assistant Response
[THINK]2,560 Tokens First, I need to understand what the question is asking. It's about finding the index of the subgroup generated by the permutation \( p \) in the symmetric group \( S_5 \), where \( S_5 \) is the group of all permutations of 5 elements. The permutation \( p \) is given as a product of disjoint cycles: \( p = (1, 2, 5, 4)(2, 3) \). Wait, actually, looking more closely, the notation seems a bit off because the cycle (2, 3) overlaps with the cycle (1, 2, 5, 4), which isn't possible in a standard disjoint cycle decomposition. Maybe it's a typo or misinterpretation. Let me check again. Wait, the notation is given as \( p = (1, 2, 5, 4)(2, 3) \). But in standard cycle notation, cycles are disjoint, meaning they don't share any elements. Here, both cycles contain the element 2, which is a problem. Maybe it's supposed to be \( p = (1, 2, 5, 4)(3, 2) \), but that would still have overlapping elements. Alternatively, perhaps it's a typo and should be \( p = (1, 2, 5, 4)(3, 2) \), but that's not standard either. Wait, no, actually, in standard notation, if you have two cycles that share an element, it's not a valid decomposition, and such a product isn't defined. Maybe it's supposed to be \( p = (1, 2, 5, 4, 3) \), but that's a 5-cycle, and the notation is written differently. Wait, let me look again at the problem: "Let \( p = (1, 2, 5, 4)(2, 3) \) in \( S_5 \)." Hmm, perhaps it's a typo, and it's supposed to be \( p = (1, 2, 5, 4)(3, 2) \), but that's not standard. Alternatively, maybe it's \( p = (1, 2, 5, 4)(2, 3) \), but then the cycles overlap, which is not allowed. Maybe it's a misprint, and the first cycle is \( (1, 2, 5, 4) \) and the second is \( (3, 4) \), but that's not what's written. Wait, perhaps it's supposed to be \( p = (1, 2, 5, 4)(2, 3) \), but then the cycles overlap, which is not standard. Maybe it's a typo, and the correct notation is \( p = (1, 2, 5, 4)(3, 2) \), but even then, 2 is in both cycles. Maybe it's \( p = (1, 2, 5, 4)(3, 5) \), but 5 is in both cycles. Alternatively, maybe it's \( p = (1, 2, 5, 4)(2, 3) \), and the cycles are not disjoint, but then the product isn't defined. Wait, perhaps the notation is incorrect, and it's supposed to be \( p = (1, 2, 5, 4, 3) \), which is a 5-cycle: 1→2→5→4→3→1. Then the cycle decomposition would be a single 5-cycle. But the problem says \( p = (1, 2, 5, 4)(2, 3) \), so maybe it's a product of two cycles: (1 2 5 4) and (2 3). But as mentioned, they both contain 2, which is not allowed in disjoint cycle notation. Wait, perhaps it's a typo, and the first cycle is (1 2 5 4) and the second is (3 2), but that still has overlapping elements. Alternatively, maybe it's (1 2 5 4)(2 3), but then 2 is in both cycles, which is not possible in a standard permutation. Maybe it's a misinterpretation, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Wait, perhaps the notation is incorrect, and it's supposed to be p = (1 2 5 4)(3 2), but that's not standard. Maybe it's p = (1 2 5 4)(2 3), and the cycles are overlapping, which would imply that the permutation is not a product of disjoint cycles, but perhaps it's a typo, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Alternatively, maybe the correct notation is p = (1 2 5 4)(2 3), but then it's not a valid permutation because 2 is in both cycles. Maybe it's a typo, and it's supposed to be p = (1 2 5 4)(3 4), but that's not what's written. 
Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the first cycle is (1 2 5 4) and the second is (3 2), but that's not standard. Alternatively, maybe it's p = (1 2 5 4)(2 3), and the cycles are overlapping, which would mean that the permutation is not a product of disjoint cycles, but perhaps it's a typo, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Wait, perhaps the notation is incorrect, and it's supposed to be p = (1 2 5 4)(3 2), but that's not standard. Maybe it's p = (1 2 5 4)(2 3), and the cycles are overlapping, which would imply that the permutation is not a product of disjoint cycles, but perhaps it's a typo, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Hmm, maybe the notation is intended to be p = (1 2 5 4)(2 3), but since cycles can't overlap, perhaps it's a typo, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Wait, perhaps the correct interpretation is that the permutation p is the product of the two cycles (1 2 5 4) and (2 3), but since they overlap, the product is not defined. Maybe it's a typo, and it's supposed to be p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. 
Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Hmm, perhaps the problem is intended to have p = (1 2 5 4)(3 2), but that's not standard, or maybe it's p = (1 2 5 4)(3 2), but even then, 2 is in both cycles. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Wait, perhaps the problem is correct, and the cycles are not disjoint, but then the product is not defined. Maybe it's a typo, and the correct notation is p = (1 2 5 4)(3 2), but that's not standard. Alternatively, maybe the problem is correct, and the cycles are not disjoint, but then
Reasoning was never concluded
Example of Ministral-3-8B-Reasoning-2512 first being overcritical, questioning correct notation, then engaging in cyclic reasoning loops and ending in infinite generation. In comparison, Ministral-3-8B-Instruct-2512 was consistently able to answer this question correctly.

This makes them unreliable in systems that depend on predictable structure, and highlights that reasoning behavior is still a fragile, learned convention rather than a robust capability.

Finally, inference-time scaling shows diminishing returns: beyond a point, “thinking longer” mostly adds noise and instability rather than accuracy. Reasoning models are powerful tools, but they are not free, and they are not universally better. Like most things in machine learning, they work best when applied deliberately, with a clear understanding of both what they add and what they take away.

References

[deepseek2025]  DeepSeek-AI  (2025)  DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948
[wei2023]  Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.  (2023)  Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
[cobbe2021]  Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J.  (2021)  Training Verifiers to Solve Math Word Problems. arXiv:2110.14168
[lightman2023]  Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.  (2023)  Let's Verify Step by Step. arXiv:2305.20050
[nye2021]  Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., Odena, A.  (2021)  Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114