The Bitter Lesson: Extra Bitter, No Sugar — Why Some Base LLM Models Choke on RL
Introduction: Another Day, Another Bitter Cup of RL
Recently, we’ve witnessed significant improvements in using reinforcement learning (RL) to enhance large language model (LLM) reasoning. Most of these advancements start from a pretrained or supervised fine-tuned (SFT) model — the “base model” that RL then builds on. But here’s where things get interesting — and somewhat bitter: recent findings highlight stark differences in how various base models respond to RL training.
Some models undergo dramatic “Aha! moments,” rapidly gaining new, emergent abilities post-RL, while others don’t. For instance, models like Qwen exhibit significant reflection behaviors with RL fine-tuning, whereas the LLaMA family often struggles. But why?
Section 1: Sergey Serves the Bitterness — Actually, the Bitterest of Lessons
This phenomenon reminds me of Sergey Levine’s perspective in his talk, “The Bitterest of Lessons” (I highly recommend watching it), which builds on Richard Sutton’s original bitter lesson: “The two methods that scale arbitrarily are learning and search.”
Levine categorizes these two concepts explicitly:
- Learning: Extracting patterns from data.
- Search: Computationally making rational inferences through optimization.
Here’s the crucial insight from the talk: RL is fundamentally about search — a computational optimization to find better solutions. That puzzled me at first. Why is RL search?
Section 2: Well, Bitterness Is Connected (SFT → Learning, RL → Search)
Let’s put the puzzle aside for a moment and simply map SFT and RL onto these two concepts. The differences between SFT and RL that we see in the models align surprisingly well with Levine’s distinction:
- SFT (Learning) extracts knowledge from human-generated data, fully dependent on data quality and diversity.
- RL (Search) optimizes policies computationally, searching beyond what data explicitly provides.
We can represent this succinctly:
Learning (SFT) → Learned policy/behaviors
↓
Search (RL) → Optimal policy/behaviors
RL explicitly seeks an optimal or near-optimal solution given a dataset through computational inference. Therefore, factors like test-time compute, policy optimization, and inference effectiveness become essential considerations.
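To make the distinction concrete, here is a minimal toy sketch in Python. Everything in it is made up (the prompt, the tiny "policy", the reward): the only point is the shape of the two objectives — learning fits the policy to answers that already exist in the data, while search scores whatever the policy itself produces.

```python
# Toy contrast between Learning (SFT) and Search (RL). All data here is made up.
import math, random

# A toy "policy": a distribution over two candidate answers for one prompt.
policy = {"2+2=": {"4": 0.6, "5": 0.4}}
human_data = [("2+2=", "4")]                    # what Learning gets to imitate
reward = lambda x, y: 1.0 if y == "4" else 0.0  # what Search gets to optimize

# Learning (SFT): maximize the likelihood of the human answers.
# The objective never mentions answers that are absent from the data.
sft_loss = -sum(math.log(policy[x][y]) for x, y in human_data)

# Search (RL): sample the policy's own answers and score them with the reward.
# The objective depends on the reward function, not on reference answers.
answers = policy["2+2="]
samples = random.choices(list(answers), weights=list(answers.values()), k=1000)
rl_objective = sum(reward("2+2=", y) for y in samples) / len(samples)

print(f"SFT loss: {sft_loss:.3f}, RL objective (expected reward): {rl_objective:.3f}")
```

The first objective is bounded by what the data contains; the second is bounded only by what the reward can score.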
Section 3: Bitter Math — Why RL Is Optimization (Yes, Math Tastes Bitter)
But can we rigorously justify calling RL search, given the apparent gap between the two? Here I will offer my two cents on why RL is a search method, and potentially a scalable one. Let’s add some math. Math is bitter but cool. We want to formally outline the relaxation from a strict “search problem” to “optimization (approximate search).”
Step 1: Formal Search Problem
Formally, a search problem can be defined as a triple
$$(X, \; Y, \; R)$$
where X is the set of instances (inputs), Y is the set of potential solutions, and R ⊆ X × Y is a binary relation indicating which solutions are valid.

We can then state the search problem as:
$$\text{Given } x \in X, \ \text{find } y \in Y \ \text{such that } (x, y) \in R.$$
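To ground the definition, here is a minimal toy instance of a strict search problem in Python (the sets and instances are made up for illustration): a finite X and Y, a binary relation R, and exact search that scans for a valid y.

```python
# A toy strict search problem (X, Y, R). Everything here is made up for illustration.

X = ["2+2", "3+5", "7*6"]            # instances (inputs)
Y = [str(n) for n in range(50)]      # candidate solutions

def R(x: str, y: str) -> bool:
    """Binary relation: is y a valid solution for instance x?"""
    return eval(x) == int(y)         # fine for these fixed toy strings

def exact_search(x: str):
    """Strict search: return some y with R(x, y), or None if no valid solution exists."""
    for y in Y:
        if R(x, y):
            return y
    return None

print(exact_search("7*6"))   # -> "42"
```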
Step 2: Relaxed Approximate Search (Optimization)
In practical scenarios, exact solutions often don’t exist or are impractical. We relax the binary relation into a scoring (objective) function f. Instead of a binary condition, we define a continuous function:
$$f : X \times Y \rightarrow [0, 1]$$
The original binary relation R can thus be seen as a special case where R = 1 maps to f = 1 and R = 0 maps to f < 1. When exact search is not possible, we simply relax the requirement from “f equals 1” to “f is as large as possible.” In other words, among all candidate solutions, which y best approximates a valid solution for x? The relaxed problem thus becomes an optimization:
$$y^*(x) = \arg\max_{y \in Y} f(x, y)$$
Now it looks more like RL, right?
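Continuing the same toy example (again, purely illustrative numbers), the relaxation replaces the binary check with a score and picks the best candidate by optimization, exactly as in the formula above:

```python
# Relaxed (approximate) search over the same toy candidates: argmax of a score f.

Y = [str(n) for n in range(50)]      # candidate solutions, as in the previous sketch

def f(x: str, y: str) -> float:
    """Score in (0, 1]: equals 1.0 exactly when y solves x, and decays with the error."""
    return 1.0 / (1.0 + abs(eval(x) - int(y)))

def approximate_search(x: str) -> str:
    """Optimization as search: return the candidate with the highest score."""
    return max(Y, key=lambda y: f(x, y))

print(approximate_search("7*6"))     # -> "42", recovered without ever checking R directly
```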
Step 3: RL as Sequential Approximate Search
Reinforcement Learning fits naturally here. It considers sequential decision-making:
- Instances x become environment states s.
- Solutions y become actions or policies.
- The objective f becomes the value function Q or the cumulative reward R(τ).
Here we focus specifically on policy optimization, since that is what everyone uses in practice. We are then optimizing a policy π for maximum expected cumulative reward:
$$\pi^* = \arg\max_{\pi} \ \mathbb{E}_{\tau \sim \pi}\big[\, R(\tau) \,\big]$$
In other words, RL inherently performs optimization-based search over the space of policies π, trying to find the best one given the data and the reward.
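As a minimal sketch of this view (a toy one-step problem with four actions and a binary reward; made-up numbers, not any specific LLM setup), a plain REINFORCE-style update literally searches over policies for the one with maximum expected reward:

```python
# REINFORCE-style sketch: optimization-based search over policies. Toy setup only.
import math, random

random.seed(0)
NUM_ACTIONS = 4
logits = [0.0] * NUM_ACTIONS                   # policy parameters (uniform at start)
reward = lambda a: 1.0 if a == 2 else 0.0      # action 2 is the "correct behavior"

def action_probs():
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

lr = 0.5
for step in range(200):
    probs = action_probs()
    a = random.choices(range(NUM_ACTIONS), weights=probs, k=1)[0]  # sample the policy
    r = reward(a)
    # Policy gradient: d log pi(a) / d logit_i = 1[i == a] - pi(i)
    for i in range(NUM_ACTIONS):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(action_probs())  # probability mass concentrates on action 2: the search found it
```

Note that in this toy the rewarded action starts with 25% probability, so sampling stumbles onto it quickly; that caveat is exactly where the hypotheses in Section 4 come in.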
Extra Notes: GRPO and Math Reasoning
In the setting of recent papers on GRPO and math reasoning, we don’t even use an accumulated reward. The optimization targets only the final answer, which is graded as a binary 1 or 0. GRPO finds a clever way to translate these binary outcomes into a relative score, which further frames RL as optimization over approximations rather than strict solutions. It might even be feasible to treat it as the Step 2 optimization problem directly. What we are really seeing is this:
$$\pi^* = \arg\max_{\pi} \ \mathbb{E}_{x \sim D, \; y \sim \pi(\cdot \mid x)}\big[\, f(x, y) \,\big], \qquad f(x, y) \in \{0, 1\} \ \text{(final-answer correctness)}$$
Comparing it with the Step 2 optimization, we can see how similar they are. This formulation neatly illustrates RL as a form of relaxed, approximate search via optimization.
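Here is a small sketch of the “binary outcome to score” step. It shows only the group-relative advantage normalization GRPO uses; the clipping and KL terms of the full objective are omitted, and the example rewards are made up.

```python
# GRPO-style group-relative scoring: normalize binary outcome rewards within a group
# of rollouts so they become a continuous, zero-centered advantage signal.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 rollouts for one prompt; only two reach the correct final answer (reward 1).
print(group_relative_advantages([0, 0, 1, 0, 0, 1, 0, 0]))
# Correct rollouts score about +1.73, incorrect ones about -0.58,
# even though the raw reward is only 0/1.
```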
Section 4: Hypotheses Time — Why Some Models Love RL and Others Choke
Consider recent RL math reasoning methods (e.g., DeepSeekMath). These approaches can be summarized as: given a dataset D, we optimize the model’s behavior directly against the reward function.
I am going to irresponsibly throw out 2 hypotheses here:
Hypothesis 1: Exploitation (Optimization Limitations)
For current LLMs on reasoning problems, optimization can only search over solutions drawn from the model’s existing behaviors. If optimal behaviors aren’t implicitly represented in the initial (SFT) model, no amount of optimization will generate them.
Recent studies suggest that many RL-driven “Aha!” moments already exist in the initial models, merely requiring optimization to surface. If desired behaviors are absent initially, optimization alone cannot create them.
Hypothesis 2: Exploration (LLM Challenges)
But you might challenge me here, saying: “What about exploration?” After all, classic RL heavily emphasizes exploration and is able to demonstrate new behaviors that are not in the dataset. Isn’t Hypothesis 1 overlooking this?
Here’s why I think exploration largely fails with LLMs:
- LLM decoding is essentially random search (e.g., top-p/top-k sampling). If the correct behavior lies deep in the distribution tail, we’re extremely unlikely ever to sample it.
- Thus, if the initial behavior isn’t “close enough” to the desired one, we practically never encounter it during exploration — making optimization impossible.
- Moreover, because rewards are sparse, current setups reward entire rollouts rather than intermediate steps, severely restricting RL’s effectiveness. There is no optimization at intermediate steps to stitch together new behaviors.
LLM-based RL thus differs significantly from traditional RL: exploration is inherently more challenging due to massive action spaces and sparse, high-level rewards.
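To put a rough number on the tail-sampling point above, here is a back-of-the-envelope calculation. All figures are made-up assumptions; the only purpose is to show the orders of magnitude involved.

```python
# Back-of-the-envelope: chance of ever sampling a behavior that sits deep in the tail.
# Assumptions (made up): the desired behavior requires a specific kind of continuation
# at 20 decoding points, each with only 1% probability mass under the base policy.

p_step = 0.01          # per-step probability of the desired continuation (assumed)
num_steps = 20         # points where the behavior must appear (assumed)
rollouts = 1_000_000   # a generous exploration budget

p_rollout = p_step ** num_steps        # one rollout exhibits the behavior: 1e-40
p_ever = rollouts * p_rollout          # union bound over the whole budget: 1e-34
print(p_rollout, p_ever)               # effectively zero -- exploration never sees it

# If the behavior is instead already well represented (say 90% per step),
# a single group of 64 rollouts finds it almost surely:
print(1 - (1 - 0.9 ** num_steps) ** 64)   # about 0.99975
```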
Combining Hypotheses 1 and 2: the effectiveness of current RL optimization is inherently constrained by what behaviors already exist in the initial SFT model and by how we design our rewards. RL in this setting mostly refines what’s implicitly there, rather than inventing completely new behaviors. The LLaMA model families probably lack such behaviors in the base model, while the Qwen model families have them. That may be what determines whether the optimization succeeds or fails.
Section 5: Empirical Evidence — The Qwen vs. LLaMA Bitter Taste Test
Recent research provides multiple data points supporting these hypotheses. One interesting paper (Gandhi et al., 2025) demonstrates the behavioral differences between LLaMA and Qwen, and shows that correcting the base model’s behaviors significantly improves the LLaMA model’s RL results.

In addition, other papers and blogs such as Open-R1, oat-zero, and SimpleRL-Zoo have also shown differences across model families. Logic-RL is another nice one, demonstrating that many self-correcting behaviors already exist at very early training steps.
Overall, Qwen-based models benefit significantly from RL tuning, suggesting stronger initial behaviors, while LLaMA-based models typically stagnate due to weaker initial representations. And this is correctable with more SFT that injects the missing behaviors.
Conclusion: The Bittersweet Lesson
Framing reinforcement learning (RL) explicitly as scalable search — optimization over existing representations — offers us a powerful new perspective on the classic interplay between Learning and Search.
Historically, we’ve invested substantial resources into Learning, where remarkable scaling laws have emerged. But now, recognizing RL as fundamentally a form of Search, we can envision an entirely new scaling law centered around Search itself. Consider “test-time compute”: I would view this concept as a special case of Search, specifically a subset of RL.
Thus, I argue that the true frontier now lies in scaling Search through reinforcement learning. Just as we’ve witnessed exponential gains from scaling up Learning, it’s now time to seriously explore and harness the scaling laws of Search.
In the short term, improving the initial SFT policies may be the most practical approach. Better starting representations ensure that RL’s search can more effectively optimize toward existing behaviors. Yet, to truly unlock transformative abilities — not just incremental “Aha!” moments, but groundbreaking discoveries akin to AlphaGo’s legendary “Move 37” — we must embrace deeper and more extensive use of RL itself, effectively scaling our Search capabilities.
Perhaps Sutton’s bitter lesson remains fundamentally correct. We’ve historically emphasized and scaled Learning, yet now is precisely the moment to intensify our investment in Search — that is, more powerful and more scalable RL. By developing better exploration techniques, intermediate rewards, and more sophisticated optimization methods, we can tackle larger, more challenging action spaces. Only then can we aspire to genuinely superhuman behaviors from our models.
Citation:
@misc{qi2025extrabitter,
title={The Bitter Lesson: Extra Bitter, No Sugar—Why Some Base LLM Models Choke on RL},
author={Jianing Qi},
year={2025},
howpublished={\url{https://j-qi.medium.com/295fee68feeb}},
note={Medium Blog},
}
References:
Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., & Goodman, N. D. (2025). Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. arXiv:2503.01307 [cs.CL]. https://arxiv.org/abs/2503.01307
Liu, Z., Chen, C., Li, W., Pang, T., Du, C., & Lin, M. (2025). There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study. Notion Blog. https://oatllm.notion.site/oat-zero
von Werra, L., Tunstall, L., Gallouédec, Q., Penedo, G., Beeching, E., Lozhkov, A., Tousignant, B., & van Strien, D. (2025, February 2). Open-R1: Update #1. Hugging Face Blog. https://huggingface.co/blog/open-r1/update-1
Xie, T., Gao, Z., Ren, Q., Luo, H., Hong, Y., Dai, B., Zhou, J., Qiu, K., Wu, Z., & Luo, C. (2025). Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning. arXiv:2502.14768 [cs.CL]. https://arxiv.org/abs/2502.14768
Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., & He, J. (2025). SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. arXiv:2503.18892 [cs.LG]. https://arxiv.org/abs/2503.18892