Zichen Liu, Changyu Chen, Wenjun Li*, Tianyu Pang, Chao Du, Min Lin

Code: https://github.com/sail-sg/oat-zero


One of the most inspiring results from DeepSeek-R1-Zero is the occurrence of the “Aha moment” through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-reflection, which help it conduct in-context search to solve complex reasoning problems.

Within only a few days of R1-Zero’s release, several projects independently “reproduced” R1-Zero-like training at smaller scales (e.g., 1B to 7B) and all observed the Aha moment, which is typically accompanied by an increase in response length. We follow their settings to scrutinize the R1-Zero-like training process, and share the following findings in this blog:

<aside> 💡

  1. There may NOT be an Aha moment in R1-Zero-like training. Instead, we found that the Aha moment (such as self-reflection patterns) already appears at epoch 0, i.e., in the base models.
  2. We found Superficial Self-Reflection (SSR) in base models’ responses, in which case self-reflections do not necessarily lead to correct final answers.
  3. We took a closer look at R1-Zero-like training via RL, and found that the increasing response length phenomenon is not due to the emergence of self-reflection, but is a consequence of RL optimizing well-designed rule-based reward functions. </aside>

1. Aha Moment Appears at Epoch 0

1.1 Experiment settings

Base models. We investigate a wide range of base model families developed by different organizations, including Qwen-2.5, Qwen-2.5-Math, DeepSeek-Math, Rho-Math, and Llama-3.x.

Prompt templates. We directly prompt base models using the templates applied in R1-Zero (Template 1) and SimpleRL-Zero (Template 2):
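The exact template strings are rendered in the original post. As a reference, the sketch below shows how a question fills the {Question} placeholder; the template text is paraphrased from the DeepSeek-R1 paper rather than copied verbatim, and the constant and function names are ours:

```python
# Sketch of applying an R1-Zero-style template (Template 1). The string
# below is paraphrased from the DeepSeek-R1 paper, not copied verbatim;
# see the rendered templates in the post for the exact Template 1 / 2 text.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively. "
    "User: {Question}. Assistant:"
)

def build_prompt(question: str) -> str:
    """Fill the {Question} placeholder with a MATH problem."""
    return R1_ZERO_TEMPLATE.format(Question=question)
```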

Data. We collect 500 questions from the MATH training dataset, uniformly covering all five difficulty levels and all subjects, to fill in the {Question} placeholder in the above templates.
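A minimal sketch of this stratified sampling, assuming the MATH training set is loaded as a list of dicts with `problem`, `level`, and `subject` keys (the key names are our assumption):

```python
import random
from collections import defaultdict

def sample_math_questions(dataset, n_total=500, seed=0):
    """Sample questions roughly uniformly across (level, subject) strata.

    `dataset` is assumed to be a list of dicts with keys 'problem',
    'level', and 'subject' (key names are illustrative).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in dataset:
        strata[(record["level"], record["subject"])].append(record)
    per_stratum = n_total // len(strata)
    sampled = []
    for records in strata.values():
        sampled.extend(rng.sample(records, min(per_stratum, len(records))))
    return sampled[:n_total]
```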

Generation parameters. We perform a grid search over the sampling temperature (the exploration parameter) from 0.1 to 1.0 for model inference on the selected questions. Top-p is set to 0.9 for all experiments, and we generate 8 responses per question.
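A minimal generation sketch under these settings using vLLM; the model name is one example from the families above, and the max_tokens value is our assumption:

```python
from vllm import LLM, SamplingParams

# Temperature sweep with top-p fixed at 0.9 and 8 samples per question
# (vLLM's `n` parameter). The model name is one example from the base
# model families above; max_tokens is our assumption.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")
# `sampled` comes from the data sketch above; `build_prompt` from the
# template sketch.
prompts = [build_prompt(q["problem"]) for q in sampled]

for t in [i / 10 for i in range(1, 11)]:  # temperature grid: 0.1, 0.2, ..., 1.0
    params = SamplingParams(temperature=t, top_p=0.9, n=8, max_tokens=3072)
    outputs = llm.generate(prompts, params)
```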

1.2 Empirical results

We first tried all combinations of models and prompt templates (Template 1 or 2), then selected the best template for each model according to its instruction-following ability and fixed it for all experiments. Through a careful investigation, we have the following finding:

<aside> 😲

Finding: Aha moment appears at epoch 0. We observe that all models (except the Llama-3.x series) already exhibit self-reflection patterns without any post-training.

</aside>

Qualitatively, we list all observed keywords that suggest self-reflection patterns in the following table. Note that this list may not be exhaustive. The keywords are verified by humans, and words like "wait" are filtered out because their presence does not necessarily signify self-reflection and could instead result from hallucination. We note that different models display distinct keywords associated with self-reflection, which we hypothesize is influenced by their pre-training data.
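For concreteness, here is a minimal sketch of how such keyword matching can flag candidate self-reflection responses; the keyword list below is illustrative only, not the human-verified list from the table:

```python
# Illustrative keywords only; the human-verified, model-specific
# keywords are listed in the table.
REFLECTION_KEYWORDS = [
    "let me check", "verify", "re-evaluate", "recheck", "double-check",
]

def has_self_reflection(response: str) -> bool:
    """Flag a response containing any self-reflection keyword."""
    text = response.lower()
    return any(kw in text for kw in REFLECTION_KEYWORDS)
```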