Zichen Liu, Changyu Chen, Wenjun Li*, Tianyu Pang, Chao Du, Min Lin
Code: https://github.com/sail-sg/oat-zero
One of the most inspiring results from DeepSeek-R1-Zero is the emergence of the “Aha moment” through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-reflection, which help it conduct in-context search to solve complex reasoning problems.
Within only a few days of R1-Zero’s release, several projects independently “reproduced” R1-Zero-like training at smaller scales (e.g., 1B to 7B) and all observed the Aha moment, which is typically accompanied by an increase in response length. We follow their settings to scrutinize the R1-Zero-like training process, and share the following findings in this blog:
<aside> 💡
Base models. We investigate a wide range of base model families developed by different organizations, including Qwen-2.5, Qwen-2.5-Math, DeepSeek-Math, Rho-Math, and Llama-3.x.
Prompt templates. We directly prompt base models using the templates applied in R1-Zero and SimpleRL-Zero:
Template 1 (the same as in R1-Zero)
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: {Question} Assistant:
Template 2 (the same as in SimpleRL-Zero)
<|im_start|>system\nPlease reason step by step, and put your final answer within \boxed{}.<|im_end|>\n<|im_start|>user\n{Question}<|im_end|>\n<|im_start|>assistant
Data. We collect 500 questions from the MATH training dataset, which uniformly cover all five difficulty levels and all subjects, to fill in the {Question} in the above templates.
Generation parameters. We perform a grid search over the exploration parameter (temperature) from 0.1 to 1.0 for model inference on the selected questions. Top-p is set to 0.9 for all experiments, and we generate 8 responses per question (a sampling sketch follows this setup block).
</aside>
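As a concrete illustration of this setup, below is a minimal sketch of how one might prompt a base model with Template 1 and sample responses under these generation parameters, assuming vLLM for inference. The model name, `max_tokens`, and the placeholder question are illustrative, not the exact configuration used in our experiments.

```python
# Minimal sketch (not the exact experiment script): prompt a base model with
# Template 1 and sample 8 responses per question while sweeping temperature.
from vllm import LLM, SamplingParams

TEMPLATE_1 = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {Question} Assistant:"
)

questions = ["What is 15% of 80?"]  # placeholder; the study uses 500 MATH questions
prompts = [TEMPLATE_1.format(Question=q) for q in questions]

llm = LLM(model="Qwen/Qwen2.5-Math-7B")  # any base model under study
for temperature in [0.1, 0.4, 0.7, 1.0]:  # grid over the exploration parameter
    params = SamplingParams(n=8, temperature=temperature, top_p=0.9, max_tokens=2048)
    for output in llm.generate(prompts, params):
        responses = [o.text for o in output.outputs]  # 8 samples for this question
```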
We first tried all combinations of models and prompt templates (Template 1 or 2), then selected the best template for each model according to its instruction-following ability and fixed it for all subsequent experiments. Through a careful investigation, we have the following finding:
<aside> 😲
Finding: the Aha moment appears at epoch 0. We observe that all models (except the Llama-3.x series) already exhibit self-reflection patterns without any post-training.
</aside>
Qualitatively, we list all observed keywords that suggest self-reflection patterns in the table below. Note that this list may not be exhaustive. The keywords were verified by humans, and words like "wait" were filtered out because their presence does not necessarily signify self-reflection and could instead result from hallucination. We note that different models display distinct self-reflection keywords, which we hypothesize is influenced by their pre-training data.
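For reference, here is a minimal sketch of how such keyword-based detection could be implemented over sampled responses. The keyword list below is a small illustrative subset, not the full human-verified table referenced above.

```python
# Minimal sketch, not the exact analysis code: flag responses that contain
# any self-reflection keyword and report the fraction flagged.
REFLECTION_KEYWORDS = [
    "let me check", "verify", "recheck", "re-evaluate", "double-check",
]  # illustrative subset; the verified list differs per model

def has_self_reflection(response: str) -> bool:
    """Return True if the response contains any known self-reflection keyword."""
    text = response.lower()
    return any(kw in text for kw in REFLECTION_KEYWORDS)

responses = ["<think> Let me check the arithmetic again. </think> <answer> 12 </answer>"]
ratio = sum(has_self_reflection(r) for r in responses) / len(responses)
print(f"fraction of responses with self-reflection: {ratio:.2%}")
```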