The remarkable success of OpenAI’s o1 series and DeepSeek-R1 has unequivocally demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs).

However, the core training methodologies behind these groundbreaking reasoning...