
The remarkable success of OpenAI's o1 series and DeepSeek-R1 has unequivocally demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs). However, the core training methodologies behind these groundbreaking reasoning models often remain veiled in their technical reports.
Recent community efforts have predominantly focused on mathematical reasoning, leaving the challenge of cross-domain generalization largely unexplored.
Furthermore, standard Group Relative Policy Optimization (GRPO) training is plagued by common issues such as performance bottlenecks, inefficient sample utilization, and difficulties in cultivating specialized reasoning skills when dealing with mixed-domain datasets.
These challenges complicate the effective scaling of RL methods for LLMs. Addressing these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO).
This innovative approach is designed to systematically tackle the aforementioned training challenges across multiple dimensions.
The team has publicly released a technical report detailing the intricacies of their training method and has also open-sourced the SRPO-Qwen-32B model. Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance concurrently in both mathematical and code domains.
By leveraging the same base model as DeepSeek (Qwen2.5-32B) and employing a purely reinforcement learning training approach, SRPO has achieved impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing the performance of DeepSeek-R1-Zero-32B. Even more remarkably, SRPO achieves this level of performance with only one-tenth of the training steps required by R1-Zero.

Challenges with Vanilla GRPO

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm.
However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels.
These issues included:

Cross-Domain Optimization Conflicts (Math vs. Code): Mathematical problems tend to elicit longer and more detailed reasoning trajectories (Long CoT), while code data exhibits a weaker inclination towards this. Directly mixing these two data types led to conflicts, resulting in suboptimal performance in both domains.

Reduced Training Efficiency due to Similar Group Rewards: The GRPO algorithm relies on the variance of non-zero rewards within a sampled group to calculate the advantage.
When rollouts within a group yield nearly identical reward values, the calculated advantage approaches zero.
If a significant portion of the training batch exhibits this phenomenon, effective gradient contributions become minimal, drastically reducing training efficiency (see the sketch after this list).

Premature Performance Saturation: GRPO training encountered early performance plateaus and reward saturation on benchmark evaluations. This issue was partly attributed to insufficient data quality. When the training data lacks sufficient complexity or diversity, particularly with an abundance of simpler problems, the model tends to conservatively maintain its performance on easier tasks, hindering its ability to develop the complex and in-depth reasoning required for challenging problems.
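To make the second issue concrete, below is a minimal sketch of a group-relative advantage computation in the style commonly attributed to GRPO; the exact normalization used by the Kwaipilot team is not spelled out in this write-up, so the mean/std form here is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style (assumed mean/std normalization):
    each rollout's reward is compared against its own sampled group.
    If every rollout in the group earns the same reward, the advantages
    collapse to ~0 and the group contributes no useful gradient."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes in a group give informative, non-zero advantages ...
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1, -1,  1, -1]
# ... while a group where every rollout succeeds (or fails) is wasted compute.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # ~[ 0,  0,  0,  0]
```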
Two-Staged Training

To address the inherent response length conflicts between mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm:

Stage 1: Eliciting Reasoning Abilities: This initial training phase focuses exclusively on challenging mathematical data. The primary goal is to fully incentivize the model's test-time scaling, fostering capabilities such as reflective pausing, backtracking, and step-by-step decomposition.

Stage 2: Skill Integration: In this stage, code data is introduced into the training process.
Building upon the reasoning foundation established in Stage 1, this phase aims to further enhance coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities; a minimal sketch of this staged schedule follows.
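Purely as an illustration of the staged schedule (the dataset variables and stage switch are hypothetical, not taken from the report), the per-stage training pool could be assembled along these lines:

```python
def build_stage_pool(stage, math_data, code_data):
    """Assemble the training pool for a given stage of the two-stage paradigm.

    Stage 1 uses challenging math data only, to elicit long-form reasoning;
    Stage 2 mixes code data in on top of the math foundation.
    """
    if stage == 1:
        return list(math_data)
    return list(math_data) + list(code_data)

math_data = ["<challenging math problem>", "..."]   # placeholders
code_data = ["<algorithmic coding problem>", "..."]

stage1_pool = build_stage_pool(1, math_data, code_data)
stage2_pool = build_stage_pool(2, math_data, code_data)
```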
Comparative Analysis of Training Strategies

The impact of different training data strategies on response length was analyzed, revealing the following insights:

Mixed Training: Models trained on a mixture of math and code data showed limited growth in response length and poor benchmark performance. While math problems elicited some reasoning patterns, code problems often resulted in short, direct responses focused on immediate code output with minimal preliminary analysis or planning.

Math-Only Training: Training solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks.
Crucially, it fostered strong and generalizable reasoning abilities; when faced with programming tasks, the model attempted detailed, step-by-step reasoning, including the meticulous checking and revisiting of steps familiar from mathematical problem-solving.

Code-Only Training: While showing improved performance on code benchmarks, the development of explicit reasoning behavior was minimal, and achieving significant increases in response length proved difficult.
Responses to both code and math problems were noticeably shorter compared to math-only training, with code solutions often being directly generated without substantial step-by-step reasoning or initial analysis.

Staged Training: The two-stage training approach proposed by the Kwaipilot team yielded superior results in both mathematical and programming domains.
The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning patterns for programming tasks.
Notably, complex behaviors emerged, such as the model spontaneously utilizing code to assist in mathematical reasoning.

History Resampling

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards.
This often occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates. To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling.
During training, they recorded the reward outcomes of all rollouts within each epoch.
At the end of an epoch, they reconstructed the dataset for the next epoch based on the following criteria:

Filtering Overly Simple Samples: Samples where all rollouts resulted in correct answers were excluded, as they provided no informative signal for policy improvement.

Retaining Informative Samples: Samples with diverse outcomes (both correct and incorrect) or all incorrect outcomes were retained.
These samples generated positive reward variance, ensuring non-zero advantages and effective gradient signals.
Furthermore, difficult samples where all rollouts were incorrect in the current epoch were also kept.
The rationale is that these initially challenging problems might become relatively easier for the updated policy, thus generating effective gradients in subsequent training.
This strategy aligns with the principle of curriculum learning, gradually exposing the model to increasingly challenging samples on average to enhance training efficiency. Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response length growth.
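A minimal sketch of this end-of-epoch filtering step is shown below; the data layout (a mapping from sample id to per-rollout correctness flags) is an assumption for illustration, not the team's actual implementation.

```python
def history_resample(rollout_history):
    """Rebuild the next epoch's dataset from the rewards recorded this epoch.

    rollout_history maps each sample id to a list of per-rollout correctness
    flags (True = correct). Samples solved by every rollout are dropped as
    uninformative; mixed-outcome and all-incorrect samples are retained.
    """
    next_epoch = []
    for sample_id, outcomes in rollout_history.items():
        if all(outcomes):              # overly simple: every rollout succeeded
            continue
        next_epoch.append(sample_id)   # informative: mixed or all incorrect
    return next_epoch

history = {
    "q1": [True, True, True],     # filtered out
    "q2": [True, False, True],    # kept (positive reward variance)
    "q3": [False, False, False],  # kept (may become solvable later)
}
print(history_resample(history))  # ['q2', 'q3']
```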
Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and ensured the completeness of core fields (question and answer ground truth) in the original data.
Following the data cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and those requiring image or table understanding.
For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic. Before data ingestion, they conducted correctness verification for both math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions.
Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k), as illustrated in the sketch below.
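As an illustration of this bucketing, a pass-rate threshold scheme might look like the following; the 0.7 and 0.3 cutoffs are assumptions, since the exact thresholds are not stated here.

```python
def difficulty_bucket(num_correct, k):
    """Bucket a problem by its empirical pass rate over k sampled attempts.
    The 0.7 / 0.3 thresholds are illustrative, not the values used for SRPO."""
    pass_rate = num_correct / k
    if pass_rate >= 0.7:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"

print(difficulty_bucket(14, 16))  # easy
print(difficulty_bucket(6, 16))   # medium
print(difficulty_bucket(1, 16))   # hard
```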
Experimental Results

This section details the experimental results obtained using the SRPO method. The Kwaipilot team focused on observing the changes in reward and metrics such as response length during training.

Training Process

The figure in the technical report illustrates the complete reward curve and response length curve during SRPO training.
After the initial reward growth began to plateau, the training transitioned into the second stage.
At the beginning of the second stage, the overall reward decreased due to the model's prior lack of training on code, followed by a steady increase in reward during subsequent training.
Integrating code data did not significantly increase the response length, which aligned with their expectations.
Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method. Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients.
This enhanced sampling efficiency led to stable reward growth, clearly showcasing the improved training efficiency achieved by the resampling strategy.

Reasoning Behaviors

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration.
They statistically analyzed responses containing these patterns and recorded the average response length for each.
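A rough sketch of how such pattern statistics could be collected is given below; the trigger phrases are invented for illustration and are not the team's actual pattern definitions.

```python
import re

# Illustrative trigger phrases only; the report does not list the exact cues.
PATTERNS = {
    "recheck":     [r"\blet me check\b", r"\bdouble-check\b", r"\bverify\b"],
    "hesitation":  [r"\bwait\b", r"\bhmm\b", r"\bactually\b"],
    "exploration": [r"\balternatively\b", r"\banother approach\b"],
}

def pattern_stats(responses):
    """Count responses containing each reflective pattern and their average length."""
    stats = {}
    for name, cues in PATTERNS.items():
        hits = [r for r in responses if any(re.search(c, r.lower()) for c in cues)]
        avg_len = sum(len(r) for r in hits) / len(hits) if hits else 0.0
        stats[name] = {"count": len(hits), "avg_response_chars": avg_len}
    return stats
```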
During RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a self-verification ability.
They posit that the emergence of reflection in the model during RL, akin to human cognitive processes, is an adaptive behavior resulting from the policy optimization process. As shown in the report's figure, the model exhibited almost no proactive checking and reflection on previous reasoning steps in the early stages of training.
However, as training progressed, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization. Interestingly, they also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems.
It would first provide a solution process through mathematical reasoning and then proactively write program code to verify the correctness of the solution.
These instances demonstrated the model's ability to leverage procedural thinking for self-correction and multiple attempts, further indicating that in the later stages of training, the model had mastered broad thinking and the integrated application of various code-based reasoning approaches for problem-solving.

The paper "SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM" is on arXiv. Try the SRPO-Qwen-32B model on Hugging Face.




