
The success of OpenAI's o1 series and DeepSeek-R1 has demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs). However, the core training methodologies behind these reasoning models often remain undisclosed in their technical reports. Recent community efforts have predominantly focused on mathematical reasoning, leaving cross-domain generalization largely unexplored. Furthermore, standard Group Relative Policy Optimization (GRPO) training suffers from common issues such as performance bottlenecks, inefficient sample utilization, and difficulty cultivating specialized reasoning skills on mixed-domain datasets. These challenges complicate the effective scaling of RL methods for LLMs.

To address these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO). The approach is designed to tackle these training challenges across multiple dimensions. The team has publicly released a technical report detailing its training method and has open-sourced the SRPO-Qwen-32B model.

Notably, the team reports that this work is the first to achieve DeepSeek-R1-Zero-level performance concurrently in both the mathematical and code domains. Using the same base model as DeepSeek (Qwen2.5-32B) and a purely reinforcement learning training approach, SRPO scores 50 on AIME24 and 41.6 on LiveCodeBench, surpassing DeepSeek-R1-Zero-32B. SRPO achieves this with only one-tenth of the training steps required by R1-Zero.

Challenges with Vanilla GRPO

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm. However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels.
These issues included:

Cross-Domain Optimization Conflicts (Math vs. Code): Mathematical problems tend to elicit longer, more detailed reasoning trajectories (Long CoT), while code data shows a weaker inclination toward this. Directly mixing the two data types led to conflicts, resulting in suboptimal performance in both domains.

Reduced Training Efficiency due to Similar Group Rewards: The GRPO algorithm relies on non-zero reward variance within a sampled group to calculate the advantage. When rollouts within a group yield nearly identical reward values, the calculated advantage approaches zero. If a significant portion of the training batch exhibits this phenomenon, effective gradient contributions become minimal, drastically reducing training efficiency.

Premature Performance Saturation: GRPO training encountered early performance plateaus and reward saturation on benchmark evaluations. This issue was partly attributed to insufficient data quality. When the training data lacks sufficient complexity or diversity, particularly when simpler problems are overrepresented, the model tends to conservatively maintain its performance on easier tasks, which hinders its ability to develop the complex, in-depth reasoning required for challenging problems.

Two-Staged Training

To address the inherent response length conflict between the mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm:

Stage 1: Eliciting Reasoning Abilities: This initial training phase focuses exclusively on challenging mathematical data. The primary goal is to fully incentivize the model's test-time scaling, fostering capabilities such as reflective pausing, backtracking, and step-by-step decomposition.

Stage 2: Skill Integration: In this stage, code data is introduced into the training process. Building upon the reasoning foundation established in Stage 1, this phase aims to further enhance coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities.

Comparative Analysis of Training Strategies

The impact of different training data strategies on response length was analyzed, revealing the following insights:

Mixed Training: Models trained on a mixture of math and code data showed limited growth in response length and poor benchmark performance. While math problems elicited some reasoning patterns, code problems often resulted in short, direct responses focused on immediate code output with minimal preliminary analysis or planning.

Math-Only Training: Training solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks. Crucially, it fostered strong and generalizable reasoning abilities: when faced with programming tasks, the model attempted detailed, step-by-step reasoning, and in mathematical problem-solving it meticulously checked and revisited its steps.

Code-Only Training: Performance on code benchmarks improved, but the development of explicit reasoning behavior was minimal, and achieving significant increases in response length proved difficult. Responses to both code and math problems were noticeably shorter than with math-only training, and code solutions were often generated directly without substantial step-by-step reasoning or initial analysis.

Staged Training: The two-stage training approach proposed by the Kwaipilot team yielded superior results in both the mathematical and programming domains. The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning patterns for programming tasks. Notably, complex behaviors emerged, such as the model spontaneously using code to assist in mathematical reasoning.

History Resampling

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards. This often occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates.

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling.
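To see why identical rewards within a group stall learning, it helps to look at how a group-relative advantage is typically computed. The following is a minimal sketch, assuming the commonly used formulation in which each rollout's reward is normalized by its group's mean and standard deviation; the function name and the `eps` value are illustrative and are not taken from the team's report.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style sketch).

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` only guards against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages (about +1 and -1).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))

# A group where every rollout gets the same reward yields all-zero
# advantages, so it contributes no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

When roughly half of the groups in a batch look like the second case, half of the rollout compute for that step produces no update at all.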
During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they reconstructed the dataset for the next epoch based on the following criteria:

Filtering Overly Simple Samples: Samples where all rollouts resulted in correct answers were excluded, as they provided no informative signal for policy improvement.

Retaining Informative Samples: Samples with diverse outcomes (both correct and incorrect) or all-incorrect outcomes were retained. Samples with diverse outcomes generated positive reward variance, ensuring non-zero advantages and effective gradient signals. Difficult samples where all rollouts were incorrect in the current epoch were also kept, on the rationale that these initially challenging problems might become relatively easier for the updated policy, thus generating effective gradients in subsequent training.
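The two criteria above can be sketched as a simple end-of-epoch filter. This is an illustrative sketch only: it assumes binary rewards where `1.0` means a correct rollout, and the function name, the `rollout_history` mapping, and the choice to keep samples with no recorded rollouts are all assumptions rather than details from the report.

```python
def resample_for_next_epoch(dataset, rollout_history, correct_reward=1.0):
    """Build the next epoch's dataset from this epoch's rollout rewards.

    dataset: iterable of sample ids.
    rollout_history: dict mapping sample id -> list of rewards recorded
        for that sample's rollouts in the epoch that just finished.
    """
    next_epoch = []
    for sample_id in dataset:
        rewards = rollout_history.get(sample_id, [])
        # No recorded rollouts yet: keep the sample (assumption of this sketch).
        if not rewards:
            next_epoch.append(sample_id)
            continue
        # Drop samples the policy solved on every rollout; keep mixed and
        # all-incorrect samples.
        all_correct = all(r >= correct_reward for r in rewards)
        if not all_correct:
            next_epoch.append(sample_id)
    return next_epoch

history = {
    "easy": [1.0, 1.0, 1.0, 1.0],   # all correct -> filtered out
    "mixed": [1.0, 0.0, 1.0, 0.0],  # diverse outcomes -> kept
    "hard": [0.0, 0.0, 0.0, 0.0],   # all incorrect -> kept
}
print(resample_for_next_epoch(["easy", "mixed", "hard"], history))  # ['mixed', 'hard']
```

Because the filter reuses rewards that were already recorded during training, it needs no extra rollouts to decide which samples to keep.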
This strategy aligns with the principle of curriculum learning, gradually exposing the model to samples that are, on average, increasingly challenging in order to enhance training efficiency. Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response length growth.

Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and ensured the completeness of core fields (question and answer ground truth) in the original data. Following the data cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and those requiring image or table understanding. For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic.

Before data ingestion, they verified the correctness of both math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions. Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k).

Experimental Results

This section details the experimental results obtained using the SRPO method. The Kwaipilot team focused on observing the changes in reward and in metrics such as response length during training.

Training Process

The figure above illustrates the complete reward curve and response length curve during SRPO training. After the initial reward growth began to plateau, training transitioned into the second stage. At the beginning of the second stage, the overall reward decreased because the model had not previously been trained on code, and then increased steadily during subsequent training. Integrating code data did not significantly increase response length, which matched the team's expectations. Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients. This enhanced sampling efficiency led to stable reward growth, showing the improved training efficiency achieved by the resampling strategy.

Reasoning Behaviors

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration. They statistically analyzed responses containing these patterns and recorded the average response length for each.
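An analysis of this kind can be approximated with simple keyword matching. The sketch below is hypothetical: the cue phrases for each pattern and the use of whitespace-separated word counts as the length measure are assumptions made for illustration, and the article does not describe the team's actual matching rules.

```python
import re
from statistics import fmean

# Hypothetical cue phrases per reflective pattern (not from the report).
PATTERN_CUES = {
    "recheck": [r"\bre-?check\b", r"\bdouble-check\b", r"\bverify\b"],
    "hesitation": [r"\bwait\b", r"\bhmm\b"],
    "exploration": [r"\balternatively\b", r"\banother approach\b"],
}

def pattern_stats(responses):
    """Count responses containing each pattern and average their length."""
    stats = {}
    for name, cues in PATTERN_CUES.items():
        regex = re.compile("|".join(cues), re.IGNORECASE)
        matched = [resp for resp in responses if regex.search(resp)]
        # Length is measured in whitespace-separated words (an assumption).
        lengths = [len(resp.split()) for resp in matched]
        stats[name] = {
            "count": len(matched),
            "avg_length": fmean(lengths) if lengths else 0.0,
        }
    return stats

sample_responses = [
    "Wait, let me recheck the sum.",
    "The answer is 4.",
    "Alternatively, try induction here.",
]
for name, result in pattern_stats(sample_responses).items():
    print(name, result)
```

Running the same function on responses sampled at different training steps would show how the frequency of each pattern changes over time.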
During RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a self-verification ability. They posit that the emergence of reflection during RL, akin to human cognitive processes, is an adaptive behavior resulting from the policy optimization process.

As shown in the figure above, the model exhibited almost no proactive checking of, or reflection on, previous reasoning steps in the early stages of training. However, as training progressed, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization.

Interestingly, they also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems. It would first provide a solution process through mathematical reasoning and then proactively write program code to verify the correctness of the solution. These instances demonstrated the model's ability to leverage procedural thinking for self-correction and multiple attempts, further indicating that in the later stages of training, the model had mastered broad thinking and the integrated application of various code-based reasoning approaches for problem-solving.

The paper SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM is on arXiv.

Try the SRPO-Qwen-32B model on HuggingFace.




