
The remarkable success of OpenAI's o1 series and DeepSeek-R1 has demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs). However, the core training methodologies behind these reasoning models often remain veiled in their technical reports.
Recent community efforts have predominantly focused on mathematical reasoning, leaving the challenge of cross-domain generalization largely unexplored.
Furthermore, standard Group Relative Policy Optimization (GRPO) training is plagued by common issues such as performance bottlenecks, inefficient sample utilization, and difficulties in cultivating specialized reasoning skills when dealing with mixed-domain datasets.
These challenges complicate the effective scaling of RL methods for LLMs.

Addressing these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO).
This approach is designed to systematically tackle the aforementioned training challenges across multiple dimensions.
The team has publicly released a technical report detailing their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance concurrently in both mathematical and code domains.
By leveraging the same base model as DeepSeek (Qwen2.5-32B) and employing a purely reinforcement learning training approach, SRPO has achieved impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing the performance of DeepSeek-R1-Zero-32B. Even more remarkably, SRPO achieves this level of performance with only one-tenth of the training steps required by R1-Zero.

Challenges with Vanilla GRPO

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm.
However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels.
These issues included:

Cross-Domain Optimization Conflicts (Math vs. Code): Mathematical problems tend to elicit longer and more detailed reasoning trajectories (Long CoT), while code data exhibits a weaker inclination towards this.
Directly mixing these two data types led to conflicts, resulting in suboptimal performance in both domains.

Reduced Training Efficiency due to Similar Group Rewards: The GRPO algorithm relies on non-zero variance of the rewards within a sampled group to calculate the advantage.
When rollouts within a group yield nearly identical reward values, the calculated advantage approaches zero.
If a significant portion of the training batch exhibits this phenomenon, effective gradient contributions become minimal, drastically reducing training efficiency.

Premature Performance Saturation: GRPO training encountered early performance plateaus and reward saturation on benchmark evaluations.
This issue was partly attributed to insufficient data quality.
When the training data lacks sufficient complexity or diversity, particularly with an abundance of simpler problems, the model tends to conservatively maintain its performance on easier tasks, hindering its ability to develop the complex and in-depth reasoning required for challenging problems.

Two-Staged Training

To address the inherent response length conflicts between mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm:

Stage 1: Eliciting Reasoning Abilities: This initial training phase focuses exclusively on challenging mathematical data.
The primary goal is to fully incentivize the model's test-time scaling, fostering capabilities such as reflective pausing, backtracking, and step-by-step decomposition.

Stage 2: Skill Integration: In this stage, code data is introduced into the training process.
Building upon the reasoning foundation established in Stage 1, this phase aims to further enhance coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities.

Comparative Analysis of Training Strategies

The impact of different training data strategies on response length was analyzed, revealing the following insights:

Mixed Training: Models trained on a mixture of math and code data showed limited growth in response length and poor benchmark performance.
While math problems elicited some reasoning patterns, code problems often resulted in short, direct responses focused on immediate code output with minimal preliminary analysis or planning.

Math-Only Training: Training solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks.
Crucially, it fostered strong and generalizable reasoning abilities; when faced with programming tasks, the model attempted detailed, step-by-step reasoning, including the meticulous checking and revisiting of steps that it used in mathematical problem-solving.

Code-Only Training: While showing improved performance on code benchmarks, the development of explicit reasoning behavior was minimal, and achieving significant increases in response length proved difficult.
Responses to both code and math problems were noticeably shorter compared to math-only training, with code solutions often being directly generated without substantial step-by-step reasoning or initial analysis.

Staged Training: The two-stage training approach proposed by the Kwaipilot team yielded superior results in both mathematical and programming domains.
The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning patterns for programming tasks.
Notably, complex behaviors emerged, such as the model spontaneously utilizing code to assist in mathematical reasoning.

History Resampling

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards.
This often occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates.

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling.
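
The effect of identical rewards on the gradient signal can be illustrated with a minimal sketch of a group-normalized advantage, where each rollout's reward is centered on the group mean and scaled by the group standard deviation. The function name `group_advantages`, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made for this illustration, not details taken from the SRPO report.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps).

    `eps` is an assumed stabilizer that avoids division by zero when every
    reward in the group is the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes: correct rollouts get a positive advantage,
# incorrect rollouts a negative one.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every rollout earned the same reward: all advantages are
# zero, so the group contributes nothing to the policy gradient.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When about half of the groups in a batch look like `uniform`, roughly half of the sampled rollouts carry no learning signal, which is the inefficiency that History Resampling targets.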
During training, they recorded the reward outcomes of all rollouts within each epoch.
At the end of an epoch, they reconstructed the dataset for the next epoch based on the following criteria:

Filtering Overly Simple Samples: Samples where all rollouts resulted in correct answers were excluded, as they provided no informative signal for policy improvement.

Retaining Informative Samples: Samples with diverse outcomes (both correct and incorrect) or all incorrect outcomes were retained.
Samples with mixed outcomes generated positive reward variance, ensuring non-zero advantages and effective gradient signals.
Difficult samples where all rollouts were incorrect in the current epoch were also kept.
The rationale is that these initially challenging problems might become relatively easier for the updated policy, thus generating effective gradients in subsequent training.
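
The two criteria above can be sketched as a simple end-of-epoch filter. This is an illustrative sketch only: the `history` structure (a mapping from a prompt identifier to one boolean per rollout) and the function name `resample_for_next_epoch` are hypothetical and do not come from the SRPO report or its code.

```python
def resample_for_next_epoch(history):
    """Return the prompt ids to keep for the next epoch.

    `history` maps a prompt id to a list of booleans, one per rollout
    recorded during the epoch (True = correct). This layout is an
    assumption made for the sketch.
    """
    kept = []
    for prompt_id, outcomes in history.items():
        if outcomes and all(outcomes):
            # Overly simple: every rollout was correct, so the sample
            # is dropped from the next epoch.
            continue
        # Mixed outcomes or all-incorrect outcomes are retained.
        # Prompts with no recorded rollouts are also kept here.
        kept.append(prompt_id)
    return kept


history = {
    "easy": [True, True, True, True],
    "mixed": [True, False, True, False],
    "hard": [False, False, False, False],
}
next_epoch = resample_for_next_epoch(history)
print(next_epoch)  # ['mixed', 'hard']
```

In this sketch the filter only reuses outcomes that were already recorded during the epoch, so it requires no additional rollouts.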

This strategy aligns with the principle of curriculum learning, gradually exposing the model to increasingly challenging samples on average to enhance training efficiency.

Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response length growth.

Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets.
They applied heuristic rules to filter out irrelevant URLs and formatting noise, and ensured the completeness of core fields (question and answer ground truth) in the original data.
Following the data cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and those requiring image or table understanding.
For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic.

Before data ingestion, they conducted correctness verification for both math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions.
Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k).

Experimental Results

This section details the experimental results obtained using the SRPO method.
The Kwaipilot team focused on observing the changes in reward and metrics such as response length during training.

Training Process

The figure above illustrates the complete reward curve and response length curve during SRPO training.
After the initial reward growth began to plateau, the training transitioned into the second stage.
At the beginning of the second stage, the overall reward decreased due to the model's prior lack of training on code, followed by a steady increase in reward during subsequent training.
Integrating code data did not significantly increase the response length, which aligned with their expectations.
Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients.
This enhanced sampling efficiency led to stable reward growth, clearly showcasing the improved training efficiency achieved by the resampling strategy.

Reasoning Behaviors

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration.
They statistically analyzed responses containing these patterns and recorded the average response length for each.
During RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a self-verification ability.
They posit that the emergence of reflection in the model during RL, akin to human cognitive processes, is an adaptive behavior resulting from the policy optimization process.

As shown in the figure above, the model exhibited almost no proactive checking of, or reflection on, previous reasoning steps in the early stages of training.
However, as training progressed, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization.

Interestingly, they also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems.
It would first provide a solution process through mathematical reasoning and then proactively write program code to verify the correctness of the solution.
These instances demonstrated the model's ability to leverage procedural thinking for self-correction and multiple attempts, further indicating that in the later stages of training, the model had mastered broad thinking and the integrated application of various code-based reasoning approaches for problem-solving.

The paper "SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM" is on arXiv.
Try the SRPO-Qwen-32B model on Hugging Face.




