Startup World

A screenshot of the 2025 USAMO Problem #1 and a solution, shown on the AoPSOnline website.
Credit: AoPSOnline

The US Math Olympiad (USAMO) serves as a qualifier for the International Math Olympiad and presents a much higher bar than tests like the American Invitational Mathematics Examination (AIME).
While AIME problems are difficult, they require integer answers.
USAMO demands contestants write out complete mathematical proofs, scored for correctness, completeness, and clarity over nine hours and two days.

The researchers evaluated several AI reasoning models on the six problems from the 2025 USAMO shortly after their release, minimizing any chance the problems were part of the models' training data.
These models included Qwen's QwQ-32B, DeepSeek R1, Google's Gemini 2.0 Flash Thinking (Experimental) and Gemini 2.5 Pro, OpenAI's o1-pro and o3-mini-high, Anthropic's Claude 3.7 Sonnet with Extended Thinking, and xAI's Grok 3.
An April 25, 2025, screenshot of the researchers' MathArena website showing accuracy scores for SR models on each problem in the USAMO.
Credit: MathArena

While one model, Google's Gemini 2.5 Pro, achieved a higher average score than the rest at 10.1 out of 42 points (~24 percent), the results otherwise showed a massive performance drop compared to AIME-level benchmarks.
The other evaluated models lagged considerably further behind: DeepSeek R1 and Grok 3 averaged 2.0 points each, Google's Flash-Thinking scored 1.8, Anthropic's Claude 3.7 managed 1.5, while Qwen's QwQ and OpenAI's o1-pro both averaged 1.2 points.
OpenAI's o3-mini had the lowest average score at just 0.9 points (~2.1 percent).
Out of nearly 200 generated solutions across all tested models and runs, not a single one received a perfect score for any problem.

While OpenAI's newly released o3 and o4-mini-high were not examined for this study, benchmarks at the researchers' MathArena website show o3-high scoring 21.73 percent overall and o4-mini-high scoring 19.05 percent overall on USAMO.
However, those results may be contaminated because they were measured after the contest took place, meaning the newer OpenAI models could have included the solutions in their training data.

In the paper, the researchers identified several key recurring failure patterns.
The AI outputs contained logical gaps where mathematical justification was lacking, included arguments based on unproven assumptions, and continued producing incorrect approaches despite generating contradictory results.

A specific example involved USAMO 2025 Problem 5.
This problem asked models to find all positive whole numbers "k" such that a specific calculation involving sums of binomial coefficients raised to the power of "k" would always result in an integer, no matter which positive integer "n" was used.
On this problem, Qwen's QwQ model made a notable error: It incorrectly excluded non-integer possibilities at a stage where the problem statement allowed them.
This mistake led the model to an incorrect final answer despite having correctly identified the necessary conditions earlier in its reasoning process.
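
To make the Problem 5 condition concrete, here is a minimal Python sketch. It assumes the published contest wording, in which the quantity is the sum of C(n, i)^k for i from 0 to n, divided by n + 1. The function names and the search bound `max_n` are illustrative choices, not anything taken from the researchers' paper. A search like this only gathers evidence for small cases. It is not a proof, which is exactly the gap between answer-only tests like AIME and USAMO-style grading.

```python
from math import comb


def is_integer_for_n(k: int, n: int) -> bool:
    """Check whether (1 / (n + 1)) * sum_{i=0..n} C(n, i)**k is an integer.

    Assumes this is the expression from USAMO 2025 Problem 5.
    Uses exact integer arithmetic, so there is no floating-point error.
    """
    total = sum(comb(n, i) ** k for i in range(n + 1))
    return total % (n + 1) == 0


def first_failing_n(k: int, max_n: int = 30):
    """Return the smallest n in 1..max_n for which the value is not an
    integer, or None if no such n exists within the (hypothetical) bound."""
    for n in range(1, max_n + 1):
        if not is_integer_for_n(k, n):
            return n
    return None


if __name__ == "__main__":
    for k in range(1, 7):
        print(f"k={k}: first failing n up to 30 -> {first_failing_n(k)}")
```

In this range, every odd k fails already at n = 2, because 2 + 2^k is not divisible by 3 when k is odd, while the even values of k pass every n tested. That pattern is consistent with the contest's accepted answer (all even k), but showing it holds for every n is the part that requires a written proof.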




