
The Transformer architecture, introduced by Vaswani et al. in 2017, serves as the backbone of contemporary language models. Over the years, numerous modifications to this architecture have been proposed to improve training stability, inference efficiency, context length, and robustness.

In a new paper, nGPT: Normalized Transformer with Representation Learning on the Hypersphere, an NVIDIA research team proposes the normalized Transformer (nGPT), which consolidates key findings in Transformer research under a unified framework and learns faster, reducing the number of training steps needed to reach the same accuracy by a factor of 4 to 20, depending on sequence length.

The researchers summarize their main contributions as follows:

Hypersphere-Based Normalization: The core advancement of nGPT lies in normalizing all vectors along the embedding dimension to unit norm, so that they reside on a unit hypersphere.
This approach ensures consistent dimensionality across matrices and allows matrix-vector multiplications to be interpreted as cosine similarities bounded in the range [-1, 1].
Notably, this normalization removes the need for weight decay, since vector norms are held fixed.

Mitigating Non-Linear Constraints: While normalization standardizes embeddings, it also constrains the inputs to non-linear units.
To address this, scaling factors are introduced for these units, relaxing the constraint and enhancing the model's flexibility.

Variable-Metric Optimization: Inspired by recent studies that position Transformers as meta-optimizers, the research team demonstrates that nGPT functions as a variable-metric optimizer.
Specifically:

Gradient Information: Each transformation block supplies gradient information.
Eigen Learning Rates: These gradients are scaled using learnable eigen learning rates derived from a variable-metric matrix.
Riemannian Retraction: Normalization acts as a retraction step in Riemannian optimization, projecting outputs back onto the hypersphere.

This process transforms nGPT into a data-driven optimizer, fine-tuning its outputs with precision.

One of nGPT's standout features is its remarkable efficiency in training.
By leveraging hypersphere-based normalization and eigen learning rates, the model achieves the same accuracy with up to 20 times fewer training steps.
Furthermore, this hypersphere representation offers a deeper understanding of the model's internal mechanics, enabling advanced statistical analysis and the application of hypersphere-specific mathematical tools.

The introduction of the normalized Transformer opens new avenues for exploration in language model optimization.
By framing embedding transformations as operations on a hypersphere, nGPT not only improves training efficiency but also paves the way for more robust and interpretable architectures.
This work highlights the potential of geometric insights in driving innovations in machine learning.

The paper nGPT: Normalized Transformer with Representation Learning on the Hypersphere is on arXiv.
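
To make the hypersphere ideas summarized above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the helper names (normalize, retraction_step), the random stand-in for a block's output, the scaling factor s, and the step sizes in alpha are all assumptions chosen for illustration. It shows three things: that a product of unit-norm rows with a unit-norm vector yields cosine similarities in [-1, 1], that a scaling factor can widen that range before a non-linearity, and that a scaled step followed by re-normalization keeps the hidden state on the hypersphere.

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere (L2 norm of 1 along `axis`)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retraction_step(h, block_out, alpha):
    """One illustrative hypersphere update (assumed form, not the paper's code).

    h         : current hidden state, unit norm, shape (d,)
    block_out : raw output of a hypothetical attention/MLP block, shape (d,)
    alpha     : per-dimension step sizes standing in for eigen learning rates
    """
    h_target = normalize(block_out)          # block's suggestion, on the sphere
    grad_like = h_target - h                 # plays the role of gradient information
    return normalize(h + alpha * grad_like)  # scaled step, then retraction

rng = np.random.default_rng(0)
d, n_rows = 8, 5

# Every row of W and the hidden state h are unit vectors.
W = normalize(rng.standard_normal((n_rows, d)))
h = normalize(rng.standard_normal(d))

# Dot products of unit vectors are cosine similarities, bounded in [-1, 1].
cosines = W @ h

# Hypothetical scaling factor that widens the bounded range before a non-linearity.
s = 4.0
scaled = s * cosines

# Small step toward a random stand-in block output; the result stays unit norm.
alpha = np.full(d, 0.05)
h_new = retraction_step(h, rng.standard_normal(d), alpha)

print("max |cosine|:", np.abs(cosines).max())
print("norm of updated state:", np.linalg.norm(h_new))
```

In a trained model the step sizes and scaling factors would be learned parameters rather than constants; fixed values are used here only to keep the sketch short.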


Author: Hecate He | Editor: Chain Zhang