The increasing integration of robots across various sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches.

ByteDance's Astra is an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.

Traditional navigation systems typically consist of multiple smaller, often rule-based modules that handle the core challenges of target localization, self-localization, and path planning.

Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, which is especially challenging in repetitive environments like warehouses, where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.

While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question.

Astra (https://astra-mobility.github.io/) addresses these limitations. Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.

Astra-Global: The Intelligent Brain for Global Localization

Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global positioning within a map.

Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.

The construction of this robust localization system begins with offline mapping. The research team developed an offline method to build a hybrid topological-semantic graph G=(V,E,L):

V (Nodes): Keyframes, obtained by temporal downsampling of input video and SfM-estimated 6-Degrees-of-Freedom (6-DoF) camera poses, act as nodes encoding camera poses and landmark references.

E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.

L (Landmarks): [...] understanding.
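The graph G=(V,E,L) can be pictured as a small data structure. The sketch below is illustrative only and is not Astra's implementation. The class and field names (`Node`, `Landmark`, `HybridGraph`, `description`, `visible_from`) are hypothetical, poses are stored as plain tuples, and the keyword lookup is a crude stand-in for the model's own language understanding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """V: a keyframe with a 6-DoF camera pose and references to landmarks."""
    node_id: int
    position: tuple          # (x, y, z) in metres
    orientation: tuple       # (roll, pitch, yaw) in radians
    landmark_ids: set = field(default_factory=set)


@dataclass
class Landmark:
    """L: a semantic landmark; `description` is a hypothetical attribute."""
    landmark_id: int
    description: str
    visible_from: set = field(default_factory=set)  # ids of nodes that see it


class HybridGraph:
    def __init__(self):
        self.nodes = {}
        self.landmarks = {}
        self.edges = {}      # E: node_id -> set of neighbouring node ids

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def add_edge(self, a, b):
        # Undirected: store the link in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def add_landmark(self, landmark, seen_by):
        # Record the landmark-to-node association on both sides.
        self.landmarks[landmark.landmark_id] = landmark
        for node_id in seen_by:
            landmark.visible_from.add(node_id)
            self.nodes[node_id].landmark_ids.add(landmark.landmark_id)

    def nodes_for(self, keyword):
        """Sorted ids of nodes seeing a landmark whose description has keyword."""
        hits = set()
        for landmark in self.landmarks.values():
            if keyword.lower() in landmark.description.lower():
                hits |= landmark.visible_from
        return sorted(hits)


graph = HybridGraph()
for i in range(3):
    graph.add_node(Node(i, (float(i), 0.0, 0.0), (0.0, 0.0, 0.0)))
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.add_landmark(Landmark(0, "fire exit door"), seen_by=[1, 2])
graph.add_landmark(Landmark(1, "coffee machine"), seen_by=[0])
print(graph.nodes_for("exit"))
```

Storing the association on both the landmark and the node keeps lookups cheap in either direction: from a described landmark to candidate nodes, or from a node to what should be visible there.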

Each landmark stores semantic attributes and is connected to multiple nodes via co-visibility relationships.

In practical localization, Astra-Global uses a coarse-to-fine, two-stage process. The coarse stage analyzes input images and localization prompts, detects landmarks, establishes correspondence with a pre-built landmark map, and filters candidates based on visual consistency. The fine stage then uses the query image and coarse output to sample reference map nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.

For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.

To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology. Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization.
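To make the GRPO step more concrete, here is a minimal sketch of a rule-based reward and of the group-relative advantage that GRPO derives from it. The article names the four reward terms but gives neither their definitions nor their weights, so the response format (`<landmarks>…</landmarks><node>…</node>`), the `WEIGHTS` values, and the reading of each term are assumptions made for illustration. The advantage normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
import re
import statistics

# Hypothetical weights: the article lists the reward terms, not their values.
WEIGHTS = {"format": 0.2, "landmark": 0.3, "match": 0.4, "extra": 0.1}
PATTERN = re.compile(r"<landmarks>(.*?)</landmarks>\s*<node>(\d+)</node>")


def rule_based_reward(response, true_landmarks, true_node):
    """Score one sampled response against the known landmarks and map node."""
    m = PATTERN.fullmatch(response.strip())
    if m is None:
        return 0.0                      # malformed output earns nothing
    predicted = {s.strip() for s in m.group(1).split(",") if s.strip()}
    reward = WEIGHTS["format"]          # output parsed correctly
    if true_landmarks:                  # fraction of true landmarks recovered
        recalled = len(predicted & true_landmarks) / len(true_landmarks)
        reward += WEIGHTS["landmark"] * recalled
    if int(m.group(2)) == true_node:    # matched the right map node
        reward += WEIGHTS["match"]
    if predicted and predicted <= true_landmarks:
        reward += WEIGHTS["extra"]      # assumed: bonus for no spurious landmarks
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


group = [
    "<landmarks>door, sign</landmarks><node>17</node>",
    "<landmarks>door, plant</landmarks><node>3</node>",
    "I think it is near the door",
]
rewards = [rule_based_reward(r, {"door", "sign"}, 17) for r in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, no separate value network is needed: a response is reinforced only to the extent that it beats the other samples drawn for the same query.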

With GRPO, [...] environments, surpassing SFT-only methods.

Astra-Local: The Intelligent Assistant for Local Planning

Astra-Local acts as the intelligent assistant for Astra's high-frequency tasks, planning local paths and estimating odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.

The 4D spatio-temporal encoder replaces traditional mobile stack perception and prediction modules. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering.
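The 2D-to-3D conversion can be illustrated with the "lift" and "splat" steps of Lift-Splat-Shoot in a toy, pure-Python form. This is a sketch under simplifying assumptions, not Astra's encoder: it uses a single pinhole camera with identity extrinsics, hand-written features in place of ViT outputs, and a hypothetical `lift_splat` helper. Each pixel carries a feature vector and a probability distribution over discrete depths. Lifting places a depth-weighted copy of the feature at every candidate depth along the pixel's ray, and splatting sums whatever lands in the same voxel.

```python
import math


def lift_splat(pixels, depth_bins, intrinsics, voxel_size):
    """pixels: list of ((u, v), feature, depth_probs). Returns {voxel: feature}."""
    fx, fy, cx, cy = intrinsics
    voxels = {}
    for (u, v), feature, depth_probs in pixels:
        for z, p in zip(depth_bins, depth_probs):
            if p == 0.0:
                continue
            # Lift: back-project the pixel to a 3D point at depth z.
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            key = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            )
            # Splat: sum-pool the depth-weighted feature into its voxel.
            cell = voxels.setdefault(key, [0.0] * len(feature))
            for i, f in enumerate(feature):
                cell[i] += p * f
    return voxels


intrinsics = (100.0, 100.0, 50.0, 50.0)      # fx, fy, cx, cy (made-up values)
depth_bins = [1.0, 3.0]                      # candidate depths in metres
pixels = [
    ((50, 50), [1.0, 2.0], [0.25, 0.75]),    # uncertain depth: feature is split
    ((51, 50), [1.0, 1.0], [1.0, 0.0]),      # neighbour pixel, same near voxel
    ((150, 50), [4.0, 0.0], [1.0, 0.0]),     # off-centre pixel
]
voxels = lift_splat(pixels, depth_bins, intrinsics, voxel_size=1.0)
for key in sorted(voxels):
    print(key, voxels[key])
```

Since each depth distribution sums to one, the total feature mass is preserved: a pixel with uncertain depth is spread along its ray rather than being committed to a single 3D location.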

The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.

The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching.
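Flow matching itself is compact enough to show in a few lines. The sketch below assumes the common straight-line (linear interpolation) formulation and treats a trajectory as a flat list of waypoint coordinates. The function names are hypothetical, and an analytic "oracle" velocity field stands in for the trained Transformer, so this shows the objective and the sampler rather than Astra's planning head.

```python
def flow_matching_pair(x0, x1, t):
    """Training example: point on the straight path from noise x0 to target x1,
    plus the constant velocity the network should predict there."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def fm_loss(predicted_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target_v)) / len(target_v)


def sample(velocity_fn, x0, steps=10):
    """Generate a trajectory by Euler-integrating dx/dt = v(x, t) from t=0 to 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Flattened (x, y) waypoints of a made-up expert trajectory.
expert = [0.0, 0.0, 1.0, 0.5, 2.0, 1.5]
noise = [0.3, -0.8, 1.9, 0.1, -0.4, 0.7]


def oracle_velocity(x, t):
    """Exact velocity field when the only target is `expert` (stand-in model)."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, expert)]


x_t, target_v = flow_matching_pair(noise, expert, t=0.4)
print(fm_loss(oracle_velocity(x_t, 0.4), target_v))   # close to zero
print(sample(oracle_velocity, noise))                 # close to `expert`
```

In a real planner the velocity function would be a network conditioned on scene features, speed, and task information, trained by minimizing `fm_loss` over many noise/expert pairs and random values of `t`.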

To prevent collisions, the planning head incorporates a masked Euclidean Signed Distance Field (ESDF) loss. This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground truth trajectory mask, significantly reducing collision rates.
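The article does not spell out the loss, so the following is only one plausible reading, reduced to a 2D grid for brevity. It assumes a hinge penalty on waypoints closer to an obstacle than a safety `margin`, and it assumes the mask switches the penalty off on cells that the ground-truth trajectory itself visits. The function names, the margin value, and the brute-force distance computation are all illustrative choices.

```python
import math


def esdf_2d(occupancy):
    """Signed distance in cells: positive in free space, negative in obstacles."""
    rows, cols = len(occupancy), len(occupancy[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    occupied = [(r, c) for r, c in cells if occupancy[r][c]]
    free = [(r, c) for r, c in cells if not occupancy[r][c]]
    field = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        # Distance to the nearest cell of the opposite kind.
        others = free if occupancy[r][c] else occupied
        d = min((math.hypot(r - rr, c - cc) for rr, cc in others), default=math.inf)
        field[r][c] = -d if occupancy[r][c] else d
    return field


def masked_esdf_loss(waypoints, field, mask, margin=1.5):
    """Average hinge penalty over waypoints, counted only where mask is True."""
    total = 0.0
    for r, c in waypoints:
        if mask[r][c]:
            total += max(0.0, margin - field[r][c])
    return total / len(waypoints)


occupancy = [[0] * 5 for _ in range(5)]
occupancy[2][2] = 1                          # a single obstacle in the middle
field = esdf_2d(occupancy)

near = [(r, 1) for r in range(5)]            # path brushing past the obstacle
far = [(r, 0) for r in range(5)]             # path keeping its distance
all_on = [[True] * 5 for _ in range(5)]

# Assumed mask: disable the penalty on cells visited by the ground-truth path.
gt_mask = [row[:] for row in all_on]
for r, c in near:
    gt_mask[r][c] = False

print(masked_esdf_loss(near, field, all_on))
print(masked_esdf_loss(far, field, all_on))
print(masked_esdf_loss(near, field, gt_mask))
```

Under this reading, the mask keeps the collision term from fighting the imitation objective when the demonstrated path legitimately passes close to an obstacle.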

Experiments demonstrate the planning head's superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.

The odometry head estimates the robot's relative pose from the 4D features and additional sensor data (such as IMU and wheel data). It trains a Transformer model to fuse information from different sensors. Each sensor modality is processed by a specific tokenizer; the resulting tokens are combined with modality embeddings and temporal positional embeddings and fed into a Transformer encoder, and a CLS token is finally used to predict the relative pose.
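A relative pose is only useful once it is chained with the ones before it. The sketch below shows that accumulation step in the plane, assuming each prediction is a body-frame increment `(dx, dy, dtheta)`. The planar simplification and the function names are assumptions made for illustration; the article does not state how Astra-Local parameterizes its relative pose.

```python
import math


def wrap_angle(theta):
    """Map an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def compose(pose, delta):
    """Apply a body-frame increment (dx, dy, dtheta) to a world pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        wrap_angle(theta + dtheta),
    )


def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain relative poses into a trajectory of world poses."""
    trajectory = [start]
    for delta in deltas:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory


# Drive 1 m forward and turn left 90 degrees, four times: a closed square.
square = [(1.0, 0.0, math.pi / 2)] * 4
trajectory = integrate(square)
for pose in trajectory:
    print(tuple(round(value, 3) for value in pose))
```

Because every step is rotated by the heading accumulated so far, a small rotational error early on bends the whole remaining path, which is why heading accuracy matters so much for overall trajectory error.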

[...] accuracy and reducing overall trajectory error.

Experimental Validation

Extensive experiments were conducted in diverse indoor environments. Astra-Global's localization capabilities were validated through various experiments, demonstrating superior performance in handling text and image localization queries. [...]

Unlike methods that rely on global features, Astra-Global precisely captures fine details like room numbers, preventing localization errors in similar scenes.

Viewpoint Robustness: Based on semantic landmarks, Astra-Global maintains stable localization even with large camera angle changes, where Visual Place Recognition (VPR) methods typically fail.

Pose Accuracy: Astra-Global leverages landmark spatial relationships to select the best matching pose, showing significantly higher pose accuracy (within 1-meter distance error and 5-degree angular error) than traditional VPR, with over 30% [...].
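The pose accuracy criterion quoted above (within 1 meter and 5 degrees) translates directly into a success-rate metric. The sketch below assumes planar poses given as `(x, y, yaw_degrees)` for brevity, although the article reports 6-DoF poses. The function names and sample values are made up for illustration.

```python
import math


def pose_error(predicted, truth):
    """Return (distance in metres, absolute yaw difference in degrees)."""
    distance = math.hypot(predicted[0] - truth[0], predicted[1] - truth[1])
    # Wrap the yaw difference so that 359 vs. 2 degrees counts as 3, not 357.
    angle = abs((predicted[2] - truth[2] + 180.0) % 360.0 - 180.0)
    return distance, angle


def pose_accuracy(predictions, truths, max_distance=1.0, max_angle=5.0):
    """Fraction of predictions within both the distance and angle thresholds."""
    hits = 0
    for predicted, truth in zip(predictions, truths):
        distance, angle = pose_error(predicted, truth)
        if distance <= max_distance and angle <= max_angle:
            hits += 1
    return hits / len(predictions)


predictions = [(0.5, 0.0, 359.0), (2.0, 0.0, 0.0), (0.0, 0.0, 10.0)]
truths = [(0.0, 0.0, 2.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(pose_accuracy(predictions, truths))
```

Only the first prediction passes: the second is too far away and the third is rotated too much, so a pose must satisfy both thresholds at once to count.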

The planning head, using Transformer-based flow matching and masked ESDF loss, outperformed methods like ACT and diffusion policies in collision rate, speed, and overall score on OOD datasets.

The odometry head was evaluated on multimodal datasets including synchronized image sequences, IMU, wheel data, and ground truth poses. [...] estimation. Integrating IMU data dramatically improved rotational estimation accuracy, reducing overall trajectory error to approximately 2%. Further inclusion of wheel data enhanced scale stability and estimation accuracy, validating its superior multi-sensor data fusion capabilities.

Astra holds significant promise for future development and applications. Its deployment can be expanded to more complex indoor environments like large shopping malls, hospitals, and libraries, where it can assist in tasks such as precise product location, efficient medical supply delivery, and book organization.

However, areas for improvement exist.

For Astra-Global, while current map representations balance information loss and token length, they may occasionally lack critical semantic details. Future work will focus on alternative map compression methods to optimize efficiency while maximizing semantic information retention. Additionally, current single-frame localization can fail in feature-scarce or highly repetitive environments; future plans include active exploration mechanisms and temporal reasoning for more robust localization.

For Astra-Local, improving robustness to out-of-distribution (OOD) scenarios is crucial, requiring enhanced model architectures and training methods. Redesigning the fallback system for tighter integration and seamless switching is also planned to improve system stability. Furthermore, integrating instruction-following capabilities will enable robots to understand and execute natural language commands, expanding their usability in dynamic, human-centric environments and fostering more natural human-robot interaction.

ByteDance Introduces Astra: A Dual-Model Architecture for Autonomous Robot Navigation