INSUBCONTINENT EXCLUSIVE:
The increasing integration of robots across sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches. Astra is an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.

Traditional navigation systems typically consist of multiple smaller, often rule-based modules that handle the core challenges of target localization, self-localization, and path planning.
Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, which is especially challenging in repetitive environments like warehouses, where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.

While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question. Astra (https://astra-mobility.github.io/) addresses these limitations.
Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.

Astra-Global: The Intelligent Brain for Global Localization

Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global localization. Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.

The construction of this robust localization system begins with offline mapping.
The research team developed an offline method to build a hybrid topological-semantic graph G=(V,E,L):

V (Nodes): Keyframes, obtained by temporal downsampling of input video and SfM-estimated 6-Degrees-of-Freedom (DoF) camera poses, act as nodes encoding camera poses and landmark references.
E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.
L (Landmarks): These landmarks store semantic attributes and are connected to multiple nodes via co-visibility relationships.

In practical localization, Astra-Global works in two stages, coarse and fine.
The coarse stage analyzes input images and localization prompts, detects landmarks, establishes correspondence with a pre-built landmark map, and filters candidates based on visual consistency.
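
The graph structure and the coarse matching step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `HybridGraph`, `Node`, `Landmark`, and `coarse_candidates` are hypothetical, and the simple vote count is a stand-in for the MLLM-based matching and visual-consistency filtering that the article attributes to Astra-Global.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A keyframe: a 6-DoF camera pose plus references to visible landmarks."""
    node_id: int
    pose: tuple  # (x, y, z, roll, pitch, yaw)
    landmark_ids: set = field(default_factory=set)


@dataclass
class Landmark:
    """A semantic landmark linked to every node that observes it."""
    landmark_id: int
    description: str
    node_ids: set = field(default_factory=set)


class HybridGraph:
    """Toy container for G = (V, E, L)."""

    def __init__(self):
        self.nodes = {}      # V
        self.edges = set()   # E, stored as frozensets because edges are undirected
        self.landmarks = {}  # L

    def add_node(self, node_id, pose):
        self.nodes[node_id] = Node(node_id, pose)

    def add_edge(self, a, b):
        self.edges.add(frozenset((a, b)))

    def add_landmark(self, landmark_id, description, node_ids):
        self.landmarks[landmark_id] = Landmark(landmark_id, description, set(node_ids))
        for nid in node_ids:
            self.nodes[nid].landmark_ids.add(landmark_id)

    def coarse_candidates(self, detected_landmark_ids, top_k=3):
        """Rank nodes by how many detected landmarks they co-observe (assumed heuristic)."""
        votes = {}
        for lid in detected_landmark_ids:
            landmark = self.landmarks.get(lid)
            if landmark is None:
                continue  # landmark is not in the pre-built map
            for nid in landmark.node_ids:
                votes[nid] = votes.get(nid, 0) + 1
        # Most votes first; ties broken by node id for a stable order.
        ranked = sorted(votes, key=lambda nid: (-votes[nid], nid))
        return ranked[:top_k]


graph = HybridGraph()
for i in range(3):
    graph.add_node(i, (float(i), 0.0, 0.0, 0.0, 0.0, 0.0))
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.add_landmark(10, "door sign for room 301", [0, 1])
graph.add_landmark(11, "fire extinguisher", [1, 2])
graph.add_landmark(12, "vending machine", [2])

candidates = graph.coarse_candidates([10, 11], top_k=2)
print(candidates)
```

Nodes that share the most landmarks with the query rank first; a finer comparison against those candidates would then be needed to produce a pose.
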
The fine stage then uses the query image and coarse output to sample reference map nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.

For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.

To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology.
Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization.
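
To make the GRPO phase more concrete, the sketch below combines four rule-based terms into a scalar reward and then normalizes rewards within a group of sampled responses, which is the group-relative step that gives GRPO its name. The term names follow the article, but the scoring rules, the weights, the response format, and the reading of the "extra landmark" term as a penalty on spurious landmarks are all assumptions made for illustration.

```python
import statistics


def rule_based_reward(response, truth, weights=(0.1, 0.3, 0.5, 0.1)):
    """Combine four rule-based terms into one scalar reward.

    The scoring rules and weights here are hypothetical placeholders.
    """
    w_format, w_extract, w_match, w_extra = weights
    predicted = response["landmarks"]
    expected = truth["landmarks"]

    format_score = 1.0 if response["well_formed"] else 0.0
    # Fraction of ground-truth landmarks that the response recovered.
    extract_score = len(predicted & expected) / len(expected) if expected else 0.0
    match_score = 1.0 if response["matched_node"] == truth["node"] else 0.0
    # Assumed reading of the "extra landmark" term: penalize spurious landmarks.
    spurious = len(predicted - expected)
    extra_score = 1.0 - spurious / len(predicted) if predicted else 0.0

    return (w_format * format_score + w_extract * extract_score
            + w_match * match_score + w_extra * extra_score)


def group_relative_advantages(rewards):
    """Normalize each reward by the mean and (population) std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


truth = {"landmarks": {"sign_301", "extinguisher"}, "node": 7}
good = {"well_formed": True, "landmarks": {"sign_301", "extinguisher"}, "matched_node": 7}
bad = {"well_formed": False, "landmarks": set(), "matched_node": 2}

rewards = [rule_based_reward(r, truth) for r in (good, bad)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses that score above their group's average get a positive advantage and are reinforced; those below average are discouraged.
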
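The GRPO-trained model surpassed SFT-only methods in the evaluated environments.

Astra-Local: The Intelligent Assistant for Local Planning

Astra-Local acts as the intelligent assistant for Astra's high-frequency tasks, generating local paths and estimating odometry from sensor data.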
Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.

The 4D spatio-temporal encoder replaces the perception and prediction modules of a traditional mobile stack. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering.
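
The 2D-to-3D step can be illustrated with a toy version of the lift-and-splat idea: each pixel feature is spread along its camera ray according to a depth distribution ("lift"), and the weighted features are summed into the voxels they fall in ("splat"). This is a minimal pure-Python sketch, not Astra's encoder: the function name `lift_splat`, the inputs, and the assumption that ray directions are already expressed in the robot frame are all hypothetical.

```python
import math


def lift_splat(pixel_rays, features, depth_probs, depth_bins, voxel_size, grid_dims):
    """Accumulate depth-weighted pixel features into a sparse voxel grid.

    pixel_rays:  one unit ray direction (x, y, z) per pixel
    features:    one feature vector per pixel
    depth_probs: per pixel, a probability for each entry of depth_bins
    Returns a dict mapping voxel index (i, j, k) to a summed feature vector.
    """
    channels = len(features[0])
    voxels = {}
    for ray, feat, probs in zip(pixel_rays, features, depth_probs):
        for depth, prob in zip(depth_bins, probs):
            # Lift: place the feature at this depth along the ray.
            point = tuple(depth * axis for axis in ray)
            index = tuple(int(math.floor(coord / voxel_size)) for coord in point)
            if not all(0 <= i < n for i, n in zip(index, grid_dims)):
                continue  # the point falls outside the voxel grid
            # Splat: sum-pool the probability-weighted feature into the voxel.
            cell = voxels.setdefault(index, [0.0] * channels)
            for k in range(channels):
                cell[k] += prob * feat[k]
    return voxels


voxels = lift_splat(
    pixel_rays=[(1.0, 0.0, 0.0)],
    features=[[2.0, 4.0]],
    depth_probs=[[0.25, 0.75]],
    depth_bins=[0.5, 1.5],
    voxel_size=1.0,
    grid_dims=(2, 1, 1),
)
print(voxels)
```

A real pipeline would derive the rays from camera intrinsics and extrinsics and predict the depth distribution with a network; here both are given as inputs to keep the example small.
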

The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.

The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching.
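
As a rough illustration of flow matching, the sketch below shows the two pieces such a planner needs: a training pair (an interpolated sample and its target velocity) and an Euler sampler that integrates a velocity field from noise to a trajectory. It assumes straight-line interpolation paths and replaces the Transformer with a hand-written oracle velocity function, so the names `training_pair`, `sample`, and `oracle_velocity` are hypothetical and the conditioning on 4D features, speed, and task information is omitted.

```python
def training_pair(noise, target, t):
    """Return (x_t, velocity) for a straight-line path from noise to target."""
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    velocity = [x - n for n, x in zip(noise, target)]  # regression target for the model
    return x_t, velocity


def sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# A flattened 2-waypoint trajectory: (x1, y1, x2, y2). Values are made up.
goal = [1.0, 0.0, 2.0, 0.5]


def oracle_velocity(x, t):
    """Stand-in for a trained network: the exact field when there is one target."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(goal, x)]


x_t, v_target = training_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
trajectory = sample(oracle_velocity, noise=[0.3, -1.2, 0.8, 0.1], steps=10)
print(x_t, v_target, trajectory)
```

In training, a network would be fit to predict `velocity` from `x_t`, `t`, and the conditioning inputs; at inference, `sample` turns random noise into a trajectory in a handful of steps.
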
To prevent collisions, the planning head incorporates a masked ESDF (Euclidean Signed Distance Field) loss. This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground truth trajectory mask, significantly reducing collision rates.
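
One way to picture an ESDF-based collision penalty is the sketch below. It makes several simplifying assumptions that are not from the article: the occupancy map is 2D rather than 3D, the distance field is computed by brute force, the penalty is a hinge on a safety margin, and the mask simply exempts cells flagged along the ground-truth trajectory. The article does not specify how the mask is applied, so treat this as one plausible reading.

```python
import math


def esdf_2d(occupancy):
    """Brute-force signed distance field, measured in cells.

    Free cells get the positive distance to the nearest obstacle; occupied
    cells get the negative distance to the nearest free cell.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    occupied = [rc for rc in cells if occupancy[rc[0]][rc[1]]]
    free = [rc for rc in cells if not occupancy[rc[0]][rc[1]]]

    field = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        inside = bool(occupancy[r][c])
        targets = free if inside else occupied
        dist = min((math.hypot(r - tr, c - tc) for tr, tc in targets), default=math.inf)
        field[r][c] = -dist if inside else dist
    return field


def masked_esdf_loss(waypoints, field, gt_mask, margin=1.5):
    """Mean hinge penalty for waypoints closer than `margin` to an obstacle.

    Cells flagged in `gt_mask` are skipped (assumed role of the trajectory mask).
    """
    total = 0.0
    for r, c in waypoints:
        if gt_mask[r][c]:
            continue
        total += max(0.0, margin - field[r][c])
    return total / len(waypoints)


occupancy = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
field = esdf_2d(occupancy)
no_mask = [[False] * 4 for _ in range(3)]
gt_mask = [[False] * 4 for _ in range(3)]
gt_mask[1][1] = True  # pretend the ground-truth path passes through this cell

predicted = [(1, 0), (1, 1)]
loss_unmasked = masked_esdf_loss(predicted, field, no_mask)
loss_masked = masked_esdf_loss(predicted, field, gt_mask)
print(loss_unmasked, loss_masked)
```

The waypoint next to the obstacle is penalized unless the mask exempts it, which keeps the collision term from fighting the imitation signal along the demonstrated path.
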
Experiments demonstrate its superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.

The odometry head trains a Transformer model to fuse information from different sensors. Each sensor modality is processed by a specific tokenizer, combined with modality embeddings and temporal positional embeddings, and fed into a Transformer encoder, which finally uses a CLS token to predict relative pose. This fusion improves rotational estimation accuracy and reduces overall trajectory error.

Experimental Validation

Extensive experiments were conducted in diverse indoor environments.
Astra-Global's localization capability was validated through various experiments, demonstrating superior performance in handling text and image localization queries. Unlike traditional Visual Place Recognition (VPR) methods, which rely on global features, Astra-Global precisely captures fine details like room numbers, preventing localization errors in similar scenes.

Viewpoint Robustness: Based on semantic landmarks, Astra-Global maintains stable localization even with large camera angle changes, where VPR methods typically fail.
Pose Accuracy: Astra-Global leverages landmark spatial relationships to select the best matching pose, showing significantly higher pose accuracy (within 1-meter distance error and 5-degree angular error) than traditional VPR, by over 30%.
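
The pose accuracy criterion above (within 1 meter and 5 degrees) can be computed as a simple success rate. The sketch below assumes planar poses `(x, y, yaw_degrees)` rather than full 6-DoF poses, and the function names are hypothetical; it is meant only to show how such a threshold metric is typically evaluated, not how the Astra team computed theirs.

```python
import math


def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def pose_success_rate(predictions, ground_truth, max_dist=1.0, max_angle=5.0):
    """Fraction of poses within both the distance and the angle threshold."""
    hits = 0
    for (px, py, pyaw), (tx, ty, tyaw) in zip(predictions, ground_truth):
        dist = math.hypot(px - tx, py - ty)
        if dist <= max_dist and angle_diff_deg(pyaw, tyaw) <= max_angle:
            hits += 1
    return hits / len(predictions)


# Made-up example poses: (x in meters, y in meters, yaw in degrees).
predictions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.5, 0.0, 358.0)]
ground_truth = [(0.3, 0.4, 2.0), (0.0, 0.0, 0.0), (0.5, 0.0, 1.0)]

rate = pose_success_rate(predictions, ground_truth)
print(rate)
```

The wrap-around in `angle_diff_deg` matters: headings of 358 and 1 degrees differ by 3 degrees, not 357.
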

The planning head, using Transformer-based flow matching and masked ESDF loss, outperformed methods like ACT and diffusion policies in collision rate, speed, and overall score on OOD datasets. The odometry experiments used multimodal datasets including synchronized image sequences, IMU, wheel data, and ground truth poses. Integrating IMU data dramatically improved rotational estimation accuracy, reducing overall trajectory error to approximately 2%. Further inclusion of wheel data enhanced scale stability and estimation accuracy, validating its multi-sensor data fusion capabilities.

Astra holds significant promise for future development and applications.
Its deployment can be expanded to more complex indoor environments like large shopping malls, hospitals, and libraries, where it can assist in tasks such as precise product location, efficient medical supply delivery, and book organization.

However, areas for improvement exist. For Astra-Global, while current map representations balance information loss and token length, they may occasionally lack critical semantic information. Future work will focus on alternative map compression methods to optimize efficiency while maximizing semantic information retention.
Additionally, current single-frame localization can fail in feature-scarce or highly repetitive environments; future plans include active exploration mechanisms and temporal reasoning for more robust localization.

For Astra-Local, improving robustness to out-of-distribution (OOD) scenarios is crucial, requiring enhanced model architectures and training methods. Redesigning the fallback system for tighter integration and seamless switching is also planned to improve system stability.
Furthermore, integrating instruction-following capabilities will enable robots to understand and execute natural language commands, expanding their usability in dynamic, human-centric environments and fostering more natural human-robot interaction.