NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng*, Yandong Ji*, Zhaojing Yang*, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin✝︎, Sifei Liu✝︎, Xiaolong Wang✝︎
preprint, 2024
A two-level framework that combines VLAs with locomotion skills for navigation. The VLA is adapted from a VLM and learns from human touring videos.
This paper tackles Vision-and-Language Navigation with legged robots, which not only gives humans a flexible way to issue commands but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way down to low-level leg joint actions. We propose NaVILA, a two-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in the form of language (e.g., "moving forward 75cm"), which serve as input to a visual locomotion RL policy for execution. NaVILA substantially improves over previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments.
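To make the two-level interface concrete, here is a minimal sketch of how a mid-level language action emitted by the VLA (e.g., "moving forward 75cm") could be parsed into a velocity command for a low-level locomotion policy. The function names, regex patterns, and default speeds are illustrative assumptions, not NaVILA's released code.

```python
# Hypothetical sketch of NaVILA's two-level interface: the VLA emits a
# mid-level action as language, which is parsed into a command for the
# low-level visual locomotion RL policy. All names are illustrative.
import math
import re
from dataclasses import dataclass

@dataclass
class LocomotionCommand:
    vx: float        # forward velocity (m/s)
    wz: float        # yaw rate (rad/s)
    duration: float  # seconds to apply the command

def parse_midlevel_action(text: str, speed: float = 0.5, yaw_rate: float = 0.5) -> LocomotionCommand:
    """Convert a language action such as 'moving forward 75cm' or
    'turn left 30 degrees' into a velocity command (assumed conventions)."""
    move = re.search(r"forward\s+(\d+(?:\.\d+)?)\s*cm", text)
    if move:
        dist_m = float(move.group(1)) / 100.0
        return LocomotionCommand(vx=speed, wz=0.0, duration=dist_m / speed)
    turn = re.search(r"turn\s+(left|right)\s+(\d+(?:\.\d+)?)\s*degrees?", text)
    if turn:
        sign = 1.0 if turn.group(1) == "left" else -1.0
        angle_rad = math.radians(float(turn.group(2)))
        return LocomotionCommand(vx=0.0, wz=sign * yaw_rate, duration=angle_rad / yaw_rate)
    return LocomotionCommand(vx=0.0, wz=0.0, duration=0.0)  # e.g. "stop"

print(parse_midlevel_action("moving forward 75cm"))
print(parse_midlevel_action("turn left 30 degrees"))
```

In the actual system the resulting command would be tracked by the vision-based RL policy at the joint level; the parser above only illustrates the language-to-command hand-off.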
NVILA: Efficient Frontier Visual Language Models
NVILA Team: Zhijian Liu*, Ligeng Zhu*, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin✝︎, Song Han✝︎, Yao Lu✝︎
CVPR, 2025
Efficient frontier VLMs with efficient training and inference.
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
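The sketch below illustrates the "scale-then-compress" idea in isolation: the effective resolution is first increased by tiling the image and encoding each tile, and the resulting visual token grid is then compressed by pooling. The tiling layout, token counts, and the placeholder encoder are assumptions for illustration, not NVILA's actual configuration.

```python
# Minimal "scale-then-compress" sketch (assumed parameters, not NVILA's code).
import torch
import torch.nn.functional as F

def encode_tile(tile: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Stand-in for a vision encoder: maps a tile to a 16x16 grid of tokens."""
    b = tile.shape[0]
    return torch.randn(b, 16, 16, dim)  # placeholder features

def scale_then_compress(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Tile the image into a grid x grid layout (scale), encode each tile,
    then average-pool the token map 2x2 (compress)."""
    b, c, h, w = image.shape
    th, tw = h // grid, w // grid
    token_rows = []
    for i in range(grid):
        row = []
        for j in range(grid):
            tile = image[:, :, i*th:(i+1)*th, j*tw:(j+1)*tw]
            row.append(encode_tile(tile))            # (b, 16, 16, d)
        token_rows.append(torch.cat(row, dim=2))      # concat tiles along width
    tokens = torch.cat(token_rows, dim=1)             # (b, 32, 32, d)
    tokens = tokens.permute(0, 3, 1, 2)               # (b, d, 32, 32)
    tokens = F.avg_pool2d(tokens, kernel_size=2)      # compress: (b, d, 16, 16)
    return tokens.flatten(2).transpose(1, 2)          # (b, 256, d) token sequence

img = torch.randn(1, 3, 448, 448)
print(scale_then_compress(img).shape)  # torch.Size([1, 256, 64])
```

The point of the ordering is that high-resolution detail is captured before compression, so the language model sees fewer tokens without losing the benefit of the larger input.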
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks.
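To illustrate the region-prompt plus depth-plugin idea, the sketch below pools RGB and depth feature maps over a user-specified region mask and fuses them into a single region token that a language model could attend to. The module structure, fusion scheme, and dimensions are assumptions for illustration, not SpatialRGPT's implementation.

```python
# Hypothetical region + depth fusion sketch (not SpatialRGPT's actual code).
import torch
import torch.nn as nn

class RegionDepthPlugin(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.depth_proj = nn.Linear(dim, dim)  # project depth features
        self.fuse = nn.Linear(2 * dim, dim)    # fuse RGB + depth region tokens

    def forward(self, rgb_feat, depth_feat, mask):
        # rgb_feat, depth_feat: (B, H, W, D); mask: (B, H, W), 1 inside the region
        w = mask.unsqueeze(-1) / mask.sum(dim=(1, 2), keepdim=True).unsqueeze(-1).clamp(min=1)
        rgb_tok = (rgb_feat * w).sum(dim=(1, 2))                      # (B, D)
        depth_tok = (self.depth_proj(depth_feat) * w).sum(dim=(1, 2)) # (B, D)
        return self.fuse(torch.cat([rgb_tok, depth_tok], dim=-1))     # (B, D) region token

plugin = RegionDepthPlugin()
rgb = torch.randn(1, 24, 24, 256)
depth = torch.randn(1, 24, 24, 256)
mask = torch.zeros(1, 24, 24)
mask[:, 4:12, 4:12] = 1.0   # a user-specified region proposal
print(plugin(rgb, depth, mask).shape)  # torch.Size([1, 256])
```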
Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity texture given the shapes of 3D assets, which has been relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on real-world object datasets, where we achieve not only realistic synthesis but also substantial improvements over state-of-the-art methods on texture control and editing.
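A rough sketch of the underlying idea, under simplifying assumptions: surface points are mapped to a shared UV sphere, and a texture field conditioned on a texture code predicts color at those UV locations, so the same code can be reused across shapes that share the UV space. The network, the toy canonical mapping, and all dimensions below are illustrative, not the paper's implementation.

```python
# Illustrative UV-sphere texture field (assumed architecture, not TUVF's code).
import torch
import torch.nn as nn

class TextureField(nn.Module):
    def __init__(self, code_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, uv_points, code):
        # uv_points: (N, 3) points on the unit sphere; code: (code_dim,)
        code = code.unsqueeze(0).expand(uv_points.shape[0], -1)
        return self.mlp(torch.cat([uv_points, code], dim=-1))

def to_uv_sphere(points):
    """Toy canonical mapping: normalize shape points onto the unit sphere."""
    centered = points - points.mean(dim=0, keepdim=True)
    return centered / centered.norm(dim=-1, keepdim=True).clamp(min=1e-6)

field = TextureField()
shape_pts = torch.randn(1024, 3)   # surface samples of some shape
style_code = torch.randn(64)       # one code = one appearance style
colors = field(to_uv_sphere(shape_pts), style_code)
print(colors.shape)                # torch.Size([1024, 3])
```

Because the field is queried in the shared sphere space rather than on the shape itself, swapping in a different shape's points while keeping the same code transfers the appearance style.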
Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng*, Xueting Li*, Sifei Liu, Min Sun, Ming-Hsuan Yang
ECCV, 2022
We decompose point clouds into meaningful shape sequences and encode them with a transformer for generation.
With the capacity to model long-range dependencies in sequential data, transformers have shown remarkable performance in a variety of generative tasks such as image, audio, and text generation. Yet, taming them to generate less structured and voluminous data formats such as high-resolution point clouds has seldom been explored, due to the ambiguous sequentialization process and infeasible computation burden. In this paper, we aim to further exploit the power of transformers and employ them for the task of 3D point cloud generation. The key idea is to decompose point clouds of one category into semantically aligned sequences of shape compositions, via a learned canonical space. These shape compositions can then be quantized and used to learn a context-rich composition codebook for point cloud generation. Experimental results on point cloud reconstruction and unconditional generation show that our model performs favorably against state-of-the-art approaches. Furthermore, our model can be easily extended to multi-modal shape completion as an application for conditional shape generation.
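A simplified sketch of the pipeline: points are ordered via a (toy) canonical mapping, grouped into fixed-size compositions, and each composition is quantized against a codebook; a transformer would then model the resulting index sequence autoregressively. The grouping rule and quantization below are illustrative stand-ins for the learned components in the paper.

```python
# Toy composition + quantization sketch (assumed, not the paper's pipeline).
import torch

def canonical_order(points):
    """Toy canonical mapping: sort points by angle around the z-axis."""
    angles = torch.atan2(points[:, 1], points[:, 0])
    return points[torch.argsort(angles)]

def compose_and_quantize(points, codebook, group_size=64):
    ordered = canonical_order(points)              # (N, 3) semantically ordered points
    groups = ordered.reshape(-1, group_size * 3)   # (N/g, g*3) shape compositions
    dists = torch.cdist(groups, codebook)          # distance to each codebook entry
    return dists.argmin(dim=-1)                    # (N/g,) token indices

points = torch.randn(1024, 3)                      # one point cloud
codebook = torch.randn(512, 64 * 3)                # K=512 composition codes
tokens = compose_and_quantize(points, codebook)
print(tokens.shape)   # torch.Size([16]) -- the sequence a transformer would model
```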
We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the original input instance shape. Placed at the bottleneck, this primitive plays a key role in mapping all the unordered point clouds onto the canonical surface and reconstructing them in an ordered fashion. Once trained, points from different shape instances that are mapped to the same locations on the primitive surface are determined to be correspondence pairs. Our method does not require any form of annotation or self-supervised part segmentation network and can handle unaligned input point clouds. Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our model performs favorably against state-of-the-art correspondence learning methods.
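A minimal sketch of how correspondences would be read off after training: each shape's points are mapped to the canonical sphere, and points from two instances that land at (nearly) the same sphere location are paired. The encoder below is a geometric placeholder; in the paper it is a learned network.

```python
# Illustrative correspondence readout via a canonical sphere (placeholder encoder).
import torch

def encode_to_sphere(points):
    """Placeholder canonical encoder: project points onto the unit sphere."""
    centered = points - points.mean(dim=0, keepdim=True)
    return centered / centered.norm(dim=-1, keepdim=True).clamp(min=1e-6)

def correspondences(shape_a, shape_b):
    """Pair each point of shape_a with the point of shape_b whose canonical
    sphere location is closest."""
    sphere_a = encode_to_sphere(shape_a)     # (N, 3)
    sphere_b = encode_to_sphere(shape_b)     # (M, 3)
    dists = torch.cdist(sphere_a, sphere_b)  # (N, M) pairwise distances on the sphere
    return dists.argmin(dim=-1)              # index into shape_b for each point of shape_a

a = torch.randn(512, 3)
b = torch.randn(512, 3)
match = correspondences(a, b)
print(match.shape)  # torch.Size([512])
```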