I-Perceive: A Foundation Model for Vision-Language Active Perception

Anonymous Authors

πŸ”’ This page is hosted anonymously while the paper is under review. Authors, affiliations, code, and the camera-ready URL will be revealed after the review process.

Online demo

Upload up to six photos of a scene, describe a task, and the backend will predict the next-best viewpoint and visualise it as a 3D point cloud. Inference runs on a single GPU on our private compute node.

Click a thumbnail to set it as the start frame (β˜…).

  • The demo runs on a single private GPU. Requests are processed one at a time; expect a short wait under load.
  • Please do not upload sensitive content.
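The demo is driven entirely through the upload form on this page; no public API is documented. Purely as an illustration of its inputs (a few scene photos, a start frame, a task description) and output (a predicted next-best viewpoint with a point-cloud visualisation), here is a minimal, hypothetical client sketch. The endpoint URL and field names are assumptions, not the real interface.

```python
# Hypothetical sketch only: the endpoint URL and field names are assumptions
# used to illustrate the demo's inputs and output, not a documented API.
import requests

DEMO_URL = "https://example.org/i-perceive/predict"  # placeholder, not the real endpoint

# A few photos of the scene; one of them is marked as the start frame.
files = [("images", open(p, "rb")) for p in ["view1.jpg", "view2.jpg", "view3.jpg"]]
data = {"task": "find the mug on the kitchen counter", "start_frame": 0}

resp = requests.post(DEMO_URL, files=files, data=data, timeout=120)
resp.raise_for_status()
result = resp.json()  # e.g., predicted camera pose and a link to the point-cloud visualisation
print(result)
```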

Abstract

Figure: I-Perceive overview.

Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. Although active perception is critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, grounded in image-based scene context. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both prediction accuracy and the instruction-following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.
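To make the interface described in the abstract concrete, the sketch below shows one way pooled semantic (VLM) and geometric features for a set of observed images could be fused to regress a next-best camera view as a 6-DoF pose. This is an illustrative toy, not the paper's implementation: the module structure, feature dimensions, and pose parameterisation are all assumptions.

```python
# Illustrative sketch only: dimensions, modules, and the pose parameterisation
# (3D translation + unit quaternion) are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class ActivePerceptionHead(nn.Module):
    """Toy fusion head: concatenates pooled VLM and geometry features and
    regresses a camera pose (x, y, z, and a unit quaternion)."""

    def __init__(self, vlm_dim: int = 1024, geo_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vlm_dim + geo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),  # 3 values for translation, 4 for orientation
        )

    def forward(self, vlm_feat: torch.Tensor, geo_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vlm_feat, geo_feat], dim=-1)
        pose = self.fuse(fused)
        xyz, quat = pose[..., :3], pose[..., 3:]
        quat = quat / quat.norm(dim=-1, keepdim=True).clamp(min=1e-8)  # normalise orientation
        return torch.cat([xyz, quat], dim=-1)

# Example: pooled features for a batch of 2 scenes, e.g. from a VLM encoding the
# images plus the language instruction, and a geometric foundation model encoding
# the same images. Both encoders are assumed here, not specified by the paper.
head = ActivePerceptionHead()
vlm_feat = torch.randn(2, 1024)
geo_feat = torch.randn(2, 768)
next_view = head(vlm_feat, geo_feat)  # shape (2, 7): predicted next-best camera pose
```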