I-Perceive: A Foundation Model for Vision-Language Active Perception

Anonymous Authors

πŸ”’ This page is hosted anonymously while the paper is under review. Authors, affiliations, code, and the camera-ready URL will be revealed after the review process.

Online demo

Upload up to six photos of a scene, describe a task, and the backend will predict the next-best viewpoint and visualise it as a 3D point cloud. Inference runs on a single GPU on our private compute node.

Click a thumbnail to set it as the start frame (β˜…).

  • The demo runs on a single private GPU. Requests are processed one at a time; expect a short wait under load.
  • Please do not upload sensitive content.
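The demo is driven entirely through the upload form on this page; no public API is documented. Purely as an illustration of its inputs (a few scene photos, a start frame, a task description) and output (a predicted next-best viewpoint with a point-cloud visualisation), here is a minimal, hypothetical client sketch. The endpoint URL and field names are assumptions, not the real interface.

```python
# Hypothetical sketch only: the endpoint URL and field names are assumptions
# used to illustrate the demo's inputs and output, not a documented API.
import requests

DEMO_URL = "https://example.org/i-perceive/predict"  # placeholder, not the real endpoint

# A few photos of the scene; one of them is marked as the start frame.
files = [("images", open(p, "rb")) for p in ["view1.jpg", "view2.jpg", "view3.jpg"]]
data = {"task": "find the mug on the kitchen counter", "start_frame": 0}

resp = requests.post(DEMO_URL, files=files, data=data, timeout=120)
resp.raise_for_status()
result = resp.json()  # e.g., predicted camera pose and a link to the point-cloud visualisation
print(result)
```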

Abstract

Figure: I-Perceive overview.

Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. Although active perception is critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, grounded in image-based scene context. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both prediction accuracy and the instruction-following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.
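To make the interface described in the abstract concrete, the sketch below shows one way pooled semantic (VLM) and geometric features for a set of observed images could be fused to regress a next-best camera view as a 6-DoF pose. This is an illustrative toy, not the paper's implementation: the module structure, feature dimensions, and pose parameterisation are all assumptions.

```python
# Illustrative sketch only: dimensions, modules, and the pose parameterisation
# (3D translation + unit quaternion) are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class ActivePerceptionHead(nn.Module):
    """Toy fusion head: concatenates pooled VLM and geometry features and
    regresses a camera pose (x, y, z, and a unit quaternion)."""

    def __init__(self, vlm_dim: int = 1024, geo_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vlm_dim + geo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),  # 3 values for translation, 4 for orientation
        )

    def forward(self, vlm_feat: torch.Tensor, geo_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vlm_feat, geo_feat], dim=-1)
        pose = self.fuse(fused)
        xyz, quat = pose[..., :3], pose[..., 3:]
        quat = quat / quat.norm(dim=-1, keepdim=True).clamp(min=1e-8)  # normalise orientation
        return torch.cat([xyz, quat], dim=-1)

# Example: pooled features for a batch of 2 scenes, e.g. from a VLM encoding the
# images plus the language instruction, and a geometric foundation model encoding
# the same images. Both encoders are assumed here, not specified by the paper.
head = ActivePerceptionHead()
vlm_feat = torch.randn(2, 1024)
geo_feat = torch.randn(2, 768)
next_view = head(vlm_feat, geo_feat)  # shape (2, 7): predicted next-best camera pose
```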