OpenCity3D 🏙️

3D Urban Scene Understanding with Vision-Language Models

WACV 2025

1ETH Zürich 2University of Zurich 3Stanford University


OpenCity3D is a zero-shot approach for open-vocabulary 3D urban scene understanding.

Explanatory Video

Abstract

The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks. Existing works focus on indoor scenes or autonomous driving scenarios, and typically validate against a predefined set of semantic object classes. In this work, we analyze the capabilities of vision-language models for large-scale urban 3D scene understanding, and propose new applications of VLMs that directly operate on aerial 3D reconstructions of cities. In particular, we address higher-level 3D scene understanding tasks such as estimating population density, building age, property prices, crime rate, and noise pollution. Our analysis reveals surprising zero-shot and few-shot performance of VLMs in urban environments.

Prompt-based interaction with urban scenes


Illustration of the OpenCity3D pipeline. We first render multi-perspective images from aerial 3D reconstructions, then compute pixel-wise hierarchical vision-language features. Finally, we project these features back onto the mesh to enable language queries on the scene.
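The last step of the pipeline, lifting per-pixel features back onto the 3D mesh, can be illustrated with a minimal sketch. It assumes that the renderer provides, for each rendered view, an integer map telling which mesh vertex each pixel sees (`-1` for background), and that features from several views are fused by simple averaging. The function name `project_features_to_vertices`, the vertex-index maps, and the mean aggregation are all hypothetical choices for illustration; the actual OpenCity3D implementation may differ.

```python
import numpy as np

def project_features_to_vertices(pixel_features, vertex_ids, num_vertices):
    """Average per-pixel features onto mesh vertices (illustrative assumption).

    pixel_features: list of (H, W, D) float arrays, one per rendered view.
    vertex_ids:     list of (H, W) int arrays; index of the vertex seen by
                    each pixel, or -1 where no vertex is visible.
    Returns (vertex_features, seen): a (num_vertices, D) array and a boolean
    mask marking vertices observed in at least one view.
    """
    dim = pixel_features[0].shape[-1]
    sums = np.zeros((num_vertices, dim))
    counts = np.zeros(num_vertices)
    for feats, ids in zip(pixel_features, vertex_ids):
        mask = ids >= 0
        # np.add.at accumulates correctly when a vertex index repeats.
        np.add.at(sums, ids[mask], feats[mask])
        np.add.at(counts, ids[mask], 1.0)
    seen = counts > 0
    sums[seen] /= counts[seen, None]
    return sums, seen

# Toy example: two 2x2 views, 2-D features, three vertices (vertex 2 is never seen).
view0_feats = np.tile(np.array([2.0, 0.0]), (2, 2, 1))
view1_feats = np.tile(np.array([0.0, 4.0]), (2, 2, 1))
view0_ids = np.array([[0, 0], [1, -1]])
view1_ids = np.array([[1, 1], [-1, -1]])

vertex_features, seen = project_features_to_vertices(
    [view0_feats, view1_feats], [view0_ids, view1_ids], num_vertices=3
)
```

Once every vertex carries a feature vector, a language query reduces to comparing the text embedding of the prompt against `vertex_features`, for instance with cosine similarity.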



Zero-shot inference of abstract properties

Using OpenCity3D, you can estimate abstract properties by comparing the point-wise similarity to positive and negative prompts. For example, a building age score can be computed from the ratio of the similarities to "modern building" and "old building".
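The ratio-based scoring described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that per-point features and the two prompt embeddings live in the same embedding space, and it rescales cosine similarities from [-1, 1] to [0, 1] so that the ratio stays well defined (that rescaling is an assumption made here). The names `property_score`, `pos_embedding`, and `neg_embedding` are hypothetical, and the toy vectors stand in for real text-encoder outputs.

```python
import numpy as np

def cosine_similarity(features, query):
    """Cosine similarity between each row of `features` (N, D) and `query` (D,)."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return features @ query

def property_score(point_features, pos_embedding, neg_embedding, eps=1e-8):
    """Ratio of similarities to a positive and a negative prompt, per point."""
    # Rescale cosine similarity to [0, 1] (assumption) to avoid negative ratios.
    sim_pos = (cosine_similarity(point_features, pos_embedding) + 1.0) / 2.0
    sim_neg = (cosine_similarity(point_features, neg_embedding) + 1.0) / 2.0
    return sim_pos / (sim_neg + eps)

# Toy stand-ins for the embeddings of "modern building" and "old building".
pos_embedding = np.array([1.0, 0.0, 0.0, 0.0])
neg_embedding = np.array([0.0, 1.0, 0.0, 0.0])

# Two points: the first resembles the positive prompt, the second the negative one.
point_features = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.1, 1.0, 0.0, 0.0],
])

scores = property_score(point_features, pos_embedding, neg_embedding)
```

A higher score means the point looks more like the positive prompt than the negative one; in this toy case the first point scores higher than the second. Mapping such scores to real-world quantities such as construction years would require a separate calibration step.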

Zero-shot predicted age (left) vs. ground truth construction years (right) in Rotterdam.


Zero-shot predicted (left) vs. ground truth (right) prices of sold homes in Detroit.

BibTeX


@InProceedings{opencity3d2025,
    author    = {Bieri, Valentin and Zamboni, Marco and Blumer, Nicolas Samuel and Chen, Qingxuan and Engelmann, Francis},
    title     = {OpenCity3D: What do Vision-Language Models Know About Urban Environments?},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5147-5155}
}