The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks.
Existing works focus on indoor scenes or autonomous driving scenarios, and typically validate against a predefined set of semantic object classes.
In this work, we analyze the capabilities of VLMs for large-scale urban 3D scene understanding
and propose new applications of VLMs that directly operate on aerial 3D reconstructions of cities.
In particular, we address higher-level 3D scene understanding tasks such as estimating population density, building age, property prices, crime rate, and noise pollution.
Our analysis reveals surprisingly strong zero-shot and few-shot performance of VLMs in urban environments.