
Generating Multimodal Driving Scenes via Next-Scene Prediction

CVPR 2025

1Xi'an Jiaotong University   2Horizon Robotics   3EPFL   4University of Chinese Academy of Sciences  
UMGen generates multimodal driving scenes, each scene integrating
1) ego-vehicle actions, 2) raster maps, 3) traffic agents, and 4) images.
All visualized elements are generated by UMGen.

1. Overview


Overview of UMGen. (a) Starting from a random initialization, UMGen generates ego-centric, multimodal scenes frame by frame. Each scene encompasses four modalities: ego-vehicle action, map, traffic agent, and image. (b) UMGen offers multiple functions: it can not only imagine multimodal scene sequences autonomously, but can also predict the other modalities from input ego-vehicle actions. Furthermore, UMGen can incorporate user-specified agent actions to create customized scene sequences.
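The control modes shown in (b) can be pictured as a single decoding loop in which user-provided tokens are pinned to their slots while the model predicts everything else. The sketch below illustrates this idea only; predict_next_token, tokens_per_frame, and the slot-indexing scheme are hypothetical placeholders, not UMGen's actual interface.

    import torch

    # Hypothetical rollout helper: free generation, ego-action-conditioned
    # generation, and user-specified agent control share one loop. Slots with
    # user-provided tokens are forced; all other slots are predicted.
    def generate_next_scene(model, history_tokens, forced_tokens=None):
        """history_tokens: (1, T, N) past scene tokens.
        forced_tokens: dict {slot_index: token_id} for user-controlled slots,
        e.g. the ego-action slot or the slots of one traffic agent."""
        forced_tokens = forced_tokens or {}
        generated = []
        for slot in range(model.tokens_per_frame):      # fixed modality order
            if slot in forced_tokens:                   # user-specified token
                tok = torch.tensor([[forced_tokens[slot]]])
            else:                                       # model-predicted token
                logits = model.predict_next_token(history_tokens, generated)
                tok = logits.argmax(dim=-1, keepdim=True)
            generated.append(tok)
        return torch.cat(generated, dim=-1)             # (1, N) next-scene tokens

Leaving forced_tokens empty gives fully autonomous generation; forcing only the ego-action slot corresponds to the action-conditioned mode; forcing an agent's slots yields user-specified scene generation.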

2. Abstract

Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by capturing only a limited range of modalities, restricting their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies an ego-action-based transformation to the map features. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
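To make the two-stage factorization concrete, here is a minimal PyTorch sketch, assuming each scene is a fixed-length sequence of N discrete tokens in the order ego-action, map, agents, image. Layer sizes, module names, and the teacher-forcing details are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    def causal_mask(n, device):
        # Upper-triangular -inf mask: position i attends only to positions <= i.
        return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

    class TwoStageScenePredictor(nn.Module):
        def __init__(self, vocab_size=1024, d_model=256, n_heads=8, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.tar = nn.TransformerEncoder(make_layer(), n_layers)   # temporal stage
            self.oar = nn.TransformerEncoder(make_layer(), n_layers)   # intra-frame stage
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens):
            # tokens: (B, T, N) discrete ids, N tokens per frame in a fixed modality order.
            B, T, N = tokens.shape
            x = self.embed(tokens)                                     # (B, T, N, D)

            # TAR: causal attention over time, one sequence per token slot.
            x_t = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
            h = self.tar(x_t, mask=causal_mask(T, x.device))
            h = h.reshape(B, N, T, -1).permute(0, 2, 1, 3)             # (B, T, N, D)

            # OAR (teacher forcing): predict frame t+1's tokens from the history
            # features of frames <= t plus the preceding tokens of frame t+1.
            hist = h[:, :-1]                                           # (B, T-1, N, D)
            nxt = self.embed(tokens[:, 1:])                            # ground-truth next frames
            prev = torch.cat([torch.zeros_like(nxt[:, :, :1]), nxt[:, :, :-1]], dim=2)
            oar_in = (hist + prev).reshape(B * (T - 1), N, -1)
            h_f = self.oar(oar_in, mask=causal_mask(N, x.device))

            # Logits for every token of every next scene; compare against tokens[:, 1:].
            return self.head(h_f).reshape(B, T - 1, N, -1)

At inference time the OAR stage is instead rolled out slot by slot, feeding each predicted token back in before predicting the next one.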

3. Method


Pipeline of UMGen. Given T past frames of multimodal driving scenes, comprising ego-actions, maps, traffic agents, and images, each modality is tokenized into discrete tokens. The token embeddings are first processed by the Ego-action Prediction module, which forecasts the ego-action for time step T+1. Using this predicted ego-action, the AMA module adjusts the map features. Next, the TAR module aggregates temporal information across the sequence, while the OAR module performs sequential prediction within each frame, autoregressively generating each token conditioned on the aggregated history information. Finally, the predicted tokens are fed to the decoder to obtain the next scene.
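The page does not spell out the transformation AMA applies, so the following is only a rough sketch of one plausible reading: the ego-centric map feature grid is warped by the predicted planar ego motion (dx, dy, dyaw) so that the map stays aligned with the new ego pose before the TAR and OAR stages run. The function name, arguments, and meters-per-cell scale are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def align_map_features(map_feat, dx, dy, dyaw, meters_per_cell=0.5):
        """Warp ego-centric map features by a planar rigid ego motion (sketch only).
        map_feat: (B, C, H, W); dx, dy in meters and dyaw in radians, each (B,)."""
        B, C, H, W = map_feat.shape
        cos, sin = torch.cos(dyaw), torch.sin(dyaw)
        # Express the translation in normalized grid coordinates ([-1, 1] spans the map).
        tx = dx / (0.5 * W * meters_per_cell)
        ty = dy / (0.5 * H * meters_per_cell)
        # Per-sample 2x3 affine matrices encoding the ego rotation and the opposite
        # translation; exact sign conventions depend on the map's coordinate frame.
        theta = torch.stack([
            torch.stack([cos, -sin, -tx], dim=-1),
            torch.stack([sin,  cos, -ty], dim=-1),
        ], dim=1)                                        # (B, 2, 3)
        grid = F.affine_grid(theta, list(map_feat.shape), align_corners=False)
        return F.grid_sample(map_feat, grid, align_corners=False)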

4. Visualization


A. Demonstration of Autoregressive Scene Generation
UMGen autoregressively generates four key modalities within each frame: ego-action, map, agent, and image.
B. Driving Scene Sequence Generation
Long-duration multimodal driving scene generation. All modalities in the visualization are generated by UMGen.
Diverse multimodal driving scene generation.

C. Interactive Ego-vehicle Control
The ego-vehicle is controlled to either drive straight or make a right turn at the intersection.
The ego-vehicle is controlled to either wait behind the agent or execute a lane change to overtake.

D. User-Specified Scene Generation
The agent is controlled to simulate a cut-in maneuver. The ego-vehicle is controlled to brake or to execute a lane change to avoid a collision.

E. Video Enhancement via a Diffusion Model
We train a transformer-based diffusion model to further improve the quality of generated videos, leading to higher resolution, better visual clarity, and enhanced overall realism.
Left: original UMGen-generated video. Right: diffusion-refined UMGen-generated video.
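The transformer-based refiner above is not described in detail here; as a rough illustration, a conditional diffusion refiner can be trained by noising the high-quality target frame and asking a denoiser, conditioned on the UMGen output, to predict that noise. The epsilon-prediction objective, the toy cosine-style schedule, and the channel-concatenation conditioning below are assumptions, not the authors' setup.

    import math
    import torch
    import torch.nn.functional as F

    def refinement_training_step(denoiser, coarse_frames, target_frames, num_steps=1000):
        """denoiser: hypothetical network taking (noisy + coarse channels, timestep).
        coarse_frames: UMGen-generated frames (B, C, H, W); target_frames: high-quality frames."""
        B = target_frames.shape[0]
        t = torch.randint(0, num_steps, (B,), device=target_frames.device)
        # Toy cosine-style schedule: alpha_bar in (0, 1], decreasing with t.
        alpha_bar = torch.cos(0.5 * math.pi * t.float() / num_steps) ** 2
        a = alpha_bar.view(B, 1, 1, 1)
        noise = torch.randn_like(target_frames)
        noisy = a.sqrt() * target_frames + (1.0 - a).sqrt() * noise
        # Condition on the coarse frame by channel concatenation.
        pred_noise = denoiser(torch.cat([noisy, coarse_frames], dim=1), t)
        return F.mse_loss(pred_noise, noise)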