
Generating Multimodal Driving Scenes via Next-Scene Prediction

CVPR 2025

1Xi'an Jiaotong University   2Horizon Robotics   3EPFL   4University of Chinese Academy of Sciences  
UMGen generates multimodal driving scenes, each scene integrating
1) ego-vehicle actions, 2) raster maps, 3) traffic agents, and 4) images.
All visualized elements are generated by UMGen.

1. Overview


Overview of UMGen. (a) Starting from a random initialization, UMGen generates ego-centric, multimodal scenes frame by frame. Each scene encompasses four modalities: ego-vehicle action, map, traffic agent, and image. (b) UMGen offers multiple functions: it can not only imagine multimodal scene sequences autonomously, but can also predict the other modalities from input ego-vehicle actions. Furthermore, UMGen can incorporate user-specified agent actions to create customized scene sequences.
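The control modes shown in (b) can be pictured as a single decoding loop in which user-provided tokens are pinned to their slots while the model predicts everything else. The sketch below illustrates this idea only; predict_next_token, tokens_per_frame, and the slot-indexing scheme are hypothetical placeholders, not UMGen's actual interface.

    import torch

    # Hypothetical rollout helper: free generation, ego-action-conditioned
    # generation, and user-specified agent control share one loop. Slots with
    # user-provided tokens are forced; all other slots are predicted.
    def generate_next_scene(model, history_tokens, forced_tokens=None):
        """history_tokens: (1, T, N) past scene tokens.
        forced_tokens: dict {slot_index: token_id} for user-controlled slots,
        e.g. the ego-action slot or the slots of one traffic agent."""
        forced_tokens = forced_tokens or {}
        generated = []
        for slot in range(model.tokens_per_frame):      # fixed modality order
            if slot in forced_tokens:                   # user-specified token
                tok = torch.tensor([[forced_tokens[slot]]])
            else:                                       # model-predicted token
                logits = model.predict_next_token(history_tokens, generated)
                tok = logits.argmax(dim=-1, keepdim=True)
            generated.append(tok)
        return torch.cat(generated, dim=-1)             # (1, N) next-scene tokens

Leaving forced_tokens empty gives fully autonomous generation; forcing only the ego-action slot corresponds to the action-conditioned mode; forcing an agent's slots yields user-specified scene generation.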

2. Abstract

Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by capturing only a limited range of modalities, restricting their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies an ego-action-based transformation to the map features. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
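To make the two-stage factorization concrete, here is a minimal PyTorch sketch, assuming each scene is a fixed-length sequence of N discrete tokens in the order ego-action, map, agents, image. Layer sizes, module names, and the teacher-forcing details are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    def causal_mask(n, device):
        # Upper-triangular -inf mask: position i attends only to positions <= i.
        return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

    class TwoStageScenePredictor(nn.Module):
        def __init__(self, vocab_size=1024, d_model=256, n_heads=8, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.tar = nn.TransformerEncoder(make_layer(), n_layers)   # temporal stage
            self.oar = nn.TransformerEncoder(make_layer(), n_layers)   # intra-frame stage
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens):
            # tokens: (B, T, N) discrete ids, N tokens per frame in a fixed modality order.
            B, T, N = tokens.shape
            x = self.embed(tokens)                                     # (B, T, N, D)

            # TAR: causal attention over time, one sequence per token slot.
            x_t = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
            h = self.tar(x_t, mask=causal_mask(T, x.device))
            h = h.reshape(B, N, T, -1).permute(0, 2, 1, 3)             # (B, T, N, D)

            # OAR (teacher forcing): predict frame t+1's tokens from the history
            # features of frames <= t plus the preceding tokens of frame t+1.
            hist = h[:, :-1]                                           # (B, T-1, N, D)
            nxt = self.embed(tokens[:, 1:])                            # ground-truth next frames
            prev = torch.cat([torch.zeros_like(nxt[:, :, :1]), nxt[:, :, :-1]], dim=2)
            oar_in = (hist + prev).reshape(B * (T - 1), N, -1)
            h_f = self.oar(oar_in, mask=causal_mask(N, x.device))

            # Logits for every token of every next scene; compare against tokens[:, 1:].
            return self.head(h_f).reshape(B, T - 1, N, -1)

At inference time the OAR stage is instead rolled out slot by slot, feeding each predicted token back in before predicting the next one.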

3. Method


Pipeline of UMGen. Given T past frames of multimodal driving scenes, comprising ego-actions, maps, traffic agents, and images, each modality is tokenized into discrete tokens. The token embeddings are first processed by the Ego-action Prediction module, which forecasts the ego-action for time step T+1. Using this predicted ego-action, the AMA module adjusts the map features. Next, the TAR module aggregates temporal information across the sequence, while the OAR module performs sequential prediction within each frame, autoregressively generating each token conditioned on the aggregated history information. Finally, the predicted tokens are fed to the decoder to obtain the next scene.
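The page does not spell out the transformation AMA applies, so the following is only a rough sketch of one plausible reading: the ego-centric map feature grid is warped by the predicted planar ego motion (dx, dy, dyaw) so that the map stays aligned with the new ego pose before the TAR and OAR stages run. The function name, arguments, and meters-per-cell scale are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def align_map_features(map_feat, dx, dy, dyaw, meters_per_cell=0.5):
        """Warp ego-centric map features by a planar rigid ego motion (sketch only).
        map_feat: (B, C, H, W); dx, dy in meters and dyaw in radians, each (B,)."""
        B, C, H, W = map_feat.shape
        cos, sin = torch.cos(dyaw), torch.sin(dyaw)
        # Express the translation in normalized grid coordinates ([-1, 1] spans the map).
        tx = dx / (0.5 * W * meters_per_cell)
        ty = dy / (0.5 * H * meters_per_cell)
        # Per-sample 2x3 affine matrices encoding the ego rotation and the opposite
        # translation; exact sign conventions depend on the map's coordinate frame.
        theta = torch.stack([
            torch.stack([cos, -sin, -tx], dim=-1),
            torch.stack([sin,  cos, -ty], dim=-1),
        ], dim=1)                                        # (B, 2, 3)
        grid = F.affine_grid(theta, list(map_feat.shape), align_corners=False)
        return F.grid_sample(map_feat, grid, align_corners=False)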

4. Visualization


A. Demonstration of Autoregressive Scene Generation
UMGen autoregressively generates four key modalities within each frame: ego-action, map, agent, and image.
B. Driving Scene Sequence Generation
Long-duration multimodal driving scene generation. All modalities in the visualization are generated by UMGen.
Diverse multimodal driving scene generation.

C. Interactive Ego-vehicle Control
The ego-vehicle is controlled to either drive straight or make a right turn at the intersection.
The ego-vehicle is controlled to either wait behind the agent or execute a lane change to overtake.

D. User-Specified Scene Generation
The agent is controlled to simulate a cut-in maneuver. The ego-vehicle is controlled to brake or to execute a lane change to avoid a collision.

E. Video Enhancement via a Diffusion Model
We train a transformer-based diffusion model to further improve the quality of generated videos, leading to higher resolution, better visual clarity, and enhanced overall realism.
Left: original UMGen-generated video. Right: diffusion-refined UMGen-generated video.
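The transformer-based refiner above is not described in detail here; as a rough illustration, a conditional diffusion refiner can be trained by noising the high-quality target frame and asking a denoiser, conditioned on the UMGen output, to predict that noise. The epsilon-prediction objective, the toy cosine-style schedule, and the channel-concatenation conditioning below are assumptions, not the authors' setup.

    import math
    import torch
    import torch.nn.functional as F

    def refinement_training_step(denoiser, coarse_frames, target_frames, num_steps=1000):
        """denoiser: hypothetical network taking (noisy + coarse channels, timestep).
        coarse_frames: UMGen-generated frames (B, C, H, W); target_frames: high-quality frames."""
        B = target_frames.shape[0]
        t = torch.randint(0, num_steps, (B,), device=target_frames.device)
        # Toy cosine-style schedule: alpha_bar in (0, 1], decreasing with t.
        alpha_bar = torch.cos(0.5 * math.pi * t.float() / num_steps) ** 2
        a = alpha_bar.view(B, 1, 1, 1)
        noise = torch.randn_like(target_frames)
        noisy = a.sqrt() * target_frames + (1.0 - a).sqrt() * noise
        # Condition on the coarse frame by channel concatenation.
        pred_noise = denoiser(torch.cat([noisy, coarse_frames], dim=1), t)
        return F.mse_loss(pred_noise, noise)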