GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

TLDR

We learn a unified representation by predicting general occupancy, ego occupancy, and distilled high-level features from a vision foundation model in a continuous 4D field, in a self-supervised manner. In doing so, we learn a representation that is better aligned with multiple downstream tasks in autonomous driving.


Abstract

Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving.

Method

We build upon UnO and learn to predict 4D occupancy fields. While we can learn a lot about geometry and dynamics from lidar supervision alone, we also want to learn high-level semantic features, as they are crucial for downstream tasks. We therefore distill high-level features from a vision foundation model and predict them in the same way as the occupancy fields. This way, we learn a unified representation that captures both the geometric and semantic structure of the environment.
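To make the idea concrete, below is a minimal sketch of a query-based prediction head in PyTorch. The class name `QueryHead`, all dimensions, and the assumption that a scene encoder has already produced a feature vector per query are illustrative placeholders, not the actual GASP/UnO implementation.

```python
# Minimal sketch of a query-based 4D prediction head (illustrative, not the
# actual GASP architecture). A scene encoder is assumed to have produced a
# feature vector per query; dimensions and names are placeholders.
import torch
import torch.nn as nn

class QueryHead(nn.Module):
    def __init__(self, scene_dim=256, query_dim=4, hidden=256, distill_dim=384):
        super().__init__()
        # Encode the continuous (x, y, z, t) query point.
        self.query_mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.trunk = nn.Sequential(
            nn.Linear(scene_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.occupancy = nn.Linear(hidden, 1)            # general occupancy logit
        self.ego_occupancy = nn.Linear(hidden, 1)        # ego path occupancy logit
        self.features = nn.Linear(hidden, distill_dim)   # distilled VFM features

    def forward(self, scene_feat, query_xyzt):
        q = self.query_mlp(query_xyzt)
        h = self.trunk(torch.cat([scene_feat, q], dim=-1))
        return self.occupancy(h), self.ego_occupancy(h), self.features(h)

head = QueryHead()
scene_feat = torch.randn(8, 256)   # per-query scene features (placeholder)
queries = torch.randn(8, 4)        # (x, y, z, t) in normalized coordinates
occ_logit, ego_logit, feat = head(scene_feat, queries)
```

During pre-training, the occupancy branches would be supervised with lidar-derived targets and the feature branch regressed toward foundation-model features at the queried points; the precise sampling and loss choices are described in the paper.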

Results

Qualitative Results

Fun things first. Let's look at some visualizations of the learned representation. In all of these visualizations, we have reduced the high-dimensional semantic features to RGB using PCA and then projected them to the camera view and a holistic view; a minimal sketch of this reduction is shown after the figures.

Fig 1.: PCA reduced 4D semantic features projected to camera view and holistic view. Images are shown for reference.
Fig 2.: PCA reduced 4D semantic features projected to camera view and holistic view. Images are shown for reference.
Fig 3.: Semantic features probed around the ego-vehicle, shown from BEV. Images and lidar input are shown for reference.
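As a concrete reference, here is a minimal sketch of the PCA-to-RGB reduction used for these visualizations; the feature dimension and sample count are placeholders.

```python
# Project D-dimensional semantic features to 3 PCA components and normalize
# them to [0, 1] so they can be rendered as RGB colors.
import numpy as np
from sklearn.decomposition import PCA

def features_to_rgb(features: np.ndarray) -> np.ndarray:
    """features: (N, D) array of predicted semantic features."""
    rgb = PCA(n_components=3).fit_transform(features)  # (N, 3)
    rgb -= rgb.min(axis=0, keepdims=True)
    rgb /= rgb.max(axis=0, keepdims=True) + 1e-8
    return rgb

# Example with random stand-in features (D = 384 is just a placeholder).
colors = features_to_rgb(np.random.randn(10_000, 384))
```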

We can also show the ego path probability at a three-way intersection together with the PCA-reduced semantic features. This visualization shows that the model has learned to predict a multimodal ego path probability; a snippet for probing such a probability map follows the figure.

Fig 4.: Ego path probability in a three-way intersection together with PCA reduced semantic features.
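Below is an illustrative snippet for probing the ego path probability on a BEV grid at a fixed future time. It reuses the hypothetical `QueryHead` interface from the method sketch above; the grid extents, query time, and scene features are arbitrary stand-ins.

```python
# Probe ego occupancy on a BEV grid at one future time (illustrative interface).
import torch

xs = torch.linspace(-20.0, 40.0, 120)     # longitudinal extent in meters
ys = torch.linspace(-20.0, 20.0, 80)      # lateral extent in meters
X, Y = torch.meshgrid(xs, ys, indexing="ij")
Z = torch.zeros_like(X)                   # query on the ground plane
T = torch.full_like(X, 2.0)               # 2 seconds into the future
queries = torch.stack([X, Y, Z, T], dim=-1).reshape(-1, 4)

scene_feat = torch.randn(queries.shape[0], 256)  # placeholder scene features
with torch.no_grad():
    _, ego_logit, _ = head(scene_feat, queries)
ego_prob = torch.sigmoid(ego_logit).reshape(120, 80)  # one mode per viable branch
```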

Lastly, let's look at how GASP predicts the evolution of the environment into the future. Here we show the predicted occupancy and how it evolves over the next 3 seconds; a short snippet for querying future timestamps follows the figure.

Fig 5.: Occupancy predictions overlaid on the lidar input, shown up to 3 seconds into the future. Images and lidar input are shown for reference.
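And a short snippet for rolling the same 3D query points forward in time, again using the hypothetical interface from the method sketch; points and scene features are random stand-ins.

```python
# Query the same 3D points at several future timestamps to visualize how the
# predicted occupancy evolves (illustrative interface and inputs).
import torch

times = [0.0, 1.0, 2.0, 3.0]                   # seconds into the future
points = torch.rand(5000, 3) * 40.0 - 20.0     # stand-in 3D query points (meters)
scene_feat = torch.randn(5000, 256)            # placeholder scene features
occupancy_over_time = []
with torch.no_grad():
    for t in times:
        q = torch.cat([points, torch.full((5000, 1), t)], dim=-1)
        occ_logit, _, _ = head(scene_feat, q)
        occupancy_over_time.append(torch.sigmoid(occ_logit))
```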

Quantitative Results

To show that semantic information is indeed crucial for downstream tasks, we evaluate GASP on several downstream autonomous driving tasks, including semantic occupancy forecasting, online mapping, and ego trajectory prediction. We compare GASP to UnO and to training from scratch, and see consistent improvements across all tasks, demonstrating the effectiveness of GASP; a minimal fine-tuning sketch follows the results figure. The evaluated tasks are:

  • BEV Semantic Forecasting: Predicting occupancy and class over time (2D+time).
  • 4D Semantic Occupancy: BEV Semantic Forecasting generalized to 3D (3D+time).
  • 4D Occupancy: Class-agnostic occupancy prediction (3D+time).
  • Map segmentation: BEV (2D) segmentation of common mapping classes (e.g., lanes and crosswalks).
  • Ego Trajectory Prediction: Predicting the ego vehicle's future trajectory (3D+time).
Fig 6.: Green is GASP, blue is UnO, and yellow is training from scratch.
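As an illustration of how the pre-trained representation is reused downstream, here is a minimal fine-tuning sketch for one of the tasks (BEV map segmentation). The `backbone`, channel counts, and class count are placeholders rather than the actual evaluation setup.

```python
# Reuse a pre-trained backbone with a lightweight task head (illustrative setup).
import torch
import torch.nn as nn

class BEVSegmentationProbe(nn.Module):
    def __init__(self, backbone: nn.Module, bev_channels=256, num_classes=3):
        super().__init__()
        self.backbone = backbone                 # initialized from GASP pre-training
        self.head = nn.Conv2d(bev_channels, num_classes, kernel_size=1)

    def forward(self, sensor_inputs):
        bev_feat = self.backbone(sensor_inputs)  # (B, C, H, W) BEV features
        return self.head(bev_feat)               # per-class logits on the BEV grid

# Example with a stand-in backbone; in practice the weights come from
# pre-training and are either frozen or fine-tuned together with the task head.
backbone = nn.Conv2d(64, 256, kernel_size=3, padding=1)
model = BEVSegmentationProbe(backbone)
logits = model(torch.randn(1, 64, 128, 128))
```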

BibTeX

@article{ljungbergh2025gasp,
  title        = {GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving},
  author       = {Ljungbergh, William and Lilja, Adam and Tonderski, Adam and Laveno Ling, Arvid and Lindstr{\"o}m, Carl and Verbeke, Willem and Fu, Junsheng and Petersson, Christoffer and Hammarstrand, Lars and Felsberg, Michael},
  journal      = {arXiv preprint arXiv:2503.15672},
  year         = {2025}
}