QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

TLDR: QueryOcc

Query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions.


Method

QueryOcc predicts whether any 3D point in the scene is occupied and its semantic class — directly from multi-view images, without ever building a fixed voxel grid. The model takes a 4D query and a set of calibrated images, and outputs an occupancy probability and a distribution over semantic classes for that exact point.

The pipeline has four stages: (1) a standard image encoder extracts per-view features, (2) a Lift–Contract–Splat module lifts them into a BEV representation, (3) ResBlocks and deformable attention process the BEV features, and (4) a lightweight decoder answers arbitrary 4D queries from the resulting feature map.
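As a rough sketch of stage (4): because the scene is kept as a continuous BEV feature map, answering a query amounts to interpolating features at the query's (x, y) location and feeding them, together with height and time encodings, through small prediction heads. The snippet below shows only the interpolation step in NumPy; the function name `decode_query` and its interface are illustrative assumptions, not the released code.

```python
import numpy as np

def decode_query(bev_features, query_xy, bev_range=40.0):
    """Bilinearly sample a BEV feature map at a continuous (x, y) query.

    bev_features: (C, H, W) feature map covering [-bev_range, bev_range] m.
    query_xy: (2,) metric coordinates of the query point.
    Returns the (C,) interpolated feature vector.
    """
    C, H, W = bev_features.shape
    # Map metric coordinates to fractional grid indices.
    u = (query_xy[0] + bev_range) / (2 * bev_range) * (W - 1)
    v = (query_xy[1] + bev_range) / (2 * bev_range) * (H - 1)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    # Bilinear interpolation over the four neighbouring cells.
    return (bev_features[:, v0, u0] * (1 - du) * (1 - dv)
            + bev_features[:, v0, u1] * du * (1 - dv)
            + bev_features[:, v1, u0] * (1 - du) * dv
            + bev_features[:, v1, u1] * du * dv)
```

In the full model this feature vector would be decoded by MLP heads into an occupancy logit and per-class logits for the queried point.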


Self-Supervised Training

QueryOcc is trained without any manual 3D labels. Supervision comes entirely from observations at adjacent timesteps: the model must correctly predict occupancy and semantics at 4D queries derived from those frames.

Point clouds are obtained from either images or lidar. In the camera-only setup (QueryOcc), metric depth from an off-the-shelf depth model lifts pixels into pseudo point clouds, paired with semantic pseudo-labels or dense vision foundation model (VFM) features. If lidar is available, observed point clouds can replace or augment the pseudo points — this extended setup is QueryOcc+.
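The camera-only supervision relies on standard pinhole back-projection. The following is a minimal sketch, assuming a known intrinsic matrix and camera pose; `lift_depth_to_points` is a hypothetical helper, and the real pipeline additionally attaches semantic pseudo-labels or VFM features to each point:

```python
import numpy as np

def lift_depth_to_points(depth, K, cam_to_world):
    """Back-project a metric depth map into a world-frame pseudo point cloud.

    depth: (H, W) metric depth per pixel.
    K: (3, 3) camera intrinsics.
    cam_to_world: (4, 4) camera pose.
    Returns (H*W, 3) points in world coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays at unit depth
    pts_cam = rays * depth.reshape(-1, 1)    # scale rays by metric depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]   # transform into the world frame
```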

From any point cloud, supervision is generated as a set of 4D queries sampled along sensor rays. Negative (unoccupied) queries are placed between the sensor origin and each surface point; positive (occupied) queries are placed just behind it. The model is trained with binary cross-entropy for occupancy, cross-entropy for semantics, and L1 for optional VFM feature distillation — all without ever discretizing the scene into a fixed voxel grid.
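A toy version of this sampling rule for a single ray (the sample count `n_free`, the surface offset `eps`, and the function name are illustrative choices, not values from the paper):

```python
import numpy as np

def sample_ray_queries(origin, surface_pt, t, n_free=4, eps=0.2, rng=None):
    """Generate 4D (x, y, z, t) training queries along one sensor ray.

    Negatives are sampled in free space between the sensor origin and the
    observed surface point; one positive is placed just behind the surface.
    Returns (queries, occupancy_labels).
    """
    rng = rng or np.random.default_rng(0)
    direction = surface_pt - origin
    depth = np.linalg.norm(direction)
    direction = direction / depth
    # Free-space samples strictly before the surface (label 0).
    free_d = rng.uniform(0.0, depth - eps, size=n_free)
    free_pts = origin + free_d[:, None] * direction
    # One occupied sample just behind the surface (label 1).
    occ_pt = origin + (depth + eps) * direction
    pts = np.concatenate([free_pts, occ_pt[None]], axis=0)
    queries = np.concatenate([pts, np.full((len(pts), 1), t)], axis=1)
    labels = np.concatenate([np.zeros(n_free), np.ones(1)])
    return queries, labels
```

The resulting occupancy labels would feed the binary cross-entropy term, with semantic and feature targets attached to the positive queries for the cross-entropy and L1 terms.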


Results

QueryOcc sets a new state of the art among self-supervised camera-based methods on Occ3D-nuScenes, surpassing all prior work across both semantic and geometric metrics — while running at 11.6 FPS.

Method              RayIoU ↑             IoU ↑
                    Sem.   Dyn.   Occ.   Sem.   Dyn.   Occ.
SelfOcc             10.9    7.2   29.2   10.5    3.7   45.0
OccNeRF             10.8    3.7   22.8      –      –      –
DistillNeRF         10.1    5.2   29.1      –      –      –
LangOcc             11.6    9.0   38.7   13.3    7.7   51.8
GaussianOcc         11.9      –      –   11.3    7.0      –
MinkOcc w/lidar     12.5      –      –   13.2    3.4      –
GaussTR FeatUp      13.8   14.5   34.2   13.3    9.0   45.2
GaussTR T2D         14.2   17.7   33.8   13.9   13.4   44.5
GaussianFlowOcc     18.7      –      –   17.1   10.1   46.9
GaussianFlowOcc*    18.2   17.2   36.0   16.1    9.9   40.2
QueryOcc            23.6   21.7   45.2   21.3   13.2   55.0
QueryOcc+           25.8   23.8   47.4   23.5   15.7   56.9

† RayIoU reproduced · * both metrics reproduced · colored highlights in the original table mark the 1st, 2nd, and 3rd best results among comparable methods; missing entries were not reported.

Method            Mean  Barrier  Bicycle   Bus   Car  Cons. veh.  Drive. surf.  Manmade  Motorcycle  Pedestrian  Sidewalk  Terrain  Traffic cone  Trailer  Truck  Vegetation
SelfOcc           10.5      0.2      0.7   5.5  12.5         0.0          55.5     14.2         0.8         2.1      26.3     26.5           0.0      0.0    8.3         5.6
OccNeRF           10.8      0.8      0.8   5.1  12.5         3.5          52.6     18.5         0.2         3.1      20.8     24.8           1.8      0.5    3.9        13.2
DistillNeRF       10.1      1.4      2.1  10.2  10.1         2.6          43.0     14.1         2.0         5.5      16.9     15.0           4.6      1.4    7.9        15.1
LangOcc           13.3      3.1      9.0   6.3  14.2         0.4          43.7     19.6        10.8         6.2       9.5     26.4           9.0      3.8   10.7        26.4
GaussianOcc       11.3      1.8      5.8  14.6  13.6         1.3          44.6      8.6         2.8         8.0      20.1     17.6           9.8      0.6    9.6        10.3
GaussTR FeatUp    13.3      2.1      5.2  14.1  20.4         5.7          39.4     21.2         7.1         5.1      15.7     22.9           3.9      0.9   13.4        21.9
GaussTR T2D       13.9      6.5      8.5  21.8  24.3         6.3          37.0     21.2        15.5         7.9      17.2      7.2           1.9      6.1   17.2        10.0
GaussianFlowOcc   17.1      7.2      9.3  17.6  17.9         4.5          63.9     14.6         9.3         8.5      31.1     35.1          10.7      2.0   11.8        12.6
QueryOcc          21.3      7.3      6.8  26.5  20.9         4.8          69.2     25.2        10.9        15.0      34.5     38.4          13.2      3.7   17.3        25.7
QueryOcc+         23.5      9.0     10.0  30.4  25.5         4.6          69.6     28.0        16.5        17.0      37.2     42.4          11.8      3.4   18.5        28.8

Per-class IoU ↑ on Occ3D-nuScenes. Colored highlights in the original table mark the 1st, 2nd, and 3rd best results among comparable methods.


Qualitative Examples

QueryOcc produces sharp geometry, maintains fine-grained detail, and infers plausible structures behind occlusions.

Scene A — Sharp vehicle geometry and fine structures like road signs recovered.
Scene B — Motorcyclist detected; thin poles missed. Background regions predicted well.
Scene C — Pedestrians reconstructed including partially occluded instances. Plausible surface inferred behind occlusions.

Long-range predictions — Extending visualization from ±40 m to ±60 m shows that the contracted BEV preserves useful geometric signal well beyond the evaluation boundary.

Scene D — Road layout and free-space remain consistent far outside the high-resolution region.
Scene E — Bending road curvature recovered from contracted features at 60 m range.
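The contracted BEV behind these long-range results can be illustrated with a mip-NeRF-360-style contraction: identity inside a chosen radius, smooth compression of everything beyond it into a bounded shell. Whether QueryOcc uses exactly this mapping is an assumption; the sketch only conveys the qualitative near-linear / far-compressed behaviour.

```python
import numpy as np

def contract(x, radius=40.0):
    """Map unbounded metric coordinates into a bounded volume.

    Points within `radius` are kept linear (scaled to the unit ball);
    points beyond it are smoothly compressed into the shell between
    radius 1 and 2, in the spirit of the mip-NeRF 360 contraction.
    """
    x = np.asarray(x, dtype=float) / radius
    norm = np.linalg.norm(x)
    if norm <= 1.0:
        return x                            # near field: identity (full detail)
    return (2.0 - 1.0 / norm) * (x / norm)  # far field: compressed toward radius 2
```

Under this mapping the entire scene, however distant, fits in a ball of radius 2, which is what allows long-range supervision at constant memory.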

BibTeX

@inproceedings{lilja2026queryocc,
  title     = {QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy},
  author    = {Adam Lilja and Ji Lan and Junsheng Fu and Lars Hammarstrand},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}