QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

TLDR: QueryOcc

Query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions.


Method

QueryOcc predicts whether any 3D point in the scene is occupied and its semantic class — directly from multi-view images, without ever building a fixed voxel grid. The model takes a 4D query and a set of calibrated images, and outputs an occupancy probability and a distribution over semantic classes for that exact point.

The pipeline has four stages: (1) a standard image encoder extracts per-view features, (2) a Lift–Contract–Splat module lifts them into a BEV representation, (3) ResBlocks and deformable attention process the BEV features, and (4) a lightweight decoder answers arbitrary 4D queries from the resulting feature map.
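As a rough sketch of stage (4): because the scene is kept as a continuous BEV feature map, answering a query amounts to interpolating features at the query's (x, y) location and feeding them, together with height and time encodings, through small prediction heads. The snippet below shows only the interpolation step in NumPy; the function name `decode_query` and its interface are illustrative assumptions, not the released code.

```python
import numpy as np

def decode_query(bev_features, query_xy, bev_range=40.0):
    """Bilinearly sample a BEV feature map at a continuous (x, y) query.

    bev_features: (C, H, W) feature map covering [-bev_range, bev_range] m.
    query_xy: (2,) metric coordinates of the query point.
    Returns the (C,) interpolated feature vector.
    """
    C, H, W = bev_features.shape
    # Map metric coordinates to fractional grid indices.
    u = (query_xy[0] + bev_range) / (2 * bev_range) * (W - 1)
    v = (query_xy[1] + bev_range) / (2 * bev_range) * (H - 1)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    # Bilinear interpolation over the four neighbouring cells.
    return (bev_features[:, v0, u0] * (1 - du) * (1 - dv)
            + bev_features[:, v0, u1] * du * (1 - dv)
            + bev_features[:, v1, u0] * (1 - du) * dv
            + bev_features[:, v1, u1] * du * dv)
```

In the full model this feature vector would be decoded by MLP heads into an occupancy logit and per-class logits for the queried point.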


Self-Supervised Training

QueryOcc is trained without any manual 3D labels. Supervision comes entirely from observations at adjacent timesteps: the model must correctly predict occupancy and semantics at 4D queries derived from those frames.

Point clouds are obtained from either images or lidar. In the camera-only setup (QueryOcc), metric depth from an off-the-shelf depth model lifts pixels into pseudo point clouds, paired with semantic pseudo-labels or dense vision foundation model (VFM) features. If lidar is available, observed point clouds can replace or augment the pseudo points — this extended setup is QueryOcc+.
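The camera-only supervision relies on standard pinhole back-projection. The following is a minimal sketch, assuming a known intrinsic matrix and camera pose; `lift_depth_to_points` is a hypothetical helper, and the real pipeline additionally attaches semantic pseudo-labels or VFM features to each point:

```python
import numpy as np

def lift_depth_to_points(depth, K, cam_to_world):
    """Back-project a metric depth map into a world-frame pseudo point cloud.

    depth: (H, W) metric depth per pixel.
    K: (3, 3) camera intrinsics.
    cam_to_world: (4, 4) camera pose.
    Returns (H*W, 3) points in world coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays at unit depth
    pts_cam = rays * depth.reshape(-1, 1)    # scale rays by metric depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]   # transform into the world frame
```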

From any point cloud, supervision is generated as a set of 4D queries sampled along sensor rays. Negative (unoccupied) queries are placed between the sensor origin and each surface point; positive (occupied) queries are placed just behind it. The model is trained with binary cross-entropy for occupancy, cross-entropy for semantics, and L1 for optional VFM feature distillation — all without ever discretizing the scene into a fixed voxel grid.
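A toy version of this sampling rule for a single ray (the sample count `n_free`, the surface offset `eps`, and the function name are illustrative choices, not values from the paper):

```python
import numpy as np

def sample_ray_queries(origin, surface_pt, t, n_free=4, eps=0.2, rng=None):
    """Generate 4D (x, y, z, t) training queries along one sensor ray.

    Negatives are sampled in free space between the sensor origin and the
    observed surface point; one positive is placed just behind the surface.
    Returns (queries, occupancy_labels).
    """
    rng = rng or np.random.default_rng(0)
    direction = surface_pt - origin
    depth = np.linalg.norm(direction)
    direction = direction / depth
    # Free-space samples strictly before the surface (label 0).
    free_d = rng.uniform(0.0, depth - eps, size=n_free)
    free_pts = origin + free_d[:, None] * direction
    # One occupied sample just behind the surface (label 1).
    occ_pt = origin + (depth + eps) * direction
    pts = np.concatenate([free_pts, occ_pt[None]], axis=0)
    queries = np.concatenate([pts, np.full((len(pts), 1), t)], axis=1)
    labels = np.concatenate([np.zeros(n_free), np.ones(1)])
    return queries, labels
```

The resulting occupancy labels would feed the binary cross-entropy term, with semantic and feature targets attached to the positive queries for the cross-entropy and L1 terms.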


Results

QueryOcc sets a new state of the art among self-supervised camera-based methods on Occ3D-nuScenes, surpassing all prior work across both semantic and geometric metrics — while running at 11.6 FPS.

Method              RayIoU ↑             IoU ↑
                    Sem.   Dyn.   Occ.   Sem.   Dyn.   Occ.
SelfOcc             10.9    7.2   29.2   10.5    3.7   45.0
OccNeRF             10.8    3.7   22.8      –      –      –
DistillNeRF         10.1    5.2   29.1      –      –      –
LangOcc             11.6    9.0   38.7   13.3    7.7   51.8
GaussianOcc         11.9      –      –   11.3    7.0      –
MinkOcc w/lidar     12.5      –      –   13.2    3.4      –
GaussTR FeatUp      13.8   14.5   34.2   13.3    9.0   45.2
GaussTR T2D         14.2   17.7   33.8   13.9   13.4   44.5
GaussianFlowOcc     18.7      –      –   17.1   10.1   46.9
GaussianFlowOcc*    18.2   17.2   36.0   16.1    9.9   40.2
QueryOcc            23.6   21.7   45.2   21.3   13.2   55.0
QueryOcc+           25.8   23.8   47.4   23.5   15.7   56.9

† RayIoU reproduced · * both metrics reproduced · colored highlights in the original table mark the 1st, 2nd, and 3rd best results among comparable methods; missing entries were not reported.

Method            Mean  Barrier  Bicycle   Bus   Car  Cons. veh.  Drive. surf.  Manmade  Motorcycle  Pedestrian  Sidewalk  Terrain  Traffic cone  Trailer  Truck  Vegetation
SelfOcc           10.5      0.2      0.7   5.5  12.5         0.0          55.5     14.2         0.8         2.1      26.3     26.5           0.0      0.0    8.3         5.6
OccNeRF           10.8      0.8      0.8   5.1  12.5         3.5          52.6     18.5         0.2         3.1      20.8     24.8           1.8      0.5    3.9        13.2
DistillNeRF       10.1      1.4      2.1  10.2  10.1         2.6          43.0     14.1         2.0         5.5      16.9     15.0           4.6      1.4    7.9        15.1
LangOcc           13.3      3.1      9.0   6.3  14.2         0.4          43.7     19.6        10.8         6.2       9.5     26.4           9.0      3.8   10.7        26.4
GaussianOcc       11.3      1.8      5.8  14.6  13.6         1.3          44.6      8.6         2.8         8.0      20.1     17.6           9.8      0.6    9.6        10.3
GaussTR FeatUp    13.3      2.1      5.2  14.1  20.4         5.7          39.4     21.2         7.1         5.1      15.7     22.9           3.9      0.9   13.4        21.9
GaussTR T2D       13.9      6.5      8.5  21.8  24.3         6.3          37.0     21.2        15.5         7.9      17.2      7.2           1.9      6.1   17.2        10.0
GaussianFlowOcc   17.1      7.2      9.3  17.6  17.9         4.5          63.9     14.6         9.3         8.5      31.1     35.1          10.7      2.0   11.8        12.6
QueryOcc          21.3      7.3      6.8  26.5  20.9         4.8          69.2     25.2        10.9        15.0      34.5     38.4          13.2      3.7   17.3        25.7
QueryOcc+         23.5      9.0     10.0  30.4  25.5         4.6          69.6     28.0        16.5        17.0      37.2     42.4          11.8      3.4   18.5        28.8

Per-class IoU ↑ on Occ3D-nuScenes. Colored highlights in the original table mark the 1st, 2nd, and 3rd best results among comparable methods.


Qualitative Examples

QueryOcc produces sharp geometry, maintains fine-grained detail, and infers plausible structures behind occlusions.

Scene A — Sharp vehicle geometry and fine structures like road signs recovered.
Scene B — Motorcyclist detected; thin poles missed. Background regions predicted well.
Scene C — Pedestrians reconstructed including partially occluded instances. Plausible surface inferred behind occlusions.

Long-range predictions — Extending visualization from ±40 m to ±60 m shows that the contracted BEV preserves useful geometric signal well beyond the evaluation boundary.

Scene D — Road layout and free-space remain consistent far outside the high-resolution region.
Scene E — Bending road curvature recovered from contracted features at 60 m range.
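The contracted BEV behind these long-range results can be illustrated with a mip-NeRF-360-style contraction: identity inside a chosen radius, smooth compression of everything beyond it into a bounded shell. Whether QueryOcc uses exactly this mapping is an assumption; the sketch only conveys the qualitative near-linear / far-compressed behaviour.

```python
import numpy as np

def contract(x, radius=40.0):
    """Map unbounded metric coordinates into a bounded volume.

    Points within `radius` are kept linear (scaled to the unit ball);
    points beyond it are smoothly compressed into the shell between
    radius 1 and 2, in the spirit of the mip-NeRF 360 contraction.
    """
    x = np.asarray(x, dtype=float) / radius
    norm = np.linalg.norm(x)
    if norm <= 1.0:
        return x                            # near field: identity (full detail)
    return (2.0 - 1.0 / norm) * (x / norm)  # far field: compressed toward radius 2
```

Under this mapping the entire scene, however distant, fits in a ball of radius 2, which is what allows long-range supervision at constant memory.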

BibTeX

@inproceedings{lilja2026queryocc,
  title     = {QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy},
  author    = {Adam Lilja and Ji Lan and Junsheng Fu and Lars Hammarstrand},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}