
Learning 3D Perception from Others' Predictions

A label-efficient framework for 3D detection using expert predictions

The Ohio State University · Cornell University
ICLR 2025

Motivation

Robotaxi (expert agent) → ego car (no perception): predictions shared as pseudo-labels

The expert agent shares predicted bounding boxes — no raw data or model weights required.

Can we reuse existing expert perception sources — like robotaxis — to train ego vehicles for label-efficient learning?

We study a new scenario where an ego vehicle learns from the 3D bounding box predictions of a nearby expert agent — without accessing raw sensor data or model weights. This is label-efficient, sensor-agnostic, and communication-light. But naively using received predictions as ground-truth labels yields poor performance (22.0 AP), revealing two fundamental challenges.

Key Challenges

Key challenges figure
Fig. 2 — Two sources of label noise: mislocalization from GPS/sync errors and viewpoint mismatch from occlusion and FoV differences.

Method Overview

R&B-POP pipeline
Fig. 3 — R&B-POP pipeline: collect predictions → refine with box ranker → train detector with distance-based curriculum.

R&B-POP stands for Refining & Discovering Boxes for 3D Perception from Others' Predictions. Two complementary components address the two challenges:

Box Ranker
Trained with as few as 40 annotated frames (<1% of the data), the ranker estimates the localization quality of candidate boxes and selects the best one via coarse-to-fine sampling, correcting GPS/sync errors without re-annotating the full dataset. It can also be trained on simulation data with zero real labels.
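The coarse-to-fine selection described above can be sketched as a simple search loop: perturb the received box, keep the candidate the ranker scores highest, then shrink the sampling radius. This is a minimal illustration, not the paper's implementation; `refine_box`, the `(x, y, yaw)` pose parameterization, and the toy score function are all hypothetical stand-ins (a real ranker would score candidates from point-cloud features).

```python
import random

def refine_box(box, score_fn, rounds=3, n_samples=32, init_radius=1.0):
    """Coarse-to-fine search around a received pseudo-label box.

    box: (x, y, yaw) pose of the received box.
    score_fn: learned localization-quality score (higher = better).
    Each round samples perturbations around the current best candidate,
    then halves the radius, mimicking coarse-to-fine refinement.
    """
    best, best_score = box, score_fn(box)
    radius = init_radius
    for _ in range(rounds):
        for _ in range(n_samples):
            cand = (
                best[0] + random.uniform(-radius, radius),    # shift x
                best[1] + random.uniform(-radius, radius),    # shift y
                best[2] + random.uniform(-0.1, 0.1) * radius, # adjust yaw
            )
            s = score_fn(cand)
            if s > best_score:
                best, best_score = cand, s
        radius *= 0.5  # refine: tighter sampling each round
    return best

# Toy score peaked at a hypothetical true pose (2.0, -1.0, 0.3):
def toy_score(b):
    return -((b[0] - 2.0) ** 2 + (b[1] + 1.0) ** 2 + (b[2] - 0.3) ** 2)

refined = refine_box((1.0, 0.0, 0.0), toy_score)
```

By construction the search never returns a candidate scored worse than the input box, so even a noisy learned ranker can only improve (or preserve) the received pseudo-labels.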
Distance-Based Curriculum
Viewpoint mismatch worsens with inter-agent distance. We exploit this by first self-training on high-quality close-range frames, then propagating labels to harder long-range frames in a second round — discovering objects not visible to the reference agent.
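The two-round schedule above can be summarized in a few lines. This is a hedged sketch under stated assumptions: `distance_curriculum`, the `frames` dicts, and the `train_fn` / `pseudo_label_fn` callables are illustrative names, not the paper's API, and the toy stand-ins below exist only to make the control flow runnable.

```python
def distance_curriculum(frames, train_fn, pseudo_label_fn, cutoff=30.0):
    """Two-round self-training ordered by inter-agent distance.

    Round 1 trains only on close-range frames, where the shared
    pseudo-labels are most reliable. Round 2 uses the round-1 model
    to pseudo-label *all* frames (including long-range ones the
    reference agent sees poorly) and retrains on the full set.
    """
    close = [f for f in frames if f["distance"] <= cutoff]
    model = train_fn(close)                          # round 1: easy frames
    relabeled = [pseudo_label_fn(model, f) for f in frames]
    return train_fn(relabeled)                       # round 2: full range

# Toy stand-ins: a "model" just records how many frames it saw.
frames = [{"distance": d} for d in (12.0, 25.0, 48.0, 70.0)]
train_fn = lambda fs: {"trained_on": len(fs)}
pseudo_label_fn = lambda model, f: {**f, "pseudo_labeled": True}
final_model = distance_curriculum(frames, train_fn, pseudo_label_fn)
```

The key design choice is that the curriculum is driven by a quantity the ego vehicle already knows (inter-agent distance), so no extra supervision is needed to decide which frames are "easy".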

The pipeline is sensor- and detector-agnostic: it works with 8/16/32-beam LiDARs and with either PointPillars or SECOND, and it generalizes to sim-to-real domain adaptation.

Result Overview

R&B-POP closes nearly all of the gap to the supervised upper bound (56.5 vs. 58.4 AP@IoU 0.5) while using no human labels for detector training — only 40 frames for the ranker.

| Label source | Refinement | Self-training      | 0–30m | 30–50m | 50–80m | 0–80m |
|--------------|------------|--------------------|-------|--------|--------|-------|
| R's pred     | –          | –                  | 34.7  | 13.5   | 8.6    | 22.0  |
| R's GT       | –          | –                  | 29.7  | 14.1   | 7.3    | 19.6  |
| R's pred     | heuristic  | –                  | 53.2  | 22.0   | 16.9   | 37.8  |
| R's pred     | ranker     | –                  | 50.3  | 24.7   | 18.2   | 38.0  |
| R's pred     | –          | naive ST           | 45.9  | 18.7   | 16.5   | 32.4  |
| R's pred     | heuristic  | naive ST           | 50.4  | 19.6   | 15.4   | 35.4  |
| R's pred     | ranker     | naive ST           | 60.6  | 29.7   | 19.2   | 45.0  |
| R's pred     | –          | dist. curriculum   | 57.3  | 29.6   | 21.0   | 42.5  |
| R's pred     | heuristic  | dist. curriculum   | 60.5  | 25.5   | 17.0   | 43.2  |
| R's pred     | ranker     | dist. curriculum   | 73.3  | 43.3   | 23.3   | 56.5  |
| E's GT ⋆     | –          | –                  | 75.2  | 45.9   | 28.8   | 58.4  |

V2V4Real, PointPillars, 32-beam LiDAR. AP@IoU 0.5. R: reference (expert) agent; E: ego agent. ⋆ Supervised upper bound.

Qualitative results
Fig. 4 — Qualitative results: pseudo-label quality improves progressively, from R's noisy predictions, to after box refinement, to after distance-based curriculum self-training, approaching E's ground truth.

BibTeX

@inproceedings{yoo2025rnbpop,
  title={Learning 3D Perception from Others' Predictions},
  author={Yoo, Jinsu and Feng, Zhenyang and Pan, Tai-Yu and Sun, Yihong and
          Phoo, Cheng Perng and Chen, Xiangyu and Campbell, Mark and
          Weinberger, Kilian Q. and Hariharan, Bharath and Chao, Wei-Lun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}