
Learning 3D Perception from Others' Predictions

A label-efficient framework for 3D detection using expert predictions

The Ohio State University · Cornell University
ICLR 2025

Motivation

Robotaxi (expert agent) → ego car (no perception): predictions shared as pseudo-labels

The expert agent shares predicted bounding boxes — no raw data or model weights required.

Can we reuse existing expert perception sources — like robotaxis — to train ego vehicles for label-efficient learning?

We study a new scenario where an ego vehicle learns from the 3D bounding box predictions of a nearby expert agent — without accessing raw sensor data or model weights. This is label-efficient, sensor-agnostic, and communication-light. But naively using received predictions as ground-truth labels yields poor performance (22.0 AP), revealing two fundamental challenges.

Key Challenges

Key challenges figure
Fig. 2 — Two sources of label noise: mislocalization from GPS/sync errors and viewpoint mismatch from occlusion and FoV differences.

Method Overview

R&B-POP pipeline
Fig. 3 — R&B-POP pipeline: collect predictions → refine with box ranker → train detector with distance-based curriculum.

R&B-POP stands for Refining & Discovering Boxes for 3D Perception from Others' Predictions. Two complementary components address the two challenges:

Box Ranker
Trained with as few as 40 annotated frames (<1% of the data), the ranker estimates the localization quality of candidate boxes and selects the best one via coarse-to-fine sampling, correcting GPS/sync errors without re-annotating the full dataset. It can also be trained on simulation data with zero real labels.
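The coarse-to-fine selection described above can be sketched as a simple search loop: perturb the received box, keep the candidate the ranker scores highest, then shrink the sampling radius. This is a minimal illustration, not the paper's implementation; `refine_box`, the `(x, y, yaw)` pose parameterization, and the toy score function are all hypothetical stand-ins (a real ranker would score candidates from point-cloud features).

```python
import random

def refine_box(box, score_fn, rounds=3, n_samples=32, init_radius=1.0):
    """Coarse-to-fine search around a received pseudo-label box.

    box: (x, y, yaw) pose of the received box.
    score_fn: learned localization-quality score (higher = better).
    Each round samples perturbations around the current best candidate,
    then halves the radius, mimicking coarse-to-fine refinement.
    """
    best, best_score = box, score_fn(box)
    radius = init_radius
    for _ in range(rounds):
        for _ in range(n_samples):
            cand = (
                best[0] + random.uniform(-radius, radius),    # shift x
                best[1] + random.uniform(-radius, radius),    # shift y
                best[2] + random.uniform(-0.1, 0.1) * radius, # adjust yaw
            )
            s = score_fn(cand)
            if s > best_score:
                best, best_score = cand, s
        radius *= 0.5  # refine: tighter sampling each round
    return best

# Toy score peaked at a hypothetical true pose (2.0, -1.0, 0.3):
def toy_score(b):
    return -((b[0] - 2.0) ** 2 + (b[1] + 1.0) ** 2 + (b[2] - 0.3) ** 2)

refined = refine_box((1.0, 0.0, 0.0), toy_score)
```

By construction the search never returns a candidate scored worse than the input box, so even a noisy learned ranker can only improve (or preserve) the received pseudo-labels.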
Distance-Based Curriculum
Viewpoint mismatch worsens with inter-agent distance. We exploit this by first self-training on high-quality close-range frames, then propagating labels to harder long-range frames in a second round — discovering objects not visible to the reference agent.
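The two-round schedule above can be summarized in a few lines. This is a hedged sketch under stated assumptions: `distance_curriculum`, the `frames` dicts, and the `train_fn` / `pseudo_label_fn` callables are illustrative names, not the paper's API, and the toy stand-ins below exist only to make the control flow runnable.

```python
def distance_curriculum(frames, train_fn, pseudo_label_fn, cutoff=30.0):
    """Two-round self-training ordered by inter-agent distance.

    Round 1 trains only on close-range frames, where the shared
    pseudo-labels are most reliable. Round 2 uses the round-1 model
    to pseudo-label *all* frames (including long-range ones the
    reference agent sees poorly) and retrains on the full set.
    """
    close = [f for f in frames if f["distance"] <= cutoff]
    model = train_fn(close)                          # round 1: easy frames
    relabeled = [pseudo_label_fn(model, f) for f in frames]
    return train_fn(relabeled)                       # round 2: full range

# Toy stand-ins: a "model" just records how many frames it saw.
frames = [{"distance": d} for d in (12.0, 25.0, 48.0, 70.0)]
train_fn = lambda fs: {"trained_on": len(fs)}
pseudo_label_fn = lambda model, f: {**f, "pseudo_labeled": True}
final_model = distance_curriculum(frames, train_fn, pseudo_label_fn)
```

The key design choice is that the curriculum is driven by a quantity the ego vehicle already knows (inter-agent distance), so no extra supervision is needed to decide which frames are "easy".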

The pipeline is sensor- and detector-agnostic: it works with 8/16/32-beam LiDARs and with either PointPillars or SECOND, and it generalizes to sim-to-real domain adaptation.

Result Overview

R&B-POP closes nearly all of the gap to the supervised upper bound (56.5 vs. 58.4 AP@IoU 0.5) while using no human labels for detector training — only 40 frames for the ranker.

| Label source | Refinement | Self-training      | 0–30m | 30–50m | 50–80m | 0–80m |
|--------------|------------|--------------------|-------|--------|--------|-------|
| R's pred     | –          | –                  | 34.7  | 13.5   | 8.6    | 22.0  |
| R's GT       | –          | –                  | 29.7  | 14.1   | 7.3    | 19.6  |
| R's pred     | heuristic  | –                  | 53.2  | 22.0   | 16.9   | 37.8  |
| R's pred     | ranker     | –                  | 50.3  | 24.7   | 18.2   | 38.0  |
| R's pred     | –          | naive ST           | 45.9  | 18.7   | 16.5   | 32.4  |
| R's pred     | heuristic  | naive ST           | 50.4  | 19.6   | 15.4   | 35.4  |
| R's pred     | ranker     | naive ST           | 60.6  | 29.7   | 19.2   | 45.0  |
| R's pred     | –          | dist. curriculum   | 57.3  | 29.6   | 21.0   | 42.5  |
| R's pred     | heuristic  | dist. curriculum   | 60.5  | 25.5   | 17.0   | 43.2  |
| R's pred     | ranker     | dist. curriculum   | 73.3  | 43.3   | 23.3   | 56.5  |
| E's GT ⋆     | –          | –                  | 75.2  | 45.9   | 28.8   | 58.4  |

V2V4Real, PointPillars, 32-beam LiDAR. AP@IoU 0.5. R: reference (expert) agent; E: ego agent. ⋆ Supervised upper bound.

Qualitative results
Fig. 4 — Qualitative results: pseudo-label quality improves progressively, from R's noisy predictions, to after box refinement, to after distance-based curriculum self-training, approaching E's ground truth.

BibTeX

@inproceedings{yoo2025rnbpop,
  title={Learning 3D Perception from Others' Predictions},
  author={Yoo, Jinsu and Feng, Zhenyang and Pan, Tai-Yu and Sun, Yihong and
          Phoo, Cheng Perng and Chen, Xiangyu and Campbell, Mark and
          Weinberger, Kilian Q. and Hariharan, Bharath and Chao, Wei-Lun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}