
When the City Teaches the Car:
Label-Free 3D Perception from Infrastructure

Can city infrastructure act as a distributed teacher for autonomous vehicles?

1The Ohio State University   2Google   3Cornell University   4Stanford University   5Boston University
*Equal contribution

Motivation

Fig. 1 — As roadside infrastructure scales across a city, RSUs can serve as distributed, self-learning teachers for ego vehicles — without any human labels.

Building robust 3D perception for self-driving relies on repeated collect–label–retrain cycles that become impractical as deployment expands to new regions. Meanwhile, cities are increasingly equipped with roadside units (RSUs) — static sensors at intersections that continuously observe traffic.

Can city infrastructure act as a distributed, unsupervised teacher for autonomous vehicles — with zero human labels?

We conduct a concept-and-feasibility study to systematically investigate this question. Using a controlled multi-agent simulation environment with 12 RSUs in each geo-fenced town, we show that stationary RSUs can learn reliable local detectors purely from temporal observations, and that the resulting pseudo-labels are sufficient to train competitive ego-side detectors — demonstrating infrastructure-taught learning as a promising, orthogonal path to reducing annotation cost.

Method Overview

Fig. 2 — The R&B-POP three-stage pipeline: RSUs train local detectors using temporal consistency (Stage 1, offline), broadcast predictions to passing ego vehicles (Stage 2, online), and the ego trains a standalone detector on the aggregated pseudo-labels (Stage 3, offline).

The key insight is stationarity: because RSUs observe the same scene repeatedly, background points are stable while dynamic objects appear and disappear — a direct supervisory signal requiring no annotation.

Stage 1 — RSU Training
Each RSU computes persistence point (PP) scores to separate dynamic objects from the static background. DBSCAN clusters the transient points into object proposals, which are refined with tracking, and each RSU then trains its own local detector, reaching 82.8% average AP across RSUs.
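The core of Stage 1 can be sketched in a few lines. Below is a minimal, illustrative version of persistence scoring and transient-point clustering: points that reappear across many past scans of the same static RSU are treated as background, and low-persistence points are grouped into object proposals with DBSCAN. The function names, thresholds, and the brute-force distance computation are our assumptions for illustration (a real implementation would use a KD-tree and the paper's own parameters).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def persistence_scores(query_pts, past_scans, radius=0.3):
    """Fraction of past scans in which each query point was re-observed
    within `radius` meters. Static background scores near 1; transient
    objects score near 0. Brute-force O(N*M) for clarity only."""
    scores = np.zeros(len(query_pts))
    for pts in past_scans:
        d = np.linalg.norm(query_pts[:, None, :] - pts[None, :, :], axis=-1)
        scores += (d.min(axis=1) < radius)
    return scores / len(past_scans)

def transient_clusters(query_pts, scores, thresh=0.5, eps=0.8, min_samples=5):
    """Keep low-persistence (dynamic) points and group them into
    object proposals with DBSCAN; label -1 is DBSCAN noise."""
    dyn = query_pts[scores < thresh]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dyn)
    return [dyn[labels == k] for k in sorted(set(labels)) if k != -1]
```

The resulting clusters play the role of label-free proposals that tracking then refines into box pseudo-labels for training the RSU's local detector.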
Stage 2 — Label Transfer
As the ego vehicle drives by, RSUs within 160 m broadcast their predictions in the ego's coordinate frame. Overlapping predictions are fused with distance-weighted NMS, yielding 89.5% pseudo-label recall for cars.
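One plausible reading of the distance-weighted fusion is sketched below: each detection's confidence is discounted by the broadcasting RSU's distance to it (nearer RSUs see the object better), and greedy NMS then resolves overlaps. The exponential decay, the `sigma` scale, and the axis-aligned BEV IoU are simplifying assumptions on our part, not details taken from the paper.

```python
import numpy as np

def bev_iou(a, b):
    """Axis-aligned BEV IoU between boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def distance_weighted_nms(boxes, scores, rsu_dists, iou_thr=0.5, sigma=50.0):
    """Fuse detections broadcast by several RSUs: discount each score by
    the sending RSU's distance to the detection, then keep the highest-
    weighted box in each overlapping group (greedy NMS)."""
    w = np.asarray(scores) * np.exp(-np.asarray(rsu_dists) / sigma)
    keep = []
    for i in np.argsort(-w):
        if all(bev_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return [int(k) for k in keep]
```

For two duplicate detections of the same car, this keeps the copy reported by the closer RSU even when the raw confidences are equal.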
Stage 3 — Ego Training
The ego trains a standalone 3D detector on the collected pseudo-labeled dataset. No RSU communication required at test time.

Result Overview

Results on the urban town of CIVET (our collected dataset) with a CenterPoint detector; BEV AP at IoU 0.5 (car) and 0.3 (pedestrian, cyclist). No human labels are used at any stage.

Annotation Source   | Car  | Ped. | Cyc. | Avg.
Unsup. RSUs (ours)  | 82.3 | 79.3 | 68.5 | 76.7
Supervised RSUs     | 93.9 | 84.4 | 93.4 | 90.6
Ego ground-truth ⋆  | 94.4 | 91.8 | 96.6 | 94.3

The "Unsup. RSUs" row is our label-free result; ⋆ marks the supervised upper bound.

Key Findings

BibTeX

@article{xu2026civet,
  title={When the City Teaches the Car: Label-Free 3D Perception from Infrastructure},
  author={Xu, Zhen and Yoo, Jinsu and Bautista, Cristian and Huang, Zanming and Pan, Tai-Yu and Liu, Zhenzhen and Luo, Katie Z and Campbell, Mark and Hariharan, Bharath and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2603.16742},
  year={2026}
}