Fig. 1 — As roadside infrastructure scales across a city, RSUs can serve as distributed, self-learning teachers for ego vehicles — without any human labels.
Building robust 3D perception for self-driving relies on repeated collect–label–retrain cycles that become impractical as deployment expands to new regions. Meanwhile, cities are increasingly equipped with roadside units (RSUs) — static sensors at intersections that continuously observe traffic.
Can city infrastructure act as a distributed, unsupervised teacher for autonomous vehicles — with zero human labels?
We conduct a concept-and-feasibility study to systematically investigate this question. Using a controlled multi-agent simulation environment with 12 RSUs in each geo-fenced town, we show that stationary RSUs can learn reliable local detectors purely from temporal observations, and that the resulting pseudo-labels are sufficient to train competitive ego-side detectors — demonstrating infrastructure-taught learning as a promising, orthogonal path to reducing annotation cost.
Method Overview
Fig. 2 — Three-stage pipeline: RSUs train local detectors using temporal consistency (Stage 1, offline), broadcast predictions to passing ego vehicles (Stage 2, online), and the ego trains a standalone detector using aggregated pseudo-labels (Stage 3, offline).
The key insight is stationarity: because RSUs observe the same scene repeatedly, background points are stable while dynamic objects appear and disappear — a direct supervisory signal requiring no annotation.
Stage 1 — RSU Training
Each RSU computes persistence point (PP) scores to identify dynamic objects. DBSCAN clusters the transient points into object proposals, which are refined with tracking. Each RSU then trains its own local detector, reaching 82.8% average AP across RSUs.
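The stationarity signal behind Stage 1 can be sketched as voxel-level persistence: a voxel occupied in nearly every sweep is background, while a briefly occupied voxel likely held a moving object. This is a minimal illustration of the idea, not the paper's exact PP formulation; the voxel size, threshold, and toy data are assumptions.

```python
import numpy as np

def persistence_scores(sweeps, voxel=0.5):
    """Per-point persistence: the fraction of sweeps in which the
    point's voxel was occupied. Background voxels score ~1.0; voxels
    touched only while an object passed through score low."""
    counts = {}
    for pts in sweeps:
        # Count each occupied voxel once per sweep.
        for key in {tuple(v) for v in np.floor(pts / voxel).astype(int)}:
            counts[key] = counts.get(key, 0) + 1
    n = len(sweeps)
    scores = []
    for pts in sweeps:
        keys = np.floor(pts / voxel).astype(int)
        scores.append(np.array([counts[tuple(k)] / n for k in keys]))
    return scores

# Toy scene: a static wall seen in every sweep, a car present in one sweep.
wall = np.array([[x, 0.0, 0.0] for x in np.arange(0, 5, 0.5)])
car = np.array([[10.0, 2.0, 0.0], [10.5, 2.0, 0.0]])
sweeps = [wall, wall, np.vstack([wall, car])]
s = persistence_scores(sweeps)
dynamic = s[2] < 0.5  # illustrative threshold: flag transient points
```

In this toy example the two car points are the only ones flagged dynamic; in the pipeline, such transient points would then be clustered into proposals.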
Stage 2 — Label Transfer
As the ego drives, RSUs within 160 m broadcast their predictions in ego coordinates. Overlapping predictions from multiple RSUs are fused with distance-weighted NMS, achieving 89.5% pseudo-label recall for cars.
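The exact fusion rule is not specified on this page beyond "distance-weighted NMS"; one plausible sketch is greedy BEV NMS in which each detection's score is down-weighted by its broadcasting RSU's distance to the ego, so the nearer RSU's box wins ties. The exponential weighting, `sigma`, and axis-aligned box format below are illustrative assumptions.

```python
import numpy as np

def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def fuse_rsu_boxes(boxes, scores, rsu_dists, iou_thr=0.5, sigma=50.0):
    """Greedy NMS with scores down-weighted by RSU-to-ego distance,
    so detections from nearer (more trusted) RSUs are kept first."""
    w = scores * np.exp(-np.asarray(rsu_dists) / sigma)
    keep = []
    for i in np.argsort(-w):
        if all(bev_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return [int(i) for i in keep]

# Two RSUs report the same car with slightly offset boxes.
boxes = np.array([[0.0, 0.0, 4.0, 2.0], [0.2, 0.0, 4.2, 2.0]])
keep = fuse_rsu_boxes(boxes, np.array([0.9, 0.9]), [30.0, 120.0])
```

Here the two boxes overlap heavily (BEV IoU ≈ 0.9), so only the box from the RSU 30 m away survives.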
Stage 3 — Ego Training
The ego trains a standalone 3D detector on the collected pseudo-labeled dataset. No RSU communication is required at test time.
Result Overview
Urban town from our collected CIVET dataset; CenterPoint detector; BEV AP at IoU 0.5 (car) / 0.3 (pedestrian, cyclist). No human labels are used at any stage.
Annotation Source     Car    Ped.   Cyc.   Avg.
Unsup. RSUs (ours)    82.3   79.3   68.5   76.7
Supervised RSUs       93.9   84.4   93.4   90.6
Ego ground-truth ⋆    94.4   91.8   96.6   94.3
The Unsup. RSUs row is our label-free result. ⋆ Supervised upper bound.
Key Findings
RSU detectors are location-specific. Each RSU excels within its own field of view but degrades elsewhere — motivating distributed supervision from many RSUs.
More RSUs help. Performance improves consistently from 4 → 8 → 12 RSUs as diversity of supervision increases.
Placement matters. With 4 RSUs: intersection > corner > linear layout, due to richer traffic coverage at intersections.
Domain adaptation. Infrastructure pseudo-labels from a new town can adapt a detector trained elsewhere (rural → urban: 59.2 → 81.9 AP).
Communication noise. Positional noise hurts small objects most; heuristic box refinement consistently recovers AP.
Complementarity with ego-centric methods. Infrastructure pseudo-labels and ego-centric signals (e.g., Oyster) provide complementary supervision — combining both sources yields further AP gains over either alone.
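The box-refinement heuristic from the communication-noise finding is not detailed on this page; one hypothetical version, sketched below under stated assumptions, re-centers a positionally noisy broadcast box on the centroid of the ego LiDAR points it captures. The function name, padding, and toy data are all illustrative.

```python
import numpy as np

def refine_center(center, size, points, pad=0.5):
    """Hypothetical refinement: re-center a noisy BEV box on the
    centroid of the ego LiDAR points inside the (padded) box."""
    half = np.asarray(size) / 2 + pad
    inside = np.all(np.abs(points - center) <= half, axis=1)
    if inside.sum() == 0:
        return np.asarray(center)       # no supporting points; keep box
    return points[inside].mean(axis=0)  # snap to the observed point mass

# Object points cluster around (5, 5); the RSU box arrives offset by noise.
pts = np.array([[4.8, 5.1], [5.2, 4.9], [5.0, 5.0]])
refined = refine_center(np.array([5.6, 5.4]), (2.0, 2.0), pts)
```

The intuition matches the finding: small objects have few points and tight boxes, so positional noise easily pushes the box off the object, and anchoring it back onto the ego's own measurements recovers AP.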
BibTeX
@article{xu2026civet,
title={When the City Teaches the Car: Label-Free 3D Perception from Infrastructure},
author={Xu, Zhen and Yoo, Jinsu and Bautista, Cristian and Huang, Zanming and Pan, Tai-Yu and Liu, Zhenzhen and Luo, Katie Z and Campbell, Mark and Hariharan, Bharath and Chao, Wei-Lun},
journal={arXiv preprint arXiv:2603.16742},
year={2026}
}