Fig. 1 — As roadside infrastructure scales across a city, RSUs can serve as distributed, self-learning teachers for ego vehicles — without any human labels.
Building robust 3D perception for self-driving relies on repeated collect–label–retrain cycles that become impractical as deployment expands to new regions. Meanwhile, cities are increasingly equipped with roadside units (RSUs) — static sensors at intersections that continuously observe traffic.
Can city infrastructure act as a distributed, unsupervised teacher for autonomous vehicles — with zero human labels?
We conduct a concept-and-feasibility study to systematically investigate this question. Using a controlled multi-agent simulation environment with 12 RSUs in each geo-fenced town, we show that stationary RSUs can learn reliable local detectors purely from temporal observations, and that the resulting pseudo-labels are sufficient to train competitive ego-side detectors — demonstrating infrastructure-taught learning as a promising, orthogonal path to reducing annotation cost.
Method Overview
Fig. 2 — Three-stage pipeline: RSUs train local detectors using temporal consistency (Stage 1, offline), broadcast predictions to passing ego vehicles (Stage 2, online), and the ego trains a standalone detector using aggregated pseudo-labels (Stage 3, offline).
The key insight is stationarity: because RSUs observe the same scene repeatedly, background points are stable while dynamic objects appear and disappear — a direct supervisory signal requiring no annotation.
Stage 1 — RSU Training
Each RSU computes persistence point (PP) scores to identify dynamic objects. DBSCAN clusters the transient points into object proposals, which are refined with tracking. Each RSU then trains its own local detector, reaching 82.8% average AP across RSUs.
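The stationarity signal behind Stage 1 can be sketched as voxel-level persistence: a voxel occupied in nearly every sweep is background, while a briefly occupied voxel likely held a moving object. This is a minimal illustration of the idea, not the paper's exact PP formulation; the voxel size, threshold, and toy data are assumptions.

```python
import numpy as np

def persistence_scores(sweeps, voxel=0.5):
    """Per-point persistence: the fraction of sweeps in which the
    point's voxel was occupied. Background voxels score ~1.0; voxels
    touched only while an object passed through score low."""
    counts = {}
    for pts in sweeps:
        # Count each occupied voxel once per sweep.
        for key in {tuple(v) for v in np.floor(pts / voxel).astype(int)}:
            counts[key] = counts.get(key, 0) + 1
    n = len(sweeps)
    scores = []
    for pts in sweeps:
        keys = np.floor(pts / voxel).astype(int)
        scores.append(np.array([counts[tuple(k)] / n for k in keys]))
    return scores

# Toy scene: a static wall seen in every sweep, a car present in one sweep.
wall = np.array([[x, 0.0, 0.0] for x in np.arange(0, 5, 0.5)])
car = np.array([[10.0, 2.0, 0.0], [10.5, 2.0, 0.0]])
sweeps = [wall, wall, np.vstack([wall, car])]
s = persistence_scores(sweeps)
dynamic = s[2] < 0.5  # illustrative threshold: flag transient points
```

In this toy example the two car points are the only ones flagged dynamic; in the pipeline, such transient points would then be clustered into proposals.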
Stage 2 — Label Transfer
As the ego drives, RSUs within 160 m broadcast their predictions in ego coordinates. Overlapping predictions from multiple RSUs are fused with distance-weighted NMS, achieving 89.5% pseudo-label recall for cars.
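The exact fusion rule is not specified on this page beyond "distance-weighted NMS"; one plausible sketch is greedy BEV NMS in which each detection's score is down-weighted by its broadcasting RSU's distance to the ego, so the nearer RSU's box wins ties. The exponential weighting, `sigma`, and axis-aligned box format below are illustrative assumptions.

```python
import numpy as np

def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def fuse_rsu_boxes(boxes, scores, rsu_dists, iou_thr=0.5, sigma=50.0):
    """Greedy NMS with scores down-weighted by RSU-to-ego distance,
    so detections from nearer (more trusted) RSUs are kept first."""
    w = scores * np.exp(-np.asarray(rsu_dists) / sigma)
    keep = []
    for i in np.argsort(-w):
        if all(bev_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return [int(i) for i in keep]

# Two RSUs report the same car with slightly offset boxes.
boxes = np.array([[0.0, 0.0, 4.0, 2.0], [0.2, 0.0, 4.2, 2.0]])
keep = fuse_rsu_boxes(boxes, np.array([0.9, 0.9]), [30.0, 120.0])
```

Here the two boxes overlap heavily (BEV IoU ≈ 0.9), so only the box from the RSU 30 m away survives.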
Stage 3 — Ego Training
The ego trains a standalone 3D detector on the collected pseudo-labeled dataset. No RSU communication is required at test time.
Result Overview
Urban town from our collected CIVET dataset; CenterPoint detector; BEV AP at IoU 0.5 (car) / 0.3 (pedestrian, cyclist). No human labels are used at any stage.
Annotation Source     Car    Ped.   Cyc.   Avg.
Unsup. RSUs (ours)    82.3   79.3   68.5   76.7
Supervised RSUs       93.9   84.4   93.4   90.6
Ego ground-truth ⋆    94.4   91.8   96.6   94.3
The Unsup. RSUs row is our label-free result. ⋆ Supervised upper bound.
Key Findings
RSU detectors are location-specific. Each RSU excels within its own field of view but degrades elsewhere — motivating distributed supervision from many RSUs.
More RSUs help. Performance improves consistently from 4 → 8 → 12 RSUs as diversity of supervision increases.
Placement matters. With 4 RSUs: intersection > corner > linear layout, due to richer traffic coverage at intersections.
Domain adaptation. Infrastructure pseudo-labels from a new town can adapt a detector trained elsewhere (rural → urban: 59.2 → 81.9 AP).
Communication noise. Positional noise hurts small objects most; heuristic box refinement consistently recovers AP.
Complementarity with ego-centric methods. Infrastructure pseudo-labels and ego-centric signals (e.g., Oyster) provide complementary supervision — combining both sources yields further AP gains over either alone.
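The box-refinement heuristic from the communication-noise finding is not detailed on this page; one hypothetical version, sketched below under stated assumptions, re-centers a positionally noisy broadcast box on the centroid of the ego LiDAR points it captures. The function name, padding, and toy data are all illustrative.

```python
import numpy as np

def refine_center(center, size, points, pad=0.5):
    """Hypothetical refinement: re-center a noisy BEV box on the
    centroid of the ego LiDAR points inside the (padded) box."""
    half = np.asarray(size) / 2 + pad
    inside = np.all(np.abs(points - center) <= half, axis=1)
    if inside.sum() == 0:
        return np.asarray(center)       # no supporting points; keep box
    return points[inside].mean(axis=0)  # snap to the observed point mass

# Object points cluster around (5, 5); the RSU box arrives offset by noise.
pts = np.array([[4.8, 5.1], [5.2, 4.9], [5.0, 5.0]])
refined = refine_center(np.array([5.6, 5.4]), (2.0, 2.0), pts)
```

The intuition matches the finding: small objects have few points and tight boxes, so positional noise easily pushes the box off the object, and anchoring it back onto the ego's own measurements recovers AP.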
BibTeX
@article{xu2026civet,
title={When the City Teaches the Car: Label-Free 3D Perception from Infrastructure},
author={Xu, Zhen and Yoo, Jinsu and Bautista, Cristian and Huang, Zanming and Pan, Tai-Yu and Liu, Zhenzhen and Luo, Katie Z and Campbell, Mark and Hariharan, Bharath and Chao, Wei-Lun},
journal={arXiv preprint arXiv:2603.16742},
year={2026}
}