Leveraging Sparse LiDAR for RAFT-Stereo:
A Depth Pre-Fill Perspective

Sparse LiDAR-guided depth estimation via depth pre-filling

Jinsu Yoo¹, Sooyoung Jeon¹, Zanming Huang¹, Tai-Yu Pan², Wei-Lun Chao¹

¹The Ohio State University ²Google

arXiv 2025

Motivation

Stereo cameras are cheap and ubiquitous, but lose accuracy in textureless regions and at range. LiDAR provides precise, absolute depth measurements but is expensive; in practice it is often operated at very low density — as few as 300 projected points per image frame. Can we build a stereo estimator that genuinely benefits from such extremely sparse LiDAR guidance?

RAFT-Stereo is a widely used iterative stereo matcher. Due to its architectural simplicity, it is straightforward to inject LiDAR depth into the initial disparity map fed to the GRU—a form of late fusion that directly guides cost-volume retrieval. Under dense LiDAR this works well, but performance gains collapse at extreme sparsity.
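To make the naive injection concrete, the sketch below converts a projected LiDAR depth map into an initial disparity map using the standard rectified-stereo relation disparity = focal length × baseline / depth. The calibration values, array sizes, and the function name `depth_to_disparity` are illustrative assumptions, not taken from the paper, and the sign convention of the actual RAFT-Stereo code may differ.

```python
import numpy as np

def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Convert metric depth to disparity in pixels; pixels with no return stay 0."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    disparity = np.zeros_like(depth_m)
    valid = depth_m > 0
    disparity[valid] = focal_px * baseline_m / depth_m[valid]
    return disparity

# Hypothetical calibration, roughly KITTI-like, for illustration only.
FOCAL_PX = 720.0
BASELINE_M = 0.54

# A tiny projected LiDAR depth map with only two valid returns.
sparse_depth = np.zeros((4, 6))
sparse_depth[1, 2] = 10.0  # a point 10 m away
sparse_depth[3, 4] = 40.0  # a point 40 m away

# Naive guidance: every pixel without a LiDAR return starts at zero disparity.
init_disparity = depth_to_disparity(sparse_depth, FOCAL_PX, BASELINE_M)
```

In this toy map all but two pixels start at zero disparity, which is the situation the next section examines.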

The Problem

Naively injecting just 300 LiDAR points into RAFT-Stereo yields only a marginal improvement over the unguided baseline. Why?

The culprit is cost-volume retrieval. RAFT-Stereo looks up features in its 3D cost (correlation) volume using the current disparity estimate as a coordinate. When only a handful of sparse LiDAR points are present, the lookup pattern is spatially discontinuous: most pixels query at zero while a few scattered pixels query at accurate LiDAR-derived coordinates. This discontinuity creates sharp spatial gradients in the retrieved feature map, effectively attenuating the LiDAR signal rather than amplifying it.
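The retrieval step can be illustrated with a toy sketch. It builds an all-pairs same-row correlation volume and samples it around the column given by the current disparity. This is a simplification under stated assumptions: a single resolution level, nearest-neighbor sampling instead of bilinear interpolation, random unit-norm features, and hypothetical function names (`build_correlation_volume`, `lookup`).

```python
import numpy as np

def build_correlation_volume(feat_left, feat_right):
    """feat_*: (H, W, C). Returns (H, W, W): each left pixel dotted with
    every right pixel on the same row."""
    return np.einsum("hic,hjc->hij", feat_left, feat_right)

def lookup(corr, disparity, radius=1):
    """Sample corr around column x - d for each left pixel (h, x).
    Nearest-neighbor sampling for brevity; out-of-range taps read as 0."""
    H, W, _ = corr.shape
    rows = np.arange(H)[:, None]
    cols = np.arange(W)[None, :]
    center = cols - np.rint(disparity).astype(int)
    taps = []
    for offset in range(-radius, radius + 1):
        idx = center + offset
        inside = (idx >= 0) & (idx < W)
        vals = corr[rows, cols, np.clip(idx, 0, W - 1)]
        taps.append(np.where(inside, vals, 0.0))
    return np.stack(taps, axis=-1)  # (H, W, 2 * radius + 1)

# Toy scene: the left features are the right features shifted by 3 pixels,
# so the true disparity is 3 (columns 0..2 wrap around and have no match).
rng = np.random.default_rng(0)
H, W, C, TRUE_D = 2, 12, 8, 3
right = rng.normal(size=(H, W, C))
right /= np.linalg.norm(right, axis=-1, keepdims=True)
left = np.roll(right, TRUE_D, axis=1)
corr = build_correlation_volume(left, right)

zero_init = np.zeros((H, W))
sparse_init = zero_init.copy()
sparse_init[0, 6] = TRUE_D                   # a single "LiDAR" pixel
dense_init = np.full((H, W), float(TRUE_D))  # a pre-filled map

# Center tap of the retrieved features for each initialization.
hits_sparse = lookup(corr, sparse_init)[..., 1]
hits_dense = lookup(corr, dense_init)[..., 1]
```

With the sparse initialization only the single guided pixel retrieves the matching correlation peak, while its neighbors still query at zero disparity. With the dense initialization every pixel that has a valid match retrieves its peak, so the retrieved map varies smoothly.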


Figure 1. Cost-volume retrieval under four initial disparity maps. Sparse points (3rd column, ∇²w = 0.65) yield far more discontinuous retrieved features than the zero-disparity baseline (2nd column, ∇²w = 0.32), whereas pre-filled points (4th column, ∇²w = 0.33) closely match ground truth (1st column, ∇²w = 0.28).

Our Solution: GRAFT-Stereo

The fix is conceptually simple: pre-fill the sparse LiDAR depth map before injecting it into RAFT-Stereo as the initial disparity map (late fusion). Rather than handing the GRU a handful of scattered valid pixels surrounded by zeros, we first densify the sparse LiDAR into a spatially continuous depth map using IP-Basic—a simple morphological image-processing algorithm that requires no neural network and no training. The resulting smooth map enables well-behaved cost-volume retrieval, and RAFT-Stereo's iterative refinement takes care of the rest.

Late fusion — coarse pre-fill (image processing). IP-Basic densifies sparse LiDAR into a smooth, coarse depth map. No training required. This is the core contribution: it directly addresses the retrieval discontinuity problem.
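As a rough illustration of training-free densification, the sketch below fills empty pixels by repeated 3×3 max-dilation. It is a simplified stand-in, not IP-Basic itself, which uses a longer sequence of morphological and blurring steps. The function name `dilate_fill` and the toy values are assumptions. The fill is applied to a disparity map, so max-dilation lets nearer points (larger disparity) win where two regions meet.

```python
import numpy as np

def dilate_fill(sparse, max_iters=100):
    """Fill zeros by repeated 3x3 max-dilation; valid pixels are never overwritten."""
    filled = np.asarray(sparse, dtype=np.float64).copy()
    H, W = filled.shape
    for _ in range(max_iters):
        empty = filled == 0
        if not empty.any():
            break
        padded = np.pad(filled, 1, mode="constant", constant_values=0.0)
        # Max over the nine shifted views, i.e. each pixel's 3x3 neighborhood.
        neighborhood_max = np.max(
            [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
            axis=0,
        )
        filled[empty] = neighborhood_max[empty]
    return filled

# Toy sparse disparity map with two valid pixels (values are illustrative).
sparse_disparity = np.zeros((5, 7))
sparse_disparity[1, 1] = 38.88
sparse_disparity[3, 5] = 9.72

coarse_dense = dilate_fill(sparse_disparity)
```

The result is piecewise constant and therefore coarse, but it has no zero-valued holes, which is the property the late-fusion path relies on. Iterative refinement is then expected to correct the remaining error.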

As an additional enhancement, we also apply a learned pre-fill to the early-fusion path:

Early fusion — fine pre-fill (neural network, optional). A lightweight depth-completion network produces an accurate dense depth map for the feature encoder, further improving cost-volume construction. This is a plug-in bonus on top of the late-fusion fix.

Figure 3. GRAFT-Stereo pipeline. Step 1 (left): a neural network produces a fine dense depth map (early fusion, red dashed) while image processing produces a coarse dense depth map (late fusion, blue dashed). Step 2 (right): RAFT-Stereo iterative prediction guided by both pre-filled maps.

The two strategies are complementary: the coarse IP-Basic map works well for late fusion, while the neural-network map is better suited for early fusion. The late-fusion fix alone, GRAFT-Stereo (late), already delivers large gains over the naive baseline; adding early fusion, GRAFT-Stereo (full), pushes performance further.

Result Overview

On KITTI Depth Completion, GRAFT-Stereo (full) attains the lowest RMSE and MAE among the compared methods at every tested sparsity level, from 300 uniformly sampled points to 16-beam scans.

Figure 4. Qualitative comparison on KITTI at 300 LiDAR points. GRAFT-Stereo (Ours) achieves avg. error 0.172 m / median 0.055 m, versus RAFT-Stereo+gd (0.355 m / 0.139 m) and SDG-Depth (0.239 m / 0.074 m).

KITTI Depth Completion — Depth Metrics (RMSE / MAE in mm)

Method	300 pts†	1000 pts†	3000 pts†	4-beam§	8-beam§	16-beam§
EG-Depth	829.7 / 283.6	822.6 / 263.0	764.0 / 242.5	900.0 / 307.3	851.2 / 277.8	858.9 / 276.7
SDG-Depth	883.9 / 307.2	816.7 / 290.7	763.8 / 259.2	876.8 / 326.9	789.2 / 279.5	792.6 / 279.7
RAFT-Stereo + naive gd	929.5 / 297.0	835.0 / 274.8	827.5 / 243.1	863.8 / 297.4	874.7 / 293.6	873.3 / 288.5
GRAFT-Stereo (late)	817.7 / 237.5	747.0 / 210.1	717.8 / 193.9	817.9 / 255.1	803.0 / 243.2	808.0 / 241.1
GRAFT-Stereo (full)	739.5 / 220.2	716.5 / 208.3	675.6 / 186.2	779.0 / 240.2	774.4 / 235.1	767.8 / 230.5

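†: uniform sampling; §: beam sampling. GRAFT-Stereo (full) outperforms all baselines on both metrics in every setting. GRAFT-Stereo (late) does so as well, except for RMSE in the 8-beam and 16-beam settings, where SDG-Depth is lower.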
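The size of the gains is easier to read as a relative reduction. The short sketch below computes it for the 300-point column, using the numbers copied from the table above; the helper name `relative_reduction` is illustrative.

```python
# (RMSE, MAE) in mm at 300 uniformly sampled points, copied from the table.
results_300 = {
    "RAFT-Stereo + naive gd": (929.5, 297.0),
    "GRAFT-Stereo (late)": (817.7, 237.5),
    "GRAFT-Stereo (full)": (739.5, 220.2),
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

base_rmse, base_mae = results_300["RAFT-Stereo + naive gd"]
gains = {
    name: (relative_reduction(base_rmse, rmse), relative_reduction(base_mae, mae))
    for name, (rmse, mae) in results_300.items()
    if name != "RAFT-Stereo + naive gd"
}

for name, (rmse_gain, mae_gain) in gains.items():
    print(f"{name}: RMSE -{rmse_gain:.1f}%, MAE -{mae_gain:.1f}%")
```

Relative to naive guidance at 300 points, this works out to roughly 12% lower RMSE and 20% lower MAE for the late-fusion variant, and roughly 20% lower RMSE and 26% lower MAE for the full model.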

BibTeX

@article{yoo2025leveraging,
  title={Leveraging Sparse LiDAR for RAFT-Stereo: A Depth Pre-Fill Perspective},
  author={Yoo, Jinsu and Jeon, Sooyoung and Huang, Zanming and
          Pan, Tai-Yu and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2507.19738},
  year={2025}
}