RangeViM - Kiet Huynh

Introduction

LiDAR semantic segmentation.

A LiDAR scan is basically a large set of 3D points. Each point has (x, y, z) coordinates and often an intensity value, but nothing in those numbers says what the point belongs to. A road point, a tree point, and a car point are just numbers until a model gives them meaning.

Semantic segmentation is the part that adds that meaning. The model predicts a class for every point, so the scene becomes useful for downstream systems: road, car, pedestrian, vegetation, building, and many other labels instead of one raw geometric cloud.

I liked this problem because it is not only about recognizing objects. It also asks the model to label background structure, free space, poles, terrain, and messy boundary regions. RangeViM is my attempt to make that dense understanding accurate enough while still keeping the pipeline practical.

01

Acquisition

The sensor fires laser pulses and measures return time, turning the world into a dense 3D point cloud.

02

Preprocessing

The raw points are cleaned, aligned, and projected into a range image so the network can process them faster than it could as an unordered 3D point set.

03

Neural Inference

The segmentation model predicts class scores for each pixel in range-view space, and the resulting labels are then mapped back to the original points.

04

Downstream Use

The labeled scan can then support driving stacks, mapping, robotics, or other 3D scene understanding tasks.
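To make the preprocessing step concrete, here is a minimal sketch of a spherical range projection in NumPy. It is an illustration under stated assumptions, not the exact RangeViM code: the function name `range_projection` is hypothetical, the vertical field of view (+3 to -25 degrees) is an assumed value typical of a 64-beam sensor, and the 64 x 2048 output matches the input size used in this write-up. When several points fall into the same pixel, the sketch keeps the closest one.

```python
import numpy as np

def range_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 4) points [x, y, z, intensity] into an (H, W) range image.

    Returns the range image (np.inf where no point landed) plus the
    pixel row and column assigned to every input point.
    The field-of-view defaults are assumptions, not values from the paper.
    """
    xyz = points[:, :3]
    depth = np.maximum(np.linalg.norm(xyz, axis=1), 1e-6)  # range per point
    yaw = np.arctan2(xyz[:, 1], xyz[:, 0])                 # azimuth in [-pi, pi]
    pitch = np.arcsin(xyz[:, 2] / depth)                   # elevation angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    # Normalize both angles to image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * W           # column from azimuth
    v = (1.0 - (pitch - fov_down) / fov) * H    # row from elevation
    cols = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Collisions: keep the smallest range that lands in each pixel.
    image = np.full((H, W), np.inf)
    np.minimum.at(image, (rows, cols), depth)
    return image, rows, cols

# Two points on the same ray: only the nearer one survives in the image.
scan = np.array([[10.0, 0.0, 0.0, 0.5],
                 [20.0, 0.0, 0.0, 0.3]])
image, rows, cols = range_projection(scan)
```

The second point in the example is exactly the kind of projection collision that the reprojection step later has to deal with: it has a pixel, but the pixel does not store its range.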

Architecture

The pipeline I ended up with.

01

Point cloud

raw x, y, z and intensity

02

Range projection

turn the scan into a 64 x 2048 image

03

TinyViM backbone

mix local conv features with long-range SSM context

04

FPN decoder

recover dense detail across scales

05

KNN reprojection

send the labels back to points
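The last step, KNN reprojection, can be sketched as follows. This is a simplified illustration, assuming a plain majority vote over range-image neighbors rather than the exact post-processing used in RangeViM. The function name `knn_reproject` and the parameters `k`, `window`, and `cutoff` are hypothetical. For every point, the sketch looks at a small window around its pixel, keeps the `k` pixels whose stored range is closest to the point's own range, and lets them vote.

```python
import numpy as np

def knn_reproject(depth, rows, cols, range_img, label_img, k=5, window=5, cutoff=1.0):
    """Assign a label to every point by voting among nearby range-image pixels.

    depth:      (N,) range of each point
    rows, cols: (N,) pixel each point was projected to
    range_img:  (H, W) stored ranges (np.inf where empty)
    label_img:  (H, W) integer labels predicted in range view
    """
    H, W = range_img.shape
    r = window // 2
    n_classes = int(label_img.max()) + 1
    cand_dist, cand_label = [], []
    for dv in range(-r, r + 1):
        for du in range(-r, r + 1):
            raw_rows = rows + dv
            inside = (raw_rows >= 0) & (raw_rows < H)
            rr = np.clip(raw_rows, 0, H - 1)
            cc = (cols + du) % W                 # azimuth wraps around 360 degrees
            dist = np.abs(range_img[rr, cc] - depth)
            dist[~inside] = np.inf               # no neighbors above/below the image
            cand_dist.append(dist)
            cand_label.append(label_img[rr, cc])
    cand_dist = np.stack(cand_dist, axis=1)      # (N, window * window)
    cand_label = np.stack(cand_label, axis=1)

    nearest = np.argsort(cand_dist, axis=1)[:, :k]
    near_dist = np.take_along_axis(cand_dist, nearest, axis=1)
    near_label = np.take_along_axis(cand_label, nearest, axis=1)

    # Only neighbors whose range is within `cutoff` of the point may vote.
    votes = np.zeros((len(depth), n_classes), dtype=np.int64)
    valid = near_dist <= cutoff
    for j in range(k):
        idx = np.nonzero(valid[:, j])[0]
        np.add.at(votes, (idx, near_label[idx, j]), 1)

    out = votes.argmax(axis=1)
    no_vote = votes.sum(axis=1) == 0             # fall back to the point's own pixel
    out[no_vote] = label_img[rows[no_vote], cols[no_vote]]
    return out

# Toy case: pixel (0, 3) stores a near object (range 5, class 2),
# while the surrounding pixels store a surface at range 10 (class 1).
range_img = np.full((2, 8), 10.0)
label_img = np.ones((2, 8), dtype=np.int64)
range_img[0, 3], label_img[0, 3] = 5.0, 2
depth = np.array([10.1, 5.0])        # both points project to pixel (0, 3)
rows = np.array([0, 0])
cols = np.array([3, 3])
labels = knn_reproject(depth, rows, cols, range_img, label_img)
```

In the toy case the first point lost the projection collision, so a direct pixel lookup would give it class 2. The range-aware vote gives it class 1 instead, which is the point of doing the reprojection with neighbors rather than a plain lookup.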

Problem

The awkward tradeoff: speed versus context.

Range-view methods are attractive because they turn a 3D scan into something closer to an image, so inference can be much faster than fully 3D processing. The problem is that strong context modeling, especially with Transformer attention, gets expensive at full resolution, which makes cropped inference tempting. RangeViM instead keeps the full horizontal field of view in a single pass.

Backbone Choice

How I moved away from the Transformer baseline.

Transformer range-view baseline

RangeViT was a good starting point, but not a cheap one.

My first serious baseline was RangeViT, a pioneering Transformer-based method for range-view LiDAR images. Earlier projection-based models were mostly CNN-oriented, while RangeViT showed that attention could work very well in this setting and reach near-SOTA accuracy among range-projection methods.

A Vision Transformer is good at modeling broad dependencies across the range image, but the cost of self-attention grows quadratically with token count. At 64 x 2048 resolution, that cost becomes hard to ignore, and cropped inference starts to look tempting even though it weakens full-scene context.

Great at learning long-range relationships.
Easy to borrow ideas from image segmentation.
Gets expensive when the full range image is kept.
Crops can break the 360-degree context that LiDAR naturally has.
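A quick back-of-the-envelope calculation shows why the full frame hurts. The sketch below counts tokens and pairwise attention entries for a full 64 x 2048 range image versus a narrower crop. The 8 x 8 patch size and the 512-column crop width are assumptions chosen only for illustration, not RangeViT's actual settings, and `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(height, width, patch_h, patch_w):
    """Return (token count, number of token pairs one attention map must score)."""
    tokens = (height // patch_h) * (width // patch_w)
    return tokens, tokens * tokens

# Full frame versus one crop, with assumed 8 x 8 patches.
full_tokens, full_pairs = attention_pairs(64, 2048, 8, 8)   # 2048 tokens, 4,194,304 pairs
crop_tokens, crop_pairs = attention_pairs(64, 512, 8, 8)    # 512 tokens, 262,144 pairs

# Four crops cover the same image at a quarter of the attention cost,
# but no token can attend across a crop boundary.
crops_needed = 2048 // 512
ratio = full_pairs / (crops_needed * crop_pairs)             # 4.0
```

Under these assumptions, full-frame attention scores four times as many token pairs per head and per layer as four independent crops do. That gap is the price of keeping the 360-degree context.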

RangeViM TinyViM backbone

TinyViM gave me a cleaner compromise.

Before landing here, I tried a lot of ideas that did not survive experiments: Swin-style attention, MetaFormer/PVT-like variants, better projection recovery, and even hybrids of range-view and point-based processing inspired by work such as FRNet and HarDNet-style pipelines. Many versions became heavier without delivering the accuracy I expected.

TinyViM gave me a way to keep local geometric detail with convolutional stages, while using state-space blocks for wider context at a lower cost than full attention. It was not magic, but it finally matched the shape of the problem better than my earlier attempts.

Keeps the projected scan in one forward pass.
Uses convolutions where local geometry matters.
Uses sequence modeling for wider horizontal context.
Still has the usual range-view weaknesses: projection collisions and reprojection sensitivity.

Technical Decisions

Small choices that mattered more than I expected.

01

Keep the full frame

I wanted the model to see the whole projected scan at once instead of guessing across crop boundaries.

02

Split local detail and long context

Convolutions handle sharp local geometry, while state-space blocks carry wider scene context without paying full attention cost.

03

Respect the shape of range images

The decoder uses wider horizontal kernels because LiDAR range images stretch scene context mostly along the azimuth direction.
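To illustrate the third decision, here is a minimal NumPy sketch of a wide horizontal kernel applied to a range-image feature map. It is a stand-in for the idea, not the actual decoder layer: the helper `horizontal_conv`, the 1 x 7 kernel, and the circular padding along the azimuth axis are all assumptions made for the example.

```python
import numpy as np

def horizontal_conv(feature, kernel):
    """Apply a 1 x k kernel along the azimuth (width) axis of an (H, W) map.

    The width axis is padded circularly, on the assumption that the first
    and last columns of a 360-degree range image are physical neighbors.
    """
    k = len(kernel)
    pad = k // 2
    padded = np.pad(feature, ((0, 0), (pad, pad)), mode="wrap")
    out = np.zeros(feature.shape, dtype=np.float64)
    width = feature.shape[1]
    for i in range(k):
        out += kernel[i] * padded[:, i:i + width]
    return out

# A single active pixel at column 0 influences columns on both sides of the seam.
feature = np.zeros((2, 8))
feature[0, 0] = 1.0
response = horizontal_conv(feature, np.ones(7))
```

With a width of 7, each output pixel mixes three columns on either side but stays inside its own row, so context spreads along azimuth without blurring across laser beams.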

Experiments

Looking past the single mIoU number.

SemanticKITTI: 67.8 mIoU

+3.8 pp over my RangeViT baseline

nuScenes val: 76.88 mIoU

+1.68 pp over RangeViT

Input frame: 64 x 2048 range image

the full projected scan

TinyViM-Base: 13.83M params

38.50 ms in the A100 profile

Results

What finally worked.

RangeViM reaches 67.8 mIoU on SemanticKITTI and 76.88 mIoU on the nuScenes validation split. More importantly for me, the ablations started to make sense: backbone size, decoder choices, window behavior, and robustness tests all pointed to the same design direction instead of feeling like isolated tricks.
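For readers unfamiliar with the metric, mIoU is the per-class intersection over union averaged across classes. The sketch below computes it from a confusion matrix. It is a generic illustration, not the official SemanticKITTI or nuScenes evaluation code, and the helper name `mean_iou` and the `ignore_index` argument are assumptions.

```python
import numpy as np

def mean_iou(pred, target, n_classes, ignore_index=None):
    """Mean intersection-over-union over the classes that appear at all."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground truth, columns are predictions.
    conf = np.bincount(target * n_classes + pred,
                       minlength=n_classes * n_classes).reshape(n_classes, n_classes)
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0                      # skip classes absent from both
    iou = tp[present] / union[present]
    return float(iou.mean()), iou

target = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
miou, per_class = mean_iou(pred, target, n_classes=3)   # miou is 0.5
```

Because every class counts equally, a rare class such as a pole or a pedestrian moves the average as much as road does, which is why the number is sensitive to thin structures and boundary regions.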

Limitations & Future Work

What I would not claim yet.

This is still a single-scan range-view method, so there is plenty left to improve. I would like to explore stronger 2D-3D feature coupling, temporal fusion, better robustness under corrupted LiDAR inputs, and more realistic embedded profiling beyond the A100 measurements in the paper.