KES 2026 / First Author
My thesis work on full-resolution range-view LiDAR segmentation, built around a lightweight Vision Mamba backbone.
Introduction
A LiDAR scan is basically a large set of 3D points. Each point has (x, y, z) coordinates and often an intensity value, but nothing in those numbers says what object the point belongs to. A road point, a tree point, and a car point are just numbers until a model gives them meaning.
Semantic segmentation is the part that adds that meaning. The model predicts a class for every point, so the scene becomes useful for downstream systems: road, car, pedestrian, vegetation, building, and many other labels instead of one raw geometric cloud.
I liked this problem because it is not only about recognizing objects. It also asks the model to label background structure, free space, poles, terrain, and messy boundary regions. RangeViM is my attempt to make that dense understanding accurate enough while still keeping the pipeline practical.
The sensor fires laser pulses and measures return time, turning the world into a dense 3D point cloud.
The raw points are cleaned, aligned, and projected into a range image so the network can process them with fast 2D operations instead of working on the raw 3D cloud.
The segmentation model predicts class scores in range-view space before sending labels back to points.
The labeled scan can then support driving stacks, mapping, robotics, or other 3D scene understanding tasks.
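The projection and label-recovery steps above can be sketched with a standard spherical projection. This is a minimal illustration, not the RangeViM code: the function names, the pure-Python loops, and the field-of-view values (3° up, 25° down, a common choice for 64-beam sensors) are all assumptions made for the example.

```python
import math

def project_to_range_image(points, height=64, width=2048,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z, intensity) points onto a height x width range image.

    Returns (image, pixel_of_point). image[v][u] holds the range of the
    closest point that fell into that pixel (None when the pixel is empty),
    and pixel_of_point[i] is the (row, column) pixel of input point i.
    The field-of-view defaults are assumed values, not taken from the paper.
    """
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    image = [[None] * width for _ in range(height)]
    pixel_of_point = []
    for x, y, z, _intensity in points:
        r = math.sqrt(x * x + y * y + z * z)
        yaw = math.atan2(y, x)                      # azimuth angle
        pitch = math.asin(z / r) if r > 0 else 0.0  # elevation angle
        # Map azimuth to a column and elevation to a row.
        u = int(0.5 * (1.0 - yaw / math.pi) * width)
        v = int((1.0 - (pitch - fov_down) / fov) * height)
        u = min(max(u, 0), width - 1)
        v = min(max(v, 0), height - 1)
        # When several points share a pixel, keep the closest range.
        if image[v][u] is None or r < image[v][u]:
            image[v][u] = r
        pixel_of_point.append((v, u))
    return image, pixel_of_point

def labels_to_points(label_image, pixel_of_point):
    """Send per-pixel labels back to points by direct pixel lookup."""
    return [label_image[v][u] for v, u in pixel_of_point]
```

Note that every point landing in the same pixel receives the same label with this naive lookup, which is one reason projection recovery is worth extra care in range-view pipelines.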
Architecture
raw x, y, z and intensity
turn the scan into a 64 x 2048 image
mix local conv features with long-range SSM context
recover dense detail across scales
send the labels back to points
Problem
Range-view methods are attractive because they turn a 3D scan into something closer to an image, so inference can be much faster than fully 3D processing. The problem is that strong context modeling, especially with Transformer attention, gets expensive at full resolution. RangeViM keeps the full horizontal field of view instead of relying on cropped inference.
Backbone Choice
My first serious baseline was RangeViT, a pioneering Transformer-based method for range-view LiDAR images. Earlier projection-based models were mostly CNN-oriented, while RangeViT showed that attention could work very well in this setting and reach near-SOTA accuracy among range-projection methods.
A Vision Transformer is good at modeling broad dependencies across the range image, but the cost of full self-attention grows quadratically with token count. At 64 x 2048 resolution, that cost becomes hard to ignore, and cropped inference starts to look tempting even though it weakens full-scene context.
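A rough back-of-the-envelope comparison makes the scaling concrete. The 4 x 4 patch size and the 512-column crop below are hypothetical values chosen only for illustration; they are not the settings of RangeViT or RangeViM.

```python
def token_count(height, width, patch_h, patch_w):
    """Number of non-overlapping patches (tokens) in a height x width image."""
    return (height // patch_h) * (width // patch_w)

def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens

# Hypothetical 4 x 4 patches on the full range image and on a 512-wide crop.
full_tokens = token_count(64, 2048, 4, 4)   # 8192 tokens
crop_tokens = token_count(64, 512, 4, 4)    # 2048 tokens

# Going from the crop to the full scan gives 4x the tokens
# but 16x the token pairs for full attention.
token_ratio = full_tokens // crop_tokens
pair_ratio = attention_pairs(full_tokens) // attention_pairs(crop_tokens)
print(full_tokens, crop_tokens, token_ratio, pair_ratio)
```

A context mechanism whose cost is linear in sequence length would pay only the 4x factor here, which is the gap that makes cropping tempting for attention-based models.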
Before landing here, I tried a lot of ideas that did not survive experiments: Swin-style attention, MetaFormer/PVT-like variants, better projection recovery, and even hybrid directions that combine range-view and point-based processing, inspired by works such as FRNet and HarDNet-style pipelines. Many versions became heavier without giving the performance I expected.
TinyViM gave me a way to keep local geometric detail with convolutional stages, while using state-space blocks for wider context at a lower cost than full attention. It was not magic, but it finally matched the shape of the problem better than my earlier attempts.
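To show why a state-space block can carry context at linear cost, here is a toy scalar recurrence. It is only a sketch of the general idea: the fixed parameters `a`, `b`, and `c` are assumptions for the example, whereas Mamba-style blocks use vector states and input-dependent parameters, and this is not TinyViM's actual layer.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    One pass over the sequence, so the cost is linear in its length,
    yet every output depends on all earlier inputs through the state.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs

# A single impulse at the start keeps influencing later outputs,
# decaying by the factor a at each step.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
# → [1.0, 0.5, 0.25, 0.125]
```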
Technical Decisions
I wanted the model to see the whole projected scan at once instead of guessing across crop boundaries.
Convolutions handle sharp local geometry, while state-space blocks carry wider scene context without paying full attention cost.
The decoder uses wider horizontal kernels because LiDAR range images stretch scene context mostly along the azimuth direction.
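One way to read the horizontal-kernel choice is in terms of angle covered per pixel. The sketch below assumes a 28° vertical field of view (a typical figure for a 64-beam sensor, not a value from the paper) and compares a square 3 x 3 kernel with a hypothetical 3 x 7 kernel; the kernel sizes are illustrative, not the decoder's actual configuration.

```python
def angular_footprint(kernel_h, kernel_w, height=64, width=2048,
                      vertical_fov_deg=28.0, horizontal_fov_deg=360.0):
    """Return the (vertical, horizontal) angle in degrees spanned by a kernel.

    The vertical field of view is an assumed value for illustration.
    """
    deg_per_row = vertical_fov_deg / height      # 0.4375 deg per row
    deg_per_col = horizontal_fov_deg / width     # about 0.176 deg per column
    return kernel_h * deg_per_row, kernel_w * deg_per_col

print(angular_footprint(3, 3))  # → (1.3125, 0.52734375)
print(angular_footprint(3, 7))  # → (1.3125, 1.23046875)
```

Under these assumptions a square kernel sees a much narrower slice of the scene horizontally than vertically, while the wider kernel covers a roughly balanced angular window.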
Experiments
+3.8 pp over my RangeViT baseline
+1.68 pp over RangeViT
the full projected scan
38.50 ms in the A100 profile
Results
RangeViM reaches 67.8 mIoU on SemanticKITTI and 76.88 mIoU on the nuScenes validation split. More importantly for me, the ablations started to make sense: backbone size, decoder choices, window behavior, and robustness tests all pointed to the same design direction instead of feeling like isolated tricks.
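For readers unfamiliar with the metric, mIoU averages the per-class intersection over union, TP / (TP + FP + FN). The function below is a minimal sketch of that definition; its name, the `ignore_label` handling, and the choice to skip classes that never appear are assumptions, and the official benchmark evaluation scripts may differ in such details.

```python
def mean_iou(predictions, targets, num_classes, ignore_label=None):
    """Compute mean intersection-over-union from per-point class labels."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for pred, target in zip(predictions, targets):
        if target == ignore_label:
            continue  # unlabeled points do not count
        if pred == target:
            tp[target] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[target] += 1  # true class gains a false negative
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # skip classes absent from both prediction and target
            ious.append(tp[c] / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Class 0: IoU = 1 / 2, class 1: IoU = 2 / 3, so the mean is 7 / 12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```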
Limitations & Future Work
This is still a single-scan range-view method, so there is plenty left to improve. I would like to explore stronger 2D-3D feature coupling, temporal fusion, better robustness under corrupted LiDAR inputs, and more realistic embedded profiling beyond the A100 measurements in the paper.