4D Gaussian Videos with Motion Layering

ACM SIGGRAPH 2025 (Journal Track)

1 Zhejiang University, 2 City University of Hong Kong, 3 The University of Hong Kong, 4 Hangzhou Normal University, 5 Ant Group, 6 The University of Utah
*Joint first authors; Corresponding author


Abstract

Online free-viewpoint navigation in volumetric videos requires high-quality rendering and real-time streaming to provide an immersive user experience. However, existing methods may fail on dynamic scenes with complex motions, and their models (e.g., dynamic NeRF and 3DGS) may not be streamable due to storage and bandwidth constraints. In this paper, we propose a novel 4D Gaussian Video (4DGV) approach that enables the creation and streaming of photorealistic volumetric videos of dynamic scenes over the Internet. The core of 4DGV is a novel streamable group-of-Gaussians (GOG) representation based on motion layering. Each GOG consists of static and dynamic points obtained by lifting 2D segmentation into 3D during motion layering, where the deformation of each dynamic point is represented as temporal offsets of its attributes. Through optimization, we also adaptively convert static points back to dynamic points to handle appearance changes of static objects (e.g., moving shadows and reflections). To support real-time streaming of 4DGV, we show that by applying quantization to Gaussian attributes and H.265 encoding to deformation offsets, our GOG representation can be compressed to around 6% of the original model size without sacrificing accuracy (PSNR loss below 0.01 dB). Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art volumetric video approaches, achieving superior rendering quality with minimal storage overhead.
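To make the compression scheme concrete, below is a minimal Python sketch (not the authors' released code) of the idea: quantize per-point deformation offsets to 8 bits, pack them into grayscale frames, and pipe the frames through a standard H.265 encoder. The array layout, quantization scheme, and ffmpeg invocation are illustrative assumptions.

import subprocess
import numpy as np

def encode_offsets_h265(offsets, path, fps=30):
    """offsets: (T, N) float array holding one deformation-offset channel
    for N dynamic points over the T frames of a frame group (assumed layout)."""
    t, n = offsets.shape
    # Uniform 8-bit quantization; (lo, hi) must be kept as side information
    # so the decoder can dequantize.
    lo, hi = float(offsets.min()), float(offsets.max())
    q = np.round((offsets - lo) / max(hi - lo, 1e-8) * 255).astype(np.uint8)

    # Pack each frame's N values into a roughly square image so the codec's
    # spatial prediction can exploit coherence between neighboring points.
    side = int(np.ceil(np.sqrt(n)))
    frames = np.zeros((t, side, side), dtype=np.uint8)
    frames.reshape(t, -1)[:, :n] = q

    # Feed raw grayscale frames to ffmpeg and encode with libx265.
    cmd = ["ffmpeg", "-y",
           "-f", "rawvideo", "-pix_fmt", "gray", "-s", f"{side}x{side}",
           "-r", str(fps), "-i", "-",
           "-c:v", "libx265", "-pix_fmt", "yuv420p", path]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    proc.communicate(frames.tobytes())
    return lo, hi  # dequantization range (side information)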


Reconstruction Results

Real-time free-viewpoint rendering of reconstructed 4DGV (videos downsized).



Overview

4DGV Overview. We take a multi-view video as input and employ the 4DGV representation to transform each group of input frames into a group of Gaussians (GOG). The process involves the following stages: (a) Gaussian point initialization establishes a static Gaussian point cloud at each keyframe timestamp. (b) Motion layering initialization separates dynamic from static points, enabling static points to be shared across multiple groups. (c) GOG reconstruction learns to deform the dynamic points to model scene dynamics within each frame group; the points deformed to the next keyframe timestamp serve as initialization for the subsequent group. We further leverage the H.265 codec to encode time-dependent point deformations as multiple MP4 tracks for efficient compression. Finally, the reconstructed GOGs are consolidated for seamless real-time streaming.
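The sketch below summarizes the per-group loop implied by stages (a)-(c). Every helper here (initialize_keyframe_gaussians, split_motion_layers, optimize_deformation, apply_offsets) is a hypothetical placeholder standing in for the corresponding stage, not the paper's actual API.

def reconstruct_4dgv(video_groups):
    """video_groups: consecutive groups of multi-view frames, each with a
    keyframe (assumed structure)."""
    gogs = []
    static_pts, dynamic_pts = None, None
    for group in video_groups:
        if static_pts is None:
            # (a) Fit a static Gaussian point cloud at the first keyframe.
            points = initialize_keyframe_gaussians(group.keyframe)
            # (b) Lift 2D motion segmentation into 3D to split the layers.
            static_pts, dynamic_pts = split_motion_layers(points, group.keyframe)
        # (c) Optimize per-frame attribute offsets for the dynamic points;
        # the static points are shared across groups (static points may also
        # be adaptively converted back to dynamic during this step).
        offsets = optimize_deformation(static_pts, dynamic_pts, group.frames)
        gogs.append((static_pts, dynamic_pts, offsets))
        # Points deformed to the next keyframe initialize the next group.
        dynamic_pts = apply_offsets(dynamic_pts, offsets[-1])
    return gogs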


Motion Layering

Adaptive dynamic/static point separation.

Side-by-side comparisons of rendered RGB frames and their motion labels (dynamic vs. static).
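One plausible way to realize this 2D-to-3D lifting is sketched below: project each Gaussian center into every camera, accumulate votes from the 2D motion masks, and threshold the vote ratio. The pinhole projection convention, mask format, and majority-vote threshold are assumptions for illustration, not the paper's exact procedure.

import numpy as np

def label_points(points, cameras, motion_masks, thresh=0.5):
    """points: (N, 3) Gaussian centers; cameras: list of (K, w2c) pairs with
    3x3 intrinsics K and 4x4 world-to-camera matrices; motion_masks:
    per-camera (H, W) binary masks of moving pixels (assumed inputs)."""
    votes = np.zeros(len(points))
    hits = np.zeros(len(points))
    for (K, w2c), mask in zip(cameras, motion_masks):
        cam = w2c[:3, :3] @ points.T + w2c[:3, 3:4]      # world -> camera
        uvw = K @ cam
        uv = (uvw[:2] / np.clip(uvw[2:], 1e-8, None)).T  # perspective divide
        h, w = mask.shape
        u = uv[:, 0].astype(int)
        v = uv[:, 1].astype(int)
        # Keep points in front of the camera that land inside the image.
        valid = (cam[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        votes[valid] += mask[v[valid], u[valid]]
        hits[valid] += 1
    # A point seen as moving in most views is labeled dynamic.
    return votes / np.maximum(hits, 1) > thresh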

Comparison

Video comparisons on public datasets (more results here).


Related Work

[Yang et al. 2024] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction.

[Yang et al. 2024] Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting.

[Sun et al. 2024] 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos.

[Xu et al. 2024] Grid4D: 4D Decomposed Hash Encoding for High-Fidelity Dynamic Gaussian Splatting.

[Li et al. 2024] Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis.

[Lee et al. 2024] Fully Explicit Dynamic Gaussian Splatting.

[Wang et al. 2024] V3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians.

[Xu et al. 2024] Representing Long Volumetric Video with Temporal Gaussian Hierarchy.

BibTeX