Online free-view navigation in volumetric videos requires high-quality rendering and real-time streaming to provide immersive user experiences. However, existing methods struggle to handle dynamic scenes with complex motions, and their models (e.g., dynamic NeRF and 3DGS) may not be streamable due to storage and bandwidth constraints. In this paper, we propose a novel 4D Gaussian Video (4DGV) approach that enables the creation and streaming of photorealistic volumetric videos of dynamic scenes over the Internet. The core of 4DGV is a novel streamable group-of-Gaussians (GOG) representation based on motion layering. Each GOG consists of static and dynamic points obtained by lifting 2D segmentation into 3D during motion layering, where the deformation of each dynamic point is represented as the temporal offset of its attributes. We also adaptively convert static points back to dynamic points through optimization to handle appearance changes of static objects (e.g., moving shadows and reflections). To support real-time streaming of 4DGV, we show that by applying quantization to Gaussian attributes and H.265 encoding to deformation offsets, our GOG representation can be significantly compressed (to around 6% of the original model size) without sacrificing accuracy (PSNR loss below 0.01 dB). Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art volumetric video approaches, with superior rendering quality and minimal storage overhead.
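The attribute quantization mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple uniform per-channel min-max quantizer to 8-bit codes, whose reconstruction error is bounded by half a quantization step per channel. The function names (`quantize_attrs`, `dequantize_attrs`) and the 14-channel layout are hypothetical.

```python
import numpy as np

def quantize_attrs(attrs, bits=8):
    """Uniform per-channel min-max quantization of Gaussian attributes.
    Returns integer codes plus the (min, scale) needed to dequantize."""
    lo = attrs.min(axis=0)
    hi = attrs.max(axis=0)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    codes = np.round((attrs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_attrs(codes, lo, scale):
    """Map integer codes back to floating-point attribute values."""
    return codes.astype(np.float32) * scale + lo

# Toy example: 10k Gaussians with 14 attribute channels
# (position, scale, rotation, opacity, SH DC color).
rng = np.random.default_rng(0)
attrs = rng.normal(size=(10_000, 14)).astype(np.float32)
codes, lo, scale = quantize_attrs(attrs)
recon = dequantize_attrs(codes, lo, scale)
```

With 8-bit codes, the per-channel error is at most `scale / 2`, which is why a sufficiently fine quantizer keeps the PSNR loss negligible while shrinking storage by 4x relative to float32.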
Real-time free-viewpoint rendering of reconstructed 4DGV (videos downsized).
4DGV Overview. We take a multi-view video as input and employ the 4DGV representation to transform each group of input frames into a group of Gaussians (GOG). The process involves the following stages: (a) Gaussian point initialization establishes a static Gaussian point cloud at each keyframe timestamp. (b) Motion layering initialization separates dynamic and static points, enabling static points to be shared across multiple groups. (c) GOG reconstruction learns to deform dynamic points to model scene dynamics within each frame group. The points deformed to the next keyframe timestamp serve as initialization for the subsequent group. We further leverage the H.265 codec to encode time-dependent point deformations as multiple MP4 tracks for efficient compression. Finally, the reconstructed GOGs are consolidated for seamless real-time streaming.
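Before a video codec such as H.265 can compress the per-point deformations, they must be laid out as image frames. The sketch below shows one plausible packing scheme (not necessarily the paper's): each time step and offset channel becomes a 16-bit grayscale frame with one pixel per dynamic point. The actual codec call is omitted; both function names are hypothetical.

```python
import numpy as np

def offsets_to_frames(offsets, frame_hw, bits=16):
    """Pack per-point deformation offsets into 16-bit frames.
    offsets: (T, N, C) float array, one C-channel offset per point per frame."""
    T, N, C = offsets.shape
    H, W = frame_hw
    assert N <= H * W, "frame must hold one pixel per dynamic point"
    lo, hi = float(offsets.min()), float(offsets.max())
    scale = (hi - lo) / (2**bits - 1)
    if scale == 0:
        scale = 1.0
    codes = np.round((offsets - lo) / scale).astype(np.uint16)  # (T, N, C)
    frames = np.zeros((T, C, H * W), dtype=np.uint16)
    frames[:, :, :N] = codes.transpose(0, 2, 1)
    return frames.reshape(T, C, H, W), lo, scale

def frames_to_offsets(frames, n_points, lo, scale):
    """Invert the packing: recover (T, N, C) float offsets from frames."""
    T, C, H, W = frames.shape
    flat = frames.reshape(T, C, H * W)[:, :, :n_points]
    return flat.transpose(0, 2, 1).astype(np.float32) * scale + lo

# Toy round trip: 8 frames, 1000 dynamic points, xyz offsets in a 32x32 frame.
rng = np.random.default_rng(0)
offs = rng.normal(scale=0.02, size=(8, 1000, 3)).astype(np.float32)
frames, lo, scale = offsets_to_frames(offs, (32, 32))
recon = frames_to_offsets(frames, 1000, lo, scale)
```

Because neighboring points deform coherently, consecutive frames in such a packing are highly correlated, which is exactly what a temporal video codec exploits; in practice the frame sequence would be fed to an H.265 encoder as monochrome tracks inside an MP4 container.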
Adaptive dynamic/static point separation.
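The paper's adaptive separation is learned through optimization; as a naive illustration of the initial split, one could threshold each point's peak displacement across a frame group. Everything here (the function name, the threshold, the toy scene) is a hypothetical sketch, not the paper's method.

```python
import numpy as np

def split_dynamic_static(tracks, thresh=1e-2):
    """tracks: (T, N, 3) per-point positions across a frame group.
    A point is flagged dynamic if its peak displacement from the first
    frame exceeds `thresh` (in scene units); otherwise it stays static."""
    disp = np.linalg.norm(tracks - tracks[0], axis=-1)  # (T, N)
    return disp.max(axis=0) > thresh

# Toy scene: 10 points drifting along x, 50 points held fixed.
T = 5
rng = np.random.default_rng(1)
pts = rng.uniform(size=(60, 3)).astype(np.float32)
tracks = np.repeat(pts[None], T, axis=0)
tracks[:, :10, 0] += np.linspace(0, 0.5, T)[:, None]  # dynamic drift
dynamic_mask = split_dynamic_static(tracks)
```

A purely geometric split like this misses static objects whose appearance changes (moving shadows, reflections), which is why the method additionally converts such points back to dynamic during optimization.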
Video comparisons on public datasets (more results here).
[Yang et al. 2024] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction.
[Yang et al. 2024] Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting.
[Sun et al. 2024] 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos.
[Xu et al. 2024] Grid4D: 4D Decomposed Hash Encoding for High-Fidelity Dynamic Gaussian Splatting.
[Li et al. 2024] Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis.
[Lee et al. 2024] Fully Explicit Dynamic Gaussian Splatting.
[Wang et al. 2024] V3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians.
[Xu et al. 2024] Representing Long Volumetric Video with Temporal Gaussian Hierarchy.