Human101: Training 100+FPS Human Gaussians in 100s from 1 View

ReLER, CCAI, Zhejiang University
arXiv 2023

Teaser Image

A common VR use case. (a) The user captures a short monocular video and uploads it. (b) Our model automatically selects four frames from the video and, after an initialization step, produces an initial point cloud. (c) Our model learns a dynamic human representation within minutes. (d) Our model achieves comparable or better visual quality while rendering much faster than previous works. LPIPS* = LPIPS × 10³. The area of each circle is proportional to the training time required; larger circles indicate longer training.
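For readers reproducing panel (d): LPIPS* is simply LPIPS scaled by 10³, and each method's circle area encodes its training time. The matplotlib sketch below only illustrates this plotting convention; the method names and numbers are placeholders, not measurements from the paper.

import matplotlib.pyplot as plt

# Purely illustrative placeholder values, NOT results from the paper.
methods   = ["Method A", "Method B", "Method C"]
fps       = [2.0, 10.0, 100.0]            # rendering speed (FPS)
lpips     = [0.045, 0.040, 0.038]         # raw LPIPS
train_min = [300.0, 60.0, 100.0 / 60.0]   # training time in minutes

lpips_star = [v * 1e3 for v in lpips]       # LPIPS* = LPIPS x 10^3
areas      = [40.0 * t for t in train_min]  # circle area proportional to training time

plt.scatter(fps, lpips_star, s=areas, alpha=0.5)
for name, x, y in zip(methods, fps, lpips_star):
    plt.annotate(name, (x, y))
plt.xscale("log")
plt.xlabel("Rendering speed (FPS)")
plt.ylabel("LPIPS* (lower is better)")
plt.title("Quality vs. speed (circle area: training time)")
plt.show()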

Abstract

Reconstructing the human body from single-view videos plays a pivotal role in the virtual reality domain. One prevalent application scenario requires the rapid reconstruction of high-fidelity 3D digital humans while simultaneously ensuring real-time rendering and interaction. Existing methods often struggle to fulfill both requirements. In this paper, we introduce Human101, a novel framework that produces high-fidelity dynamic 3D human reconstructions from 1-view videos by training 3D Gaussians in 100 seconds and rendering at 100+ FPS. Our method leverages the strengths of 3D Gaussian Splatting, which provides an explicit and efficient representation of 3D humans. In contrast to prior NeRF-based methods, Human101 applies a Human-centric Forward Gaussian Animation to deform the parameters of the 3D Gaussians, substantially improving rendering speed (i.e., rendering 1024-resolution images at 60+ FPS and 512-resolution images at 100+ FPS). Experimental results indicate that our approach substantially outperforms current methods, achieving up to a 10× increase in frames per second while delivering comparable or superior rendering quality. Code and demos will be released at https://github.com/longxiang-ai/Human101.
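The reported speeds can be sanity-checked with a simple wall-clock benchmark: render a fixed number of frames, synchronize the GPU, and divide frame count by elapsed time. The sketch below assumes a hypothetical render_fn callable (any renderer that produces one frame per call); it is not part of the released Human101 code.

import time
import torch

def measure_fps(render_fn, num_frames=200, warmup=20):
    # render_fn: hypothetical zero-argument callable that renders one frame
    # on the GPU; shown only to illustrate how FPS numbers can be measured.
    for _ in range(warmup):          # warm up CUDA kernels and caches
        render_fn()
    torch.cuda.synchronize()         # finish all queued GPU work
    start = time.perf_counter()
    for _ in range(num_frames):
        render_fn()
    torch.cuda.synchronize()         # wait for the last frame to finish
    elapsed = time.perf_counter() - start
    return num_frames / elapsed

# Usage (assuming some model.render(pose, camera) exists):
# fps = measure_fps(lambda: model.render(pose, camera))
# print(f"{fps:.1f} FPS")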

Pipeline

Pipeline Image

Overview of Human101. (a) Canonical Human Initialization. We use an offline model to extract 4 point cloud parts from 4 selected frames and fuse them into a canonical point cloud, which is then converted into canonical Gaussians. (b) Human-centric Gaussian Forward Animation. We deform the canonical 3D Gaussians into the target pose by modifying the Gaussian positions (x), rotations (r), and scales (s), and we rotate the spherical harmonic coefficients according to the rotation of the corresponding triangle face. (c) Human-centric Gaussian Refinement. We refine the positions (x), rotations (r), and scales (s) of the Gaussians, as well as the view direction (d) used for the spherical harmonics.
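At its core, the forward animation in (b) is a per-Gaussian rigid transform driven by skinning weights. The following PyTorch sketch illustrates such a step under assumed tensor layouts (SMPL-style per-joint transforms, quaternion rotations, pytorch3d helpers); it is not the released Human101 implementation and omits the scale adjustment and spherical-harmonic rotation described above.

import torch
import torch.nn.functional as F
from pytorch3d.transforms import matrix_to_quaternion, quaternion_multiply

def animate_gaussians(x_c, q_c, skin_w, joint_R, joint_t):
    # x_c:     (N, 3)    canonical Gaussian centers
    # q_c:     (N, 4)    canonical Gaussian rotations, quaternions (w, x, y, z)
    # skin_w:  (N, J)    per-Gaussian skinning weights (e.g. from the nearest SMPL vertex)
    # joint_R: (J, 3, 3) per-joint rotation matrices of the target pose
    # joint_t: (J, 3)    per-joint translations of the target pose
    # All names and shapes are assumptions for illustration only.

    # Linear blend skinning: blend the per-joint transforms for each Gaussian.
    R = torch.einsum('nj,jab->nab', skin_w, joint_R)   # (N, 3, 3)
    t = skin_w @ joint_t                               # (N, 3)

    # Positions: apply the blended rigid transform to each center.
    x_p = torch.einsum('nab,nb->na', R, x_c) + t

    # Rotations: treat the blended matrix as a rotation (only approximate
    # after blending) and compose it with the canonical orientation.
    q_blend = matrix_to_quaternion(R)                  # (N, 4)
    q_p = F.normalize(quaternion_multiply(q_blend, q_c), dim=-1)

    # Scales are left unchanged in this sketch; the paper additionally
    # adjusts scales and rotates the spherical-harmonic coefficients
    # using the rotation of the associated triangle face.
    return x_p, q_p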

Training View Results

Side-by-side comparison on training views: GT vs. Ours vs. InstantNvr.

Novel View Results

Side-by-side comparison on novel views: GT vs. Ours vs. InstantNvr.

Bullet Time

Ours (sequence 377)

Application: Composite Rendering

BibTeX

@misc{li2023human101,
      title={Human101: Training 100+FPS Human Gaussians in 100s from 1 View},
      author={Mingwei Li and Jiachen Tao and Zongxin Yang and Yi Yang},
      year={2023},
      eprint={2312.15258},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}