


João Regateiro 1,2*, Marco Volino 1 and Adrian Hilton 1

1 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, United Kingdom
2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes)

This paper introduces Deep4D, a compact generative representation of shape and appearance from captured 4D volumetric video sequences of people. 4D volumetric video achieves highly realistic reproduction, replay and free-viewpoint rendering of actor performance from multiple-view video acquisition systems. A deep generative network is trained on 4D video sequences of an actor performing multiple motions to learn a generative model of the dynamic shape and appearance. We demonstrate that the proposed generative model can provide a compact encoded representation capable of high-quality synthesis of 4D volumetric video with two orders of magnitude compression. A variational encoder-decoder network is employed to learn an encoded latent space that maps from 3D skeletal pose to 4D shape and appearance. This enables high-quality 4D volumetric video synthesis to be driven by skeletal motion, including skeletal motion capture data. The encoded latent space supports the representation of multiple sequences with dynamic interpolation to transition between motions. We therefore introduce Deep4D motion graphs, a direct application of the proposed generative representation. Deep4D motion graphs allow real-time interactive character animation whilst preserving the plausible realism of movement and appearance from the captured volumetric video. They implicitly combine multiple captured motions in a unified representation for character animation from volumetric video, allowing novel character movements to be generated with dynamic shape and appearance detail.

Volumetric video is an emerging medium that allows free-viewpoint rendering and replay of dynamic scenes with visual quality approaching that of the captured video. It is produced in multi-camera performance capture studios that generally consist of synchronised cameras which simultaneously record a performance (Collet et al., 2015; Starck and Hilton, 2007; de Aguiar et al., 2008; Carranza et al., 2003). This has the potential to allow highly realistic content production for immersive virtual and augmented reality experiences.
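To make the variational encoder-decoder idea concrete, the following is a minimal numpy sketch of the pipeline the abstract describes: a skeletal pose is encoded to a Gaussian latent code, sampled via the reparameterisation trick, and decoded to a shape-and-appearance vector, with latent interpolation standing in for motion transitions. All dimensions, the single linear layers, and the random weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 25 joints x 3 coordinates in, an 8-D latent space,
# and a small flattened "shape + appearance" vector out.
POSE_DIM, LATENT_DIM, OUT_DIM = 75, 8, 128

# Random weights stand in for trained network parameters.
W_mu = rng.normal(0.0, 0.1, (POSE_DIM, LATENT_DIM))
W_logvar = rng.normal(0.0, 0.1, (POSE_DIM, LATENT_DIM))
W_dec = rng.normal(0.0, 0.1, (LATENT_DIM, OUT_DIM))

def encode(pose):
    """Map a 3D skeletal pose to the mean/log-variance of a Gaussian latent code."""
    return pose @ W_mu, pose @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so training gradients can flow through mu/logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Decode a latent code to a bounded shape-and-appearance vector."""
    return np.tanh(z @ W_dec)

# Drive a frame of "volumetric video" from one skeletal pose.
pose = rng.standard_normal(POSE_DIM)
mu, logvar = encode(pose)
frame = decode(reparameterize(mu, logvar))

# Interpolating between two latent codes is the mechanism that allows
# dynamic transitions between captured motions.
mu2, _ = encode(rng.standard_normal(POSE_DIM))
frame_blend = decode(0.5 * mu + 0.5 * mu2)
```

The decoded vector here is a stand-in for the dense 4D shape and appearance output; the point is only the pose-to-latent-to-output flow and the interpolable latent space.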
