ROOTS: object-centric representation and rendering of 3D scenes. (English) Zbl 07626774

Summary: A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations. Recent works either achieve object-centric generation but without the ability to infer the representation, or achieve 3D scene representation learning but without object-centric compositionality. Therefore, learning to both represent and render 3D scenes with object-centric compositionality remains elusive. In this paper, we propose a probabilistic generative model for learning to build modular and compositional 3D object models from partial observations of a multi-object scene. The proposed model can (i) infer the 3D object representations by learning to search and group object areas, and also (ii) render from an arbitrary viewpoint not only individual objects but also the full scene by compositing the objects. The entire learning process is unsupervised and end-to-end. In experiments, in addition to generation quality, we also demonstrate that the learned representation permits object-wise manipulation and novel scene generation, and generalizes to various settings. Results can be found on our project website: https://sites.google.com/view/roots3d.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
