Point Cloud Compression with Bits-back Coding

Nguyen Quang Hieu, Minh Nguyen, Dinh Thai Hoang, Diep N. Nguyen, and Eryk Dutkiewicz N. Q. Hieu, D. T. Hoang, D. N. Nguyen, and E. Dutkiewicz are with the School of Electrical and Data Engineering, University of Technology Sydney, Australia. M. Nguyen is with Fraunhofer FOKUS, Germany.
Abstract

This paper introduces a novel lossless compression method for compressing geometric attributes of point cloud data with bits-back coding. Our method specializes in using a deep learning-based probabilistic model to estimate the Shannon’s entropy of the point cloud information, i.e., geometric attributes of the 3D floating points. Once the entropy of the point cloud dataset is estimated with a convolutional variational autoencoder (CVAE), we use the learned CVAE model to compress the geometric attributes of the point clouds with the bits-back coding technique. The novelty of our method with bits-back coding specializes in utilizing the learned latent variable model of the CVAE to compress the point cloud data. By using bits-back coding, we can capture the potential correlation between the data points, such as similar spatial features like shapes and scattering regions, into the lower-dimensional latent space to further reduce the compression ratio. The main insight of our method is that we can achieve a competitive compression ratio as conventional deep learning-based approaches, while significantly reducing the overhead cost of storage and/or communicating the compression codec, making our approach more applicable in practical scenarios. Throughout comprehensive evaluations, we found that the cost for the overhead is significantly small, compared to the reduction of the compression ratio when compressing large point cloud datasets. Experiment results show that our proposed approach can achieve a compression ratio of 1.56 bit-per-point on average, which is significantly lower than the baseline approach such as Google’s Draco with a compression ratio of 1.83 bit-per-point.

Index Terms:
Deep learning methods, deep learning for visual perception.

I Introduction

Point cloud offers a flexible and visually rich representation of 3D data, with the ability to capture detailed spatial information. This makes point cloud invaluable in a wide range of applications, including automotive LiDAR for autonomous driving, 3D scene understanding in robotics, and immersive experiences in virtual reality and augmented reality. Despite the flexibility and usefulness of the point cloud, the unstructured format of the point cloud poses significant challenges in data storage and often requires efficient data compression techniques [1]. For example, a single sweep of LiDAR sensors in autonomous driving can produce 100,000 points, resulting in 84 billion points per day [2]. Moreover, a single point in the point cloud may require 32 or 64 bits to represent the geometric attributes, i.e., (x,y,z)𝑥𝑦𝑧(x,y,z)( italic_x , italic_y , italic_z ) position in a 3D coordinate, let alone the color attributes, resulting in a massive amount of data storage. However, the unordered nature of point clouds presents a significant challenge for efficient compression. Consequently, point cloud compression has emerged as a critical area of research, focusing on transforming raw point cloud data into a more structured format that can be more efficiently compressed [1].

Refer to caption
Figure 1: Overview of a point cloud compression pipeline. Decompression can be done in a reversed order.

A typical point cloud compression process can be illustrated in Fig. 1, beginning with “spatial processing”, which converts raw point cloud data into ordered formats such as tree-based structures or voxel grids. The spatial processing step is often a lossy process and is critical as it facilitates the subsequent transformation of the data into a binary sequence. Once organized in the tree-based or voxel format, the data can be processed by a lossless compression codec, which involves an important step of entropy estimation. The entropy estimation step is critical to quantify the uncertainty or randomness in the data, typically using statistical tools like histograms or machine learning models to approximate Shannon’s entropy [2, 3, 4, 5]. The lossless compression codec then utilizes the estimated entropy, typically conditioned on the marginal probability of the data, to produce the final binary sequence, as depicted in Fig. 1. The most widely used lossless compression codecs are arithmetic coding [6] and the recently developed asymmetric numeral systems (ANS) [7], thanks to their implementation simplicity and low compression overhead. The main idea of these codecs is to build the cumulative distribution functions (CDFs) and the inverse CDFs from the learned marginal distribution of data, denoted as P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) in Fig. 1. Once the learned distribution functions are shared between the encoder (sender) and decoder (receiver), data samples are mapped into a binary sequence by transforming the value P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) into an interval between 0 and 1 [7, 8].

Accurate entropy estimation is crucial, as it directly impacts the efficiency of lossless compression codecs like arithmetic coding and ANS. While considerable research works have focused on enhancing entropy estimation through advanced deep learning models, e.g., [3, 4, 5] and references therein, the resulting reduction in compression ratios (defined as the number of bits required to encode a single geometric attribute) has often been modest, underscoring the need for novel methods. A key limitation of current deep learning-based entropy estimation models for point clouds is the assumption that the decoder has access to the learned marginal probability of the data, denoted as P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ). In many practical scenarios, this assumption is unrealistic, as the encoder and decoder would need to store or communicate the marginal probability. The cost of doing so could exceed the size of the compressed data itself, particularly when the point cloud input scales to hundreds of thousands of points. This will be evaluated in Section III, when we investigate the overhead associated with conventional approaches, and demonstrate how to reduce this cost with bits-back coding.

In this work, we propose a novel technique for compressing point clouds based on bits-back coding [9, 10, 11]. Our approach leverages a latent variable model, where both the encoder and decoder employ a prior distribution over the model parameters to achieve efficient data compression. The core principle of bits-back coding is to encode and decode data by utilizing the prior, posterior, and likelihood from a trained variational autoencoder (VAE) [10, 11]. This approach is advantageous for compressing large point cloud sequences, which often exhibit significant correlation between consecutive point clouds, yet do not require access to the marginal probability of the data - an element that can introduce considerable overhead. As a result, our method results in low overhead costs in terms of codec communication and installation, making it a practical solution for real-world applications requiring efficient and scalable point cloud compression. In summary, our main contributions are as follows:

  • We first develop a simplistic variational autoencoder with 3D convolutional filters to estimate Shannon’s entropy of the voxelized point clouds. The main aim of using 3D convolutional filters is to subsequently reduce the input size of the voxel grids and efficiently project the 3D voxel grids into a lower dimensional representation. At its core, our CVAE model has a latent space with Gaussian prior, enabling the model to learn the correlation between point clouds in large datasets and later use of bits-back coding in the inference step.

  • We then develop a novel compression method based on bits-back coding to compress a batch of point clouds. The main idea of using bits-back coding is to reduce the compression ratio when the size of the point clouds increases. In return, bits-back coding pays an overhead cost of inserting latent variables into the compression process. However, we provide empirical evaluations to show that this overhead cost is trivial compared to the compression gain of the bits-back coding method, as the proposed method outperforms an industrial standard point cloud compression library Draco developed by Google [12].

II System Model and Proposed Methodology

II-A Spatial Processing

Recall that the main building blocks of a point cloud compression pipeline is illustrated in Fig. 1 and our main focus in this work is the entropy estimation step and lossless compression step. For the spatial processing step, we employ a simple voxelization processing, in which the entire point cloud is divided into equal-sized 3D cubes called “voxels”. A voxel can be considered as a counterpart of “pixel” in 2D image/video processing. At the end of the spatial processing step, a raw point cloud is transformed into M=2d×2d×2d𝑀superscript2𝑑superscript2𝑑superscript2𝑑M=2^{d}\times 2^{d}\times 2^{d}italic_M = 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT voxels in which d𝑑ditalic_d is the axis-wise depth value of the processed point cloud. For example, with d=7𝑑7d=7italic_d = 7, we use 7777 bits per dimension to represent the 27=128superscript271282^{7}=1282 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT = 128 positions of the points in a 3D coordinate. The resulting M𝑀Mitalic_M voxels now are well structured and can be used for the entropy estimation step. Note that further complicated spatial processing, such as density embedding [5] and building Octrees [2] or KD-trees [12], can be applied for more efficient data presentation but it’s not our main focus of this work as we will later show that with simplistic voxelization step described above, we can still outperform other tree-based approaches in terms of compression ratio.

II-B Entropy Estimation with CVAE Model

Refer to caption
Figure 2: (a) Our entropy estimation approach with the proposed Convolutional Variational Autoencoder (CVAE), (b) detailed architecture of the proposed CVAE, and (c) our bits-back coding approach.

In Fig. 2, we illustrate our approach to compress the processed point clouds, i.e., voxels, with bits-back coding. As illustrated in Fig. 2(a), we develop our entropy estimation model for M𝑀Mitalic_M voxels based on a convolutional variational autoencoder (CVAE). Note that the “CVAE’s encoder” and “CVAE’s decoder” terms in Fig. 2 should not be confused with the encoder-decoder terms used for the arithmetic coding and ANS codecs. We will use the terms “CVAE’s encoder” and “CVAE’s decoder” to explicitly refer to the components of the CVAE model, thus avoiding potential confusion with general encoder-decoder terms used for lossless compression codecs like arithmetic codec and ANS. It is also noted that, in this work, we only consider compressing the geometric attributes of the point clouds, i.e., the positional values of the floating points in the 3D coordinate, and discard the color information of the points to focus extensively on demonstrating the bits-back coding capabilities. We formulate the lossless compression problem of M𝑀Mitalic_M voxels as follows.

Given an alphabet 𝒜𝒜\mathcal{A}caligraphic_A (i.e., codebook) of voxels having value either 1 or 0, i.e., 𝒜={0,1}𝒜01\mathcal{A}=\{0,1\}caligraphic_A = { 0 , 1 }, a processed point cloud having M𝑀Mitalic_M voxels can be represented as a symbol 𝒙=[x1,x2,,xM]𝒙subscript𝑥1subscript𝑥2subscript𝑥𝑀\bm{x}=[x_{1},x_{2},\ldots,x_{M}]bold_italic_x = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ] (see footnote111Note that the symbol 𝒙𝒙\bm{x}bold_italic_x can be viewed as a flattened array of voxels for illustration purpose only. In actual implementation, a 3-dimensional array can be utilized for preserving the spatial structure of the point cloud data. below), in which each element xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT (1mM1𝑚𝑀1\leq m\leq M1 ≤ italic_m ≤ italic_M) receives value either 0 or 1, i.e., xm{0,1}subscript𝑥𝑚01x_{m}\in\{0,1\}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ { 0 , 1 }. In the context of entropy estimation, let’s define the real-valued probability of the variable xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT receiving value i𝑖iitalic_i (i=0,1𝑖01i=0,1italic_i = 0 , 1) as P(xm=i)𝑃subscript𝑥𝑚𝑖P(x_{m}=i)italic_P ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i ), where i=01P(xm=i)=1superscriptsubscript𝑖01𝑃subscript𝑥𝑚𝑖1\sum_{i=0}^{1}P(x_{m}=i)=1∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_P ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i ) = 1. We have the Shannon information content of an outcome xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is h(xm=i)=log1pisubscript𝑥𝑚𝑖1subscript𝑝𝑖h(x_{m}=i)=\log\frac{1}{p_{i}}italic_h ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i ) = roman_log divide start_ARG 1 end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG, and the binary entropy function of xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is

H(xm)=i=01P(xm=i)log1P(xm=i).𝐻subscript𝑥𝑚superscriptsubscript𝑖01𝑃subscript𝑥𝑚𝑖1𝑃subscript𝑥𝑚𝑖H(x_{m})=\sum_{i=0}^{1}P(x_{m}=i)\log\frac{1}{P(x_{m}=i)}.italic_H ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_P ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i ) roman_log divide start_ARG 1 end_ARG start_ARG italic_P ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i ) end_ARG . (1)

Note that from the above equation and hereafter we use log\logroman_log to denote the base two logarithm (i.e., usually denoted as log2subscript2\log_{2}roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) for notational simplicity. For compression of M𝑀Mitalic_M variable xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, where each xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is an element of the voxelized point cloud 𝒙𝒙\bm{x}bold_italic_x, the entropy of 𝒙𝒙\bm{x}bold_italic_x can be calculated by [8, Equation 8.1]:

H(𝒙)=xm𝒜P(x1,x2,,xM)log1P(x1,x2,,xM)𝐻𝒙subscriptsubscript𝑥𝑚𝒜𝑃subscript𝑥1subscript𝑥2subscript𝑥𝑀1𝑃subscript𝑥1subscript𝑥2subscript𝑥𝑀H(\bm{x})=\sum_{x_{m}\in\mathcal{A}}P(x_{1},x_{2},\ldots,x_{M})\log\frac{1}{P(% x_{1},x_{2},\ldots,x_{M})}\\ italic_H ( bold_italic_x ) = ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ caligraphic_A end_POSTSUBSCRIPT italic_P ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) roman_log divide start_ARG 1 end_ARG start_ARG italic_P ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) end_ARG (2)

In practice, one often requires probabilistic models to estimate the exact marginal probability P(𝒙)=P(x1,x2,,xM)𝑃𝒙𝑃subscript𝑥1subscript𝑥2subscript𝑥𝑀P(\bm{x})=P(x_{1},x_{2},\ldots,x_{M})italic_P ( bold_italic_x ) = italic_P ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) of the vector 𝒙=[x1,x2,,xM]𝒙subscript𝑥1subscript𝑥2subscript𝑥𝑀\bm{x}=[x_{1},x_{2},\ldots,x_{M}]bold_italic_x = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ]. Let’s denote P𝜽(xm=i|x1,,xm1)subscript𝑃𝜽subscript𝑥𝑚conditional𝑖subscript𝑥1subscript𝑥𝑚1P_{\bm{\theta}}(x_{m}=i|x_{1},\ldots,x_{m-1})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i | italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT ) being the estimated conditional probability with parameters 𝜽𝜽\bm{\theta}bold_italic_θ to the true conditional probability P(xm=i|x1,,xm1)𝑃subscript𝑥𝑚conditional𝑖subscript𝑥1subscript𝑥𝑚1P(x_{m}=i|x_{1},\ldots,x_{m-1})italic_P ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i | italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT ). Once the conditional probabilities are estimated, the approximate marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) can be calculated using the chain rule, i.e.,

P𝜽(𝒙)=m=1MP𝜽(xmx1,x2,,xm1).subscript𝑃𝜽𝒙superscriptsubscriptproduct𝑚1𝑀subscript𝑃𝜽conditionalsubscript𝑥𝑚subscript𝑥1subscript𝑥2subscript𝑥𝑚1P_{\bm{\theta}}(\bm{x})=\prod_{m=1}^{M}P_{\bm{\theta}}(x_{m}\mid x_{1},x_{2},% \dots,x_{m-1}).italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) = ∏ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT ) . (3)

As the conditional probabilities are now can be approximated with potential estimation errors, the code length of the compressed symbol 𝒙𝒙\bm{x}bold_italic_x is greater than the optimal code length, i.e., the entropy H(𝒙)𝐻𝒙H(\bm{x})italic_H ( bold_italic_x ) defined in (2). In particular, the average code length of the lossless compressed sequence 𝒙𝒙\bm{x}bold_italic_x with a probabilistic model 𝜽𝜽\bm{\theta}bold_italic_θ can be calculated by [8, Equation 5.23]:

L(𝒙)=H(𝒙)+xm𝒜P(𝒙)logP(𝒙)P𝜽(𝒙).𝐿𝒙𝐻𝒙subscriptsubscript𝑥𝑚𝒜𝑃𝒙𝑃𝒙subscript𝑃𝜽𝒙L(\bm{x})=H(\bm{x})+\sum_{x_{m}\in\mathcal{A}}P(\bm{x})\log\frac{P(\bm{x})}{P_% {\bm{\theta}}(\bm{x})}.italic_L ( bold_italic_x ) = italic_H ( bold_italic_x ) + ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ caligraphic_A end_POSTSUBSCRIPT italic_P ( bold_italic_x ) roman_log divide start_ARG italic_P ( bold_italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) end_ARG . (4)

In the above equation, the sum xm𝒙P(𝒙)logP(𝒙)P𝜽(𝒙)subscriptsubscript𝑥𝑚𝒙𝑃𝒙𝑃𝒙subscript𝑃𝜽𝒙\sum_{x_{m}\in\bm{x}}P(\bm{x})\log\frac{P(\bm{x})}{P_{\bm{\theta}}(\bm{x})}∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ bold_italic_x end_POSTSUBSCRIPT italic_P ( bold_italic_x ) roman_log divide start_ARG italic_P ( bold_italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) end_ARG is a Kullback-Leibler (KL) divergence, which is a non-negative value describing the relative distance between the true probability distribution P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) and the estimated probability distribution P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ). Intuitively, the code length L(𝒙)𝐿𝒙L(\bm{x})italic_L ( bold_italic_x ) (measured in bits) above can be minimized by minimizing the KL divergence, which is also known as the “free energy” calculated by logP(𝒙)P𝜽(𝒙)𝑃𝒙subscript𝑃𝜽𝒙\log\frac{P(\bm{x})}{P_{\bm{\theta}}(\bm{x})}roman_log divide start_ARG italic_P ( bold_italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) end_ARG. In other words, if P𝜽(𝒙)=P(𝒙)subscript𝑃𝜽𝒙𝑃𝒙P_{\bm{\theta}}(\bm{x})=P(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) = italic_P ( bold_italic_x ), we have L(𝒙)=H(𝒙)𝐿𝒙𝐻𝒙L(\bm{x})=H(\bm{x})italic_L ( bold_italic_x ) = italic_H ( bold_italic_x ), which means the probabilistic model with parameters 𝜽𝜽\bm{\theta}bold_italic_θ can correctly estimate entropy of the data, thus resulting in optimal compression ratio with minimum code length of H(𝒙)𝐻𝒙H(\bm{x})italic_H ( bold_italic_x ) bits.

In our work, we are interested in using a deep learning-based probabilistic model for estimating the marginal probability P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) of the voxels. In particular, we utilize a variational autoencoder (VAE) [13] as a core probabilistic model, similar to the approach in [10]. Note that the vanilla model with only fully connected layers in [10] is not applicable for processing 3D data like the voxels in this work. Thus, we develop a complete yet simplistic CVAE model in which 3D convolutional layers are staked on top of the fully connected layers to process the high-dimensional voxel data. The main idea of utilizing the 3D convolutional layers is to subsequently reduce the size of the input vectors before feeding into the fully connected layers. At the same time, the use of these convolutional layers is critical to capture and learn the spatial information of the voxel data. The model architecture of the Convolutional Variational Autoencoder (CVAE) is illustrated in Fig. 2(b). A detailed explanation of the architecture and the training procedure of the CVAE model can be found in Appendix A. As illustrated in Fig. 2(a), the CVAE model works as a probabilistic model to estimate the true probabilities P𝜽(𝒙)P(𝒙)subscript𝑃𝜽𝒙𝑃𝒙P_{\bm{\theta}}(\bm{x})\approx P(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) ≈ italic_P ( bold_italic_x ), with learnable parameters 𝜽𝜽\bm{\theta}bold_italic_θ (i.e., weights of the deep neural network).

In the literature, it has been shown that using probabilistic models with deep neural networks [2, 3, 4] can produce competitive compression ratios as the deep neural networks can estimate the probability P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) with high accuracy. The main idea of the works in [2, 3, 4], and references therein, is that, after learning the probabilistic model from the training data, the learned model is used to produce the probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) for the lossless compression codec, e.g., arithmetic codec or ANS codec. As shown in Fig. 2(c), in conventional methods in [2, 3, 4], the learned probabilistic model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) is used as the input of the arithmetic codec to produce the final binary sequence. The main process of a general arithmetic codec or ANS codec is building the cumulative distribution function (CDF) and inverse CDF for the encoder and decoder, respectively, based on the conditional probabilities in equation (3). These CDF and inverse CDF will be used as mapping functions that convert the probability of an input vector to the unit interval between 0 and 1, resulting in a compressed binary representation with expected code length described in equation (4) [6, 7].

In this work, we refer to the conventional methods in [2, 3, 4] as “sequential coding” as all the works utilize arithmetic coding with sequential process to compress a batch of point cloud (or voxels) data. The main idea of the sequential coding process is to iteratively compress the newly arrived input data into an existing compressed sequence, resulting in a final nested message (encoded data). This can be achieved by iteratively retrieving the conditional probabilities P𝜽(xm=i|x1,,xm1)subscript𝑃𝜽subscript𝑥𝑚conditional𝑖subscript𝑥1subscript𝑥𝑚1P_{\bm{\theta}}(x_{m}=i|x_{1},\ldots,x_{m-1})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_i | italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT ) to produce the final marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ), as described in equation (3). When it comes to sequential coding for a batch of B𝐵Bitalic_B different point clouds, each of which has M𝑀Mitalic_M voxels, the sequential coding may require B𝐵Bitalic_B different pairs of CDF-inverse CDF, resulting in B×M𝐵𝑀B\times Mitalic_B × italic_M conditional probabilities with large amounts of overhead cost of storage of communication. With this scheme, the potential correlation between the B𝐵Bitalic_B point clouds, such as shapes and spatial regions of the floating points, is not fully utilized, as the coding/decoding process strictly follows the chain rule to perform compression/decompression steps.

The overhead cost of requiring B×M𝐵𝑀B\times Mitalic_B × italic_M conditional probabilities is usually ignored in the literature works [2, 3, 4] as they usually assume the access to these conditional probabilities, or the equivalent CDFs-inverse CDFs, by sharing the pre-trained deep learning models at both encoder (sender) and decoder (receiver) sides. However, the use of the pre-trained deep learning model at the receiver may require additional cost of storing or communicating the codec decoder. The cost of this overhead may even exceed the cost of compressing the point cloud data itself, especially in the case of deeper neural networks, or large point cloud datasets. This motivates us to replace the sequential coding method with the novel bits-back coding method.

II-C Bits-back Coding for Voxelized Point Cloud Data

The bits-back coding method was first described by Frey and Hinton in [9], and was further developed by Townsend et al. in [10]. The main idea behind bits-back coding is that the encoder (sender) and decoder (receiver) of a given codec, e.g., arithmetic coding [6] or asymmetric numeral system (ANS) [7], will share a latent variable model P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ) (a prior), instead of the marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ). The marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) then can be inferred by using Bayes’ update rule with the prior P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ), the posterior Qϕ(𝒛𝒙)subscript𝑄bold-italic-ϕconditional𝒛𝒙Q_{\bm{\phi}}(\bm{z\mid x})italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ), and the likelihood P𝜽(𝒙𝒛)subscript𝑃𝜽conditional𝒙𝒛P_{\bm{\theta}}(\bm{x\mid z})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ), which will be described later. The use of a latent variable model is useful to exploit the “correlation”, or “side information”, between the data samples when compressing a batch of such data samples together. The correlation between data samples will gain advantages, in terms of compression ratio, when compressing large batches of data, which is usually the case of compressing data with deep learning-based probabilistic models. From the optimal coding rate principle through minimizing the KL divergence in (4), we formalize the bits-back coding approach as follows.

Refer to caption
Figure 3: Encode (compress) and decode (decompress) process of the bits-back coding. The rectangles with dashed lines denote the decrease in the code length of the message, and rectangles with solid lines denote the increase in the code length. The probability values above the rectangles are linked to the derivative in (6). The rectangles are placed in a stack data structure of ANS with the head of the stack on the left-hand side.

Starting from equation (4), we now introduce a latent variable 𝒛𝒛\bm{z}bold_italic_z which has a multivariate normal (Gaussian) prior P(𝒛)=𝒩(𝒛;0,𝑰)𝑃𝒛𝒩𝒛0𝑰P(\bm{z})=\mathcal{N}(\bm{z};0,\bm{I})italic_P ( bold_italic_z ) = caligraphic_N ( bold_italic_z ; 0 , bold_italic_I ), where 𝑰𝑰\bm{I}bold_italic_I is the identity matrix. As we now trying to infer true probability P(𝒙)𝑃𝒙P(\bm{x})italic_P ( bold_italic_x ) from our voxel data with a latent variable model, the marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) can be written as the total probability of observing 𝒙𝒙\bm{x}bold_italic_x under all possible values of the latent variable 𝒛𝒛\bm{z}bold_italic_z, i.e.,

P𝜽(𝒙)=𝒛P𝜽(𝒙,𝒛)𝑑𝒛.subscript𝑃𝜽𝒙subscript𝒛subscript𝑃𝜽𝒙𝒛differential-d𝒛\begin{split}P_{\bm{\theta}}(\bm{x})&=\int_{\bm{z}}P_{\bm{\theta}}(\bm{x,z})d% \bm{z}.\end{split}start_ROW start_CELL italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) end_CELL start_CELL = ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_, bold_italic_z ) italic_d bold_italic_z . end_CELL end_ROW (5)

By taking the log\logroman_log for the both sides of equation (5), introducing the approximate posterior probability Qϕ(𝒛𝒙)subscript𝑄bold-italic-ϕconditional𝒛𝒙Q_{\bm{\phi}}(\bm{z\mid x})italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ), and applying Jensen’s inequality (i.e., f(𝒛𝒛p(𝒛)𝑑𝒛)𝒛f(𝒛)p(𝒛)𝑑𝒛𝑓subscript𝒛𝒛𝑝𝒛differential-d𝒛subscript𝒛𝑓𝒛𝑝𝒛differential-d𝒛f\left(\int_{\bm{z}}\bm{z}p(\bm{z})d\bm{z}\right)\leq\int_{\bm{z}}f(\bm{z})p(% \bm{z})d\bm{z}italic_f ( ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT bold_italic_z italic_p ( bold_italic_z ) italic_d bold_italic_z ) ≤ ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_f ( bold_italic_z ) italic_p ( bold_italic_z ) italic_d bold_italic_z) [13], we have:

logP𝜽(𝒙)=log𝒛P𝜽(𝒙,𝒛)𝑑𝒛subscript𝑃𝜽𝒙subscript𝒛subscript𝑃𝜽𝒙𝒛differential-d𝒛\displaystyle\log P_{\bm{\theta}}(\bm{x})=\log\int_{\bm{z}}P_{\bm{\theta}}(\bm% {x,z})d\bm{z}roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) = roman_log ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_, bold_italic_z ) italic_d bold_italic_z (6)
=log𝒛Qϕ(𝒛𝒙)P𝜽(𝒙,𝒛)Qϕ(𝒛𝒙)𝑑𝒛absentsubscript𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙subscript𝑃𝜽𝒙𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙differential-d𝒛\displaystyle=\log\int_{\bm{z}}Q_{\bm{\phi}}(\bm{z\mid x})\frac{P_{\bm{\theta}% }(\bm{x,z})}{Q_{\bm{\phi}}(\bm{z\mid x})}d\bm{z}= roman_log ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) divide start_ARG italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_, bold_italic_z ) end_ARG start_ARG italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_ARG italic_d bold_italic_z
𝒛Qϕ(𝒛𝒙)log(P𝜽(𝒙,𝒛)Qϕ(𝒛𝒙))𝑑𝒛absentsubscript𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙subscript𝑃𝜽𝒙𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙differential-d𝒛\displaystyle\geq\int_{\bm{z}}Q_{\bm{\phi}}(\bm{z\mid x})\log\Big{(}\frac{P_{% \bm{\theta}}(\bm{x,z})}{Q_{\bm{\phi}}(\bm{z\mid x})}\Big{)}d\bm{z}≥ ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) roman_log ( divide start_ARG italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_, bold_italic_z ) end_ARG start_ARG italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_ARG ) italic_d bold_italic_z
=𝒛Qϕ(𝒛𝒙)[logP𝜽(𝒙𝒛)\displaystyle=\int_{\bm{z}}Q_{\bm{\phi}}(\bm{z\mid x})\Big{[}\log P_{\bm{% \theta}}(\bm{x\mid z})= ∫ start_POSTSUBSCRIPT bold_italic_z end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) [ roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z )
+logP(𝒛)logQϕ(𝒛𝒙)]d𝒛\displaystyle\qquad\qquad\qquad\qquad+\log P(\bm{z})-\log Q_{\bm{\phi}}(\bm{z% \mid x})\Big{]}d\bm{z}+ roman_log italic_P ( bold_italic_z ) - roman_log italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) ] italic_d bold_italic_z
=𝔼Qϕ(𝒛𝒙)[logP𝜽(𝒙𝒛)+logP(𝒛)logQϕ(𝒛𝒙)]absentsubscript𝔼subscript𝑄bold-italic-ϕconditional𝒛𝒙delimited-[]subscript𝑃𝜽conditional𝒙𝒛𝑃𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙\displaystyle=\mathbb{E}_{Q_{\bm{\phi}}(\bm{z\mid x})}\Big{[}\log P_{\bm{% \theta}}(\bm{x\mid z})+\log P(\bm{z})-\log Q_{\bm{\phi}}(\bm{z\mid x})\Big{]}= blackboard_E start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_POSTSUBSCRIPT [ roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ) + roman_log italic_P ( bold_italic_z ) - roman_log italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) ]
=𝔼Qϕ(𝒛𝒙)[logP𝜽(𝒙𝒛)]𝔼Qϕ(𝒛𝒙)[logQϕ(𝒛𝒙)P(𝒛)]absentsubscript𝔼subscript𝑄bold-italic-ϕconditional𝒛𝒙delimited-[]subscript𝑃𝜽conditional𝒙𝒛subscript𝔼subscript𝑄bold-italic-ϕconditional𝒛𝒙delimited-[]subscript𝑄bold-italic-ϕconditional𝒛𝒙𝑃𝒛\displaystyle=\mathbb{E}_{Q_{\bm{\phi}}(\bm{z\mid x})}\Big{[}\log P_{\bm{% \theta}}(\bm{x\mid z})\Big{]}-\mathbb{E}_{Q_{\bm{\phi}}(\bm{z\mid x})}\Big{[}% \log\frac{Q_{\bm{\phi}}(\bm{z\mid x})}{P(\bm{z})}\Big{]}= blackboard_E start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_POSTSUBSCRIPT [ roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ) ] - blackboard_E start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_ARG start_ARG italic_P ( bold_italic_z ) end_ARG ]

The lower bound of logP𝜽(𝒙)subscript𝑃𝜽𝒙\log P_{\bm{\theta}}(\bm{x})roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) above is the quantitative value of the “free energy” of the latent variable modeling process [9]. Its meaning is two-fold as described as follows. First, in a bits-back encoding process (see Fig. 3), this is equivalent to the change of the expected code length [10, 11] (i.e., line 5 of (6)):

ΔL=logP𝜽(𝒙𝒛)+logP(𝒛)logQϕ(𝒛𝒙).Δ𝐿subscript𝑃𝜽conditional𝒙𝒛𝑃𝒛subscript𝑄bold-italic-ϕconditional𝒛𝒙\Delta L=\log P_{\bm{\theta}}(\bm{x\mid z})+\log P(\bm{z})-\log Q_{\bm{\phi}}(% \bm{z\mid x}).roman_Δ italic_L = roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ) + roman_log italic_P ( bold_italic_z ) - roman_log italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) . (7)

Second, the evidence lower bound (ELBO) value of logP𝜽(𝒙)subscript𝑃𝜽𝒙\log P_{\bm{\theta}}(\bm{x})roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) gives an objective for optimizing the variational autoencoder (VAE) [13] (i.e., line 6 of (6)):

𝜽;ϕ=𝔼Qϕ(𝒛𝒙)[logP𝜽(𝒙𝒛)]𝔼Qϕ(𝒛𝒙)[logQϕ(𝒛𝒙)P(𝒛)].subscript𝜽bold-italic-ϕsubscript𝔼subscript𝑄bold-italic-ϕconditional𝒛𝒙delimited-[]subscript𝑃𝜽conditional𝒙𝒛subscript𝔼subscript𝑄bold-italic-ϕconditional𝒛𝒙delimited-[]subscript𝑄bold-italic-ϕconditional𝒛𝒙𝑃𝒛\mathcal{L}_{\bm{\theta;\phi}}=\mathbb{E}_{Q_{\bm{\phi}}(\bm{z\mid x})}\Big{[}% \log P_{\bm{\theta}}(\bm{x\mid z})\Big{]}-\mathbb{E}_{Q_{\bm{\phi}}(\bm{z\mid x% })}\Big{[}\log\frac{Q_{\bm{\phi}}(\bm{z\mid x})}{P(\bm{z})}\Big{]}.caligraphic_L start_POSTSUBSCRIPT bold_italic_θ bold_; bold_italic_ϕ end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_POSTSUBSCRIPT [ roman_log italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ) ] - blackboard_E start_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ) end_ARG start_ARG italic_P ( bold_italic_z ) end_ARG ] . (8)

Therefore, the main intention of the following methodology in this work is that we first train the CVAE model with training data to obtain the optimized parameters {𝜽,ϕ}𝜽bold-italic-ϕ\{\bm{\theta,\phi}\}{ bold_italic_θ bold_, bold_italic_ϕ } through minimizing 𝜽;ϕsubscript𝜽bold-italic-ϕ\mathcal{L}_{\bm{\theta;\phi}}caligraphic_L start_POSTSUBSCRIPT bold_italic_θ bold_; bold_italic_ϕ end_POSTSUBSCRIPT. After that, we will use the learned CVAE model to encode (compress) and decode (decompress) the voxel data, as illustrated in Fig. 3. It can be observed from the derivative in equations (6) and (7) that, the expected code length of the bits-back coding approach is now can be estimated using the likelihood P𝜽(𝒙𝒛)subscript𝑃𝜽conditional𝒙𝒛P_{\bm{\theta}}(\bm{x\mid z})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ), the prior P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ), and the approximate posterior probability Qϕ(𝒛𝒙)subscript𝑄bold-italic-ϕconditional𝒛𝒙Q_{\bm{\phi}}(\bm{z\mid x})italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z bold_∣ bold_italic_x ), instead of relying on the explicit chain rule of conditional probabilities as in (3), which can produce significant overhead as described earlier.

Note that the use of ANS in bits-back coding as an alternative to arithmetic coding in Fig. 2 is for compatible implementation purpose only. The main difference between using ANS codec and arithmetic coding codec is that the decoding process of ANS is last-in-first-out (LIFO) as described in Fig. 3, while the decoding process of arithmetic coding is first-in-first-out (FIFO) [7, 10]. There is no advantage in compression ratio when utilizing ANS over arithmetic coding. For more details of bits-back coding with ANS codec, please refer to [10] and [11]. In the next section, we evaluate the performance of bits-back coding in compressing large point cloud datasets, and show its superior performance in terms of compression ratios and lower overhead cost of storing and communicating the probabilistic decoder.

III Performance Evaluation

III-A Voxelization and Data Processing

Dataset ShapeNet Sun RGB-D
Shape of training set (8000, 20000, 3) (10335, 20000, 3)
Shape of test set (2000, 20000, 3) (2860, 20000, 3)
Bit-depth (d𝑑ditalic_d) 5, 6, 7 5, 6, 7
Data type float32 float32
Data range [-1.0, 1.0] [-1.0, 1.0]
TABLE I: Main parameters of the ShapeNet dataset [14] and SUN RGB-D dataset [15] used in this paper.

We first describe two point cloud datasets that we use to validate our bits-back coding approach, followed by the compression process as illustrated in Fig. 1. The point cloud datasets are the ShapeNet dataset [14] and SUN RGB-D dataset [15]. The ShapeNet dataset contains various 3D object models in different categories such as tables, cars, chairs, and airplanes. The SUN RGB-D dataset contains various raw point cloud data resulting from the reconstruction of RGB-D images, in which each RGB-D image is a pair of an RGB image and a depth image of an indoor scene. We create synthetic point clouds from these above datasets as follows. From the 3D objects in the ShapeNet dataset, we sample uniformly 20,000 points on the surface of the 3D mesh objects in the dataset. For the RGB-D dataset, we randomly sample 20,000 points from the existing raw point clouds. In other words, each point cloud in our synthetic dataset contains 20,000 points in a 3D coordinator. We use 32 bits of floating-point format (float32) to demonstrate each geometric attribute for each point within the cloud. The final synthetic point cloud dataset is rescaled and normalized within the interval [-1.0, 1.0]. This rescaling and normalization process follows standard practice in common deep learning-based approaches in [3, 4]. The information about the datasets is summarized in Table I.

After having the synthetic point cloud datasets, we can perform spatial processing with simple voxelization as follows. The interval between -1.0 and 1.0 of each x,y,z𝑥𝑦𝑧x,y,zitalic_x , italic_y , italic_z axis are equally divided into 2dsuperscript2𝑑2^{d}2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT sub-intervals, resulting in M=2d×2d×2d𝑀superscript2𝑑superscript2𝑑superscript2𝑑M=2^{d}\times 2^{d}\times 2^{d}italic_M = 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT voxels. If there is no point falls inside the region of the voxel, the value of that voxel is zero. Otherwise, the voxel’s value is set to 1. By applying the simple voxelization technique above, we can provide ordered input data ready for training our CVAE model. The voxelization technique described above is usually adopted as a processing step before training the entropy model [3, 4] or before building the tree-based compression techniques [2, 12]. For instance, by applying voxelization with bit-depth value d=7𝑑7d=7italic_d = 7 on the ShapeNet’s training set described above, we will have a voxelized dataset with the shape (8000, 128, 128, 128) for training our CVAE model. The same data transformation through voxelization applies to the other test sets. We can view the bit-depth parameter as a resolution adjustment parameter with a higher value, i.e., d=7𝑑7d=7italic_d = 7 results in a higher resolution voxelized point cloud, and vice versa.

III-B Experiment Results

III-B1 Entropy estimation with CVAE model

Refer to caption
Figure 4: Visualization of the original point cloud (first column), voxel representation of the point cloud (second column), reconstructed point cloud from the VAE (third column), and decompressed voxel representation (fifth column). The first and second rows illustrate a data sample from ShapeNet’s test set in different bit-depth values, i.e., d=6𝑑6d=6italic_d = 6 and d=7𝑑7d=7italic_d = 7, respectively. The third and fourth rows illustrate a data sample (a bedroom scene) from the SUN RGB-D’s test set with bit-depth values d=6𝑑6d=6italic_d = 6 and d=7𝑑7d=7italic_d = 7, respectively.

We visualize the main processing steps on point cloud datasets in Fig. 4. For illustration purposes, we take one data sample, i.e., a point cloud of an airplane, from the ShapeNet’s test set and one data sample, i.e., a point cloud of a bedroom scene, from the SUN RGB-D’s test set. In the second column, we observe that the higher value of bit-depth d𝑑ditalic_d results in a better voxel representation of the point cloud. In the third column, we visualize the reconstruction of the voxel data at the output of the CVAE model. The results show that the 3D shape of the objects and scenes are well preserved, but details such as the tail of the airplane and the structure of the bedroom scene are not well recovered. The reason is that the CVAE is optimized based on the variational lower bound of the training data. This means the CVAE will have better generation capabilities, i.e., preserving the 3D shape of the objects and scenes, but lack the fine details of some spatial regions. Note that the blurry voxel reconstruction of the CVAE only affects the compression ratio of the voxel data and since we are using a lossless compression method, the decompressed voxel representation is identical to the original voxel representation, as shown in the third and fifth columns of Fig. 4. This observation suggests that one can potentially develop a more complex entropy estimation model, i.e., building more complex deep neural networks, to gain the 3D spatial details at the third column in Fig. 4, similar to approaches in [3] and [4]. However, building more complex entropy models is not our main focus and we will let this as a potential research direction, as we can always build better models on top of our framework.

III-B2 Impacts of compressing large point cloud dataset

Refer to caption
Figure 5: Compression ratios (measured in average bit-per-point) of the methods on the Shapenet dataset. The lower the bit-per-point is, the lower the compression ratio (i.e., the ratio of compressed output sequence length to uncompressed input length) can be achieved.

Next, we evaluate the performance of the proposed bits-back coding scheme by varying the number of point clouds to be compressed in Fig. 5. We use two other approaches that are “sequential coding” and “Draco” as baselines for comparison. Sequential coding, as described in Section II-B, is the technique to sequentially compress the new data sample into an existing compressed message. This technique assumes that both the encoder and decoder can access the approximate probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ). This can be done by sharing the deep neural network’s parameters for the encoder and decoder. After that, the decoder can sequentially decode the received message by using the pre-trained neural network to produce all conditional probabilities, thus estimating the final marginal probability P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) using the chain rule in (3) [3, 4]. The second baseline is Draco, a compression library developed by Google that utilizes a tree-based entropy estimation [12]. In particular, after the voxelization step, Draco then builds a KD-tree (k-dimensional tree) for removing empty voxels and creating hierarchical regions (nodes), providing a quick way to estimate the entropy of the point cloud across different regions and scales.

In Fig. 5, we show that by increasing the number of point clouds to be compressed, the compression ratios (measured in average bit-per-point) of our bits-back coding approach and the sequential coding approach decrease quickly, meaning that we need a lower number of bits to compress the entire batch of point clouds. For example, by compressing 800 point clouds together, our bits-back coding approach uses 1.56 bits on average to compress a single geometric attribute of the point cloud data. This results in the compressed size of 800 points clouds is 3.1 MB, while the original size of 800 points clouds is 192 MB.

It can be observed that the sequential coding approach achieves a slightly lower compression ratio (lower bit-per-point value) than ours. The reason is that the sequential coding approach [3] [4] assumes that the encoder and decoder have direct access to the estimated probabilistic model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) on each decoding iteration. This can lead to a near-optimal compression ratio. However, this would lead to a significant amount of memory for storing a batch of individual conditional probabilities for encoding/decoding individual point clouds. We will later show that this approach is not suitable for many practical scenarios as the cost for storage or communication of the codecs may exceed the size of the compressed point clouds. On the other hand, the Draco approach achieves a good compression ratio at 1.8 bit-per-point on average and remains unchanged when the number of point clouds increases. This is because the tree-based approach of Draco for entropy estimation does not allow the sequential coding technique to be applied, meaning that the Draco approach can only compress each point cloud individually. As a result, we can observe that the bits-back coding and sequential coding approach achieve lower compression ratios when compressing more than 600 points cloud together.

III-B3 Impacts of bit-depth

As described above, the sequential coding approach [3, 4] can achieve a near-optimal compression ratio because the encoder and decoder have access to the approximate probability model in each compression iteration. In the following, we will analyze the drawbacks of this approach in practical scenarios and emphasize the advantages of our proposed bits-back coding approach. In Fig. 6, we evaluate the impacts of the bit-depth parameter d𝑑ditalic_d on the system performance in terms of compression ratio. In particular, as we increase the bit-depth value from d=5𝑑5d=5italic_d = 5 to d=7𝑑7d=7italic_d = 7, the compression ratio increases as shown in Fig. 6(a). The reason for this trend is the bit-depth parameter indicates the resolution of the voxel representation of the point cloud, as discussed in Section III-B1. As the resolution (i.e., the number of voxels) increases, the compression codec will need to assign a higher number of bits to encode the corresponding point cloud.

Note that the compression ratio on the ShapeNet dataset (left figure of Fig. 6(a)) is higher than that on the SUN RGB-D dataset (right figure). The difference in compression ratio is caused by the geometric patterns, or the difference in entropy, of the point cloud datasets. We observe that the point clouds in the SUN RGB-D dataset are more scattered around the x,y,z𝑥𝑦𝑧x,y,zitalic_x , italic_y , italic_z axes in the 3D space, while the point clouds in the ShapeNet dataset are densely distributed around the central region. The spatial patterns suggest that the uncertainty in the estimated entropy H(𝒙)𝐻𝒙H(\bm{x})italic_H ( bold_italic_x ) result in a difference in the average code length, as described in equation (4).

In addition, we observe that the compression ratio’s scaling trend of the Draco approach is approximately linear, while the bits-back coding and sequential coding approach exhibit worse scaling with a higher value of bit-depth, i.e., d=7𝑑7d=7italic_d = 7. This indicates that the tree-based entropy estimation of Draco is more suitable for higher resolution of voxel representation. Similar findings have been shown in [2]. This suggests that, although our bits-back coding approach achieves a lower compression ratio with lower resolution values compared to Draco, there is still room for improvement. For example, our approach can be extended by incorporating tree-based spatial processing like Octree [2] or KD-tree [12] to achieve a lower compression ratio. However, this approach may further complicate the deep neural network architecture and as we stated earlier, this is out of the scope of our main contribution and can be considered as a fruitful future direction.

Refer to caption
(a)
Refer to caption
(b)
Figure 6: (a) Compression ratios of the methods when bit-depth (resolution) increases, and (b) decoder sizes of the methods when bit-depth increases. All the results are obtained when compressing a batch of 1,000 point clouds.

Next, we evaluate the decoder sizes (measured in MegaBytes) with respect to the bit-depth value in Fig. 6(b). The decoder sizes of the approaches are calculated by the memory storage required to store the decoder of the codec, which is essential for decoding the compressed data. In the case that the sender and the receiver are separated, e.g., transmitting data over the Internet or a communication channel, the decoder may also need to be exchanged to pre-installed at the receiver, alongside the compressed data. As shown in Fig. 6(b), the decoder’s size of the sequential coding approach is significantly higher than that of the bits-back coding and Draco approaches. With the high bit-depth value d=7𝑑7d=7italic_d = 7, the overhead of the decoder size of the sequential coding approach is over 104superscript10410^{4}10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT MB, while the overhead values of the bits-back coding and Draco approaches are less than 10101010 MB. The reason is that for compressing a batch of 1,000 point clouds together, the sequential coding approach requires access to the probabilistic model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) at each decoding iteration. In other words, the low compression ratio of the sequential coding approach comes from the assumption that the receiver has access to the learned entropy, i.e., P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) model, at each decoding step, which might not be practical as it requires huge storage and communication overhead.

The proposed bits-backs coding approach, on the other hand, does not experience the high overhead cost of the decoder size. The bits-back coding does not rely on the assumption of having access to the marginal probability model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) at the decoding steps. Instead, the bits-back coding decoder utilizes the prior information P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ), likelihood P𝜽(𝒙𝒛)subscript𝑃𝜽conditional𝒙𝒛P_{\bm{\theta}}(\bm{x\mid z})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x bold_∣ bold_italic_z ), and posterior Qϕ(𝒛𝒙)subscript𝑄bold-italic-ϕconditional𝒛𝒙Q_{\bm{\phi}}(\bm{z}\mid\bm{x})italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_italic_z ∣ bold_italic_x ) to decode the compressed data. Let’s take the “Decode” operation of the bits-back coding approach in Fig. 3 as the visualized process. First, the decoder retrieves the latent variable model 𝒛𝒛\bm{z}bold_italic_z by decoding the compressed message with the prior P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ). Recall that the prior P(𝒛)𝑃𝒛P(\bm{z})italic_P ( bold_italic_z ) is a simple multivariate normal distribution 𝒩(𝒛;0,𝑰)𝒩𝒛0𝑰\mathcal{N}(\bm{z};0,\bm{I})caligraphic_N ( bold_italic_z ; 0 , bold_italic_I ), where 𝑰𝑰\bm{I}bold_italic_I is the identity matrix of size 50×50505050\times 5050 × 50, and 𝒛𝒛\bm{z}bold_italic_z is the output value of the last layer of the CVAE’s encoder, as illustrated in Fig. 2(b). This means that the latent dimension of our CVAE is 50 and is significantly lower than the dimension of the input/output data (i.e., 2d×2d×2dsuperscript2𝑑superscript2𝑑superscript2𝑑2^{d}\times 2^{d}\times 2^{d}2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT). Once the latent variable is retrieved, the message can be further decoded by retrieving the variable 𝒙𝒙\bm{x}bold_italic_x through the learned likelihood P𝜽(𝒛)P_{\bm{\theta}}(\cdot\mid\bm{z})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( ⋅ ∣ bold_italic_z ). Finally, the third step is encoding the variable 𝒙𝒙\bm{x}bold_italic_x into the existing message with the learned posterior Qϕ(𝒙)Q_{\bm{\phi}}(\cdot\mid\bm{x})italic_Q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( ⋅ ∣ bold_italic_x ). As described in the decoding process above, unlike the sequential coding approach, the decoder of the bits-back coding approach does not need the marginal probability model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ), thus resulting in lower overhead cost caused by accessing the marginal probability model at each decoding iteration. By utilizing the lower-dimensional latent variable, along with the posterior during the encoding-decoding process, bits-back coding enables efficient decoding while effectively handling correlations between data samples, especially when compressing large batches of point clouds.

The lowest overhead cost of the Draco approach is reflected by its tree-based spatial processing. In particular, the decoder of the Draco approach may require only the information of the tree data structure such as the head of the KD-tree and the depth of the tree, to fully recover the tree data structure for later decoding steps. For this, the decoding can be done by tracing the paths through the tree to recover the geometric attributes of the point clouds. However, the lightweight capability of Draco’s decoder results in an inefficient compression ratio when encoding a large batch of point clouds as shown in Fig. 5. Notably, our proposed bits-back coding approach might be potentially combined with tree-based spatial processing with an ANS codec like Draco to achieve better compression ratios at high bit-depth values. This may require a complex entropy estimation model, e.g., using multiple deep neural networks together, for encoding and decoding serialized tree data [2].

IV Conclusion

In this work, we have proposed a novel bits-back coding method for compressing large point cloud datasets. The novelty of the approach is based on the bits-back coding method, which utilizes the latent variable model over the model parameters to encode and decode point cloud data. By utilizing the prior, likelihood, and approximate posterior of the CVAE model during the decoding step, our approach achieves a competitive compression ratio compared to the sequential coding approach with deep neural networks. The use of the bits-back coding method enables us to achieve a comparable compression ratio to the other sequential coding with deep learning models while consuming lower memory and communication costs. Notably, the bits-back coding approach does not require access to the learned entropy model P𝜽(𝒙)subscript𝑃𝜽𝒙P_{\bm{\theta}}(\bm{x})italic_P start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x ) during the decoding process, thus enabling lightweight and practical decoder implementation. Future research directions could be adopting advanced spatial processing techniques, such as building tree-based models, using posterior estimation with flow-based models, and building deeper neural networks for better entropy estimation.

Appendix A Architecture of the Proposed CVAE

In particular, the CVAE’s encoder consists of three Convo3D (3D convolution) layers, followed by two fully-connected layers. The main idea of using stacked Convo3D layers is to subsequently reduce the dimensions of the input data before feeding it into the fully connected layers. For high-dimensional data like our voxels, where the bit-depth value d=7𝑑7d=7italic_d = 7 results in dimensions of M=27×27×27𝑀superscript27superscript27superscript27M=2^{7}\times 2^{7}\times 2^{7}italic_M = 2 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT, the input data expands to a flattened array with 221superscript2212^{21}2 start_POSTSUPERSCRIPT 21 end_POSTSUPERSCRIPT features (more than 2 million). Processing such a large flattened array using a vanilla VAE with only fully-connected layers becomes impractical due to the immense computational demand. Instead, by using the Convo3D layers, the input dimension is reduced from M𝑀Mitalic_M to 2,000 features before forwarding these 2,000 features into the fully connected layers. The parameters of each Convo3D and fully-connected layer are illustrated in detail in Fig. 2(b). In this work, our implementation of different bit-depth values with d=5𝑑5d=5italic_d = 5, d=6𝑑6d=6italic_d = 6, and d=7𝑑7d=7italic_d = 7, with only simple modifications of the kernel size and stride parameters of the Convo3D layers. For higher bit-depth (i.e., d>7𝑑7d>7italic_d > 7), it is possible to further implement such bigger and deeper models. In return, it would require large computing power and running time while does not provide further insights from our work, which extensively focuses on investigating the potential of the bits-back coding.

After passing the M𝑀Mitalic_M voxel features through the CVAE’s encoder, the number of features is reduced to 500, which is used for constructing a latent space model that consists of 50 features. The addition module before the CVAE’s latent space in Fig. 2(b) is to illustrate the reparameterization trick, in which the latent space is subjected to a Gaussian prior, i.e., P(𝒛)𝒩(𝒛;0,𝐈)similar-to𝑃𝒛𝒩𝒛0𝐈P(\bm{z})\sim\mathcal{N}(\bm{z};0,\mathbf{I})italic_P ( bold_italic_z ) ∼ caligraphic_N ( bold_italic_z ; 0 , bold_I ) [13]. Following the latent space layer are two fully connected layers with the same sizes as the ones of the CVAE’s encoder. Finally, the last three layers are the ConvoTran3D (3D transposed convolution) layers. The ConvoTran3D layers are utilized to subsequently increase the dimensional features of the input vector, so that the final output of the CVAE model has the same dimension, i.e., M𝑀Mitalic_M, as the input vector. We then train the CVAE for the two datasets described in Table I. The training process is done by using Adam optimizer with a learning rate of 0.001. The loss function is conventional ELBO in (8). The activation function used across the hidden layers is ReLu. The final output layer’s activation function is Sigmoid as we are modeling the binarized voxel data. The Sigmoid output of the CVAE model then can be used as a parametric Bernoulli distribution to produce the voxels having values of either 0 or 1. We train the CVAE model for 500 training epochs. After training, the CVAE model can produce an estimated probability for the input voxel as shown in Fig. 4. Our details of implementation and reproducibility can be found at https://github.com/hieunq95/gpcc-bits-back.

References

  • [1] C. Cao, M. Preda, and T. Zaharia, “3d point cloud compression: A survey,” in Proceedings of the 24th International Conference on 3D Web Technology, Jul. 2019, pp. 1–9.
  • [2] L. Huang, S. Wang, K. Wong, J. Liu, and R. Urtasun, “Octsqueeze: Octree-structured entropy model for lidar compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp. 1313–1323.
  • [3] D. T. Nguyen, M. Quach, G. Valenzise, and P. Duhamel, “Learning-based lossless compression of 3d point cloud geometry,” in IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, Jun. 2021, pp. 4220–4224.
  • [4] J. Wang, H. Zhu, H. Liu, and Z. Ma, “Lossy point cloud geometry compression via end-to-end learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4909–4923, Jun. 2021.
  • [5] Y. He, X. Ren, D. Tang, Y. Zhang, X. Xue, and Y. Fu, “Density-preserving deep point cloud compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2333–2342.
  • [6] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
  • [7] J. Duda, “Asymmetric numeral systems: Entropy coding combining speed of huffman coding with compression rate of arithmetic coding,” arXiv preprint arXiv:1311.2540, 2013.
  • [8] D. J. MacKay, Information Theory, Inference and Learning Algorithms.   Cambridge university press, 2003.
  • [9] B. J. Frey and G. E. Hinton, “Free energy coding,” in Proceedings of Data Compression Conference.   IEEE, Mar. 1996, pp. 73–81.
  • [10] J. Townsend, T. Bird, and D. Barber, “Practical lossless compression with latent variables using bits back coding,” in International Conference on Learning Representations, May 2019.
  • [11] J. Townsend, T. Bird, J. Kunze, and D. Barber, “Hilloc: lossless image compression with hierarchical latent variable models,” in International Conference on Learning Representations, Apr. 2020.
  • [12] F. Galligan, M. Hemmer, O. Stava, F. Zhang, and J. Brettle, “Google/draco: A library for compressing and decompressing 3d geometric meshes and point clouds,” 2018. [Online]. Available: https://github.com/google/draco
  • [13] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” International Conference on Learning Representations, Apr. 2014.
  • [14] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3D model repository,” Stanford University — Princeton University — Toyota Technological Institute at Chicago, Tech. Rep. arXiv:1512.03012 [cs.GR], 2015.
  • [15] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015, pp. 567–576.