Unsupervised Object Detection
with Theoretical Guarantees

Marian Longa
Visual Geometry Group
University of Oxford
mlonga@robots.ox.ac.uk
João F. Henriques
Visual Geometry Group
University of Oxford
joao@robots.ox.ac.uk
Abstract

Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to quantifiable small shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We analyse in detail how the error depends on each of these variables and perform synthetic experiments validating our theoretical predictions up to a precision of individual pixels. We also perform experiments on CLEVR-based data and show that, unlike current SOTA object detection methods (SAM, CutLER), our method’s prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.

1 Introduction

Unsupervised object detection using deep neural networks is a long-standing area of research at the intersection of machine learning and computer vision. Its aim is to learn to detect objects from images without the use of training labels. Learning without supervision has multiple advantages, as obtaining labels for training data is often costly and time consuming, and in some cases may be impractical or unethical. For example, in medical imaging, unsupervised object detection can help save specialists’ time by automatically flagging suspicious abnormalities schlegl2017anogan , and in autonomous driving it may help automatically detect pedestrians on a collision course with the vehicle dairi2018obstacledetection . It is thus important to understand and develop better unsupervised object detection methods.

While successful, current object detection methods are often empirical and possess few to no guarantees about their learned representations. In this work we aim to address this gap by designing the first unsupervised object detection method that we prove is guaranteed to learn the true object positions up to small shifts, and performing a detailed analysis of how the maximum errors of the learned object positions depend on the encoder and decoder receptive field sizes, the object sizes, and the sizes of the Gaussians used for rendering. This is especially important in sensitive domains such as medicine, where incorrectly detecting an object could be costly. Our method guarantees to detect any object that moves in a video or that appears at different locations in images, as long as the objects are distinct and the images are reconstructed correctly.

We base our unsupervised object detection method on an autoencoder with a convolutional neural network (CNN) encoder and decoder, and modify it to make it exactly translationally equivariant (sec. 3). This allows us to interpret the latent variables as object positions and lets us train the network without supervision. We then use the equivariance property to formulate and prove a theorem that relates the maximum position error of the learned latent variables to the size of the encoder and decoder receptive fields, the size of the objects, and the width of the Gaussian used in the decoder (sec. 4). Next, we derive corollaries describing the exact form of the maximum position error as a function of these four variables. These corollaries can be used as guidelines when designing unsupervised object detection networks, as they describe the guarantees of the learned object positions that can be obtained under different settings. We then perform synthetic and CLEVR-based johnson2017clevr experiments to validate our theory (sec. 5). Finally, we discuss the implications of our results for designing reliable object detection methods (sec. 6).

Concretely, the contributions of this paper are:

  1. An unsupervised object detection method that is guaranteed to learn the true object positions up to small shifts.

  2. A proof and detailed theoretical analysis of how the maximum position error of the method depends on the encoder and decoder receptive field sizes, object sizes, and widths of the Gaussians used in the rendering process.

  3. Synthetic experiments, CLEVR-based experiments, and real video experiments validating our theoretical results up to precisions of individual pixels.

2 Related Work

Object Detection.

Object detection is an area of research in computer vision and machine learning, dealing with the detection and location of objects in images. Popular supervised approaches to object detection include Segment Anything (SAM) kirillov2023sam , Mask R-CNN he2017maskrcnn , U-Net ronneberger2015unet and others goldman2019precise . While successful, these methods typically require large amounts of annotated segmentation masks and bounding boxes, which may be costly or impossible to obtain in certain applications. Popular unsupervised and self-supervised object detection methods include CutLER wang2023cutler , Slot Attention locatello2020slotattention , MoNet burgess2019monet and others greff2019multi . These methods aim to learn object-centric representations for object detection and segmentation without using training labels. Finally, unsupervised object localisation methods such as FOUND simeoni2023found and others simeoni2024survey aim to localise objects in images, typically using vision transformer (ViT) self-supervised features. Compared to both current supervised and unsupervised object detection and localisation methods, our work is the only one that has provable theoretical guarantees of recovering the true object positions up to small shifts. It also requires no supervision.

Identifiability in Representation Learning.

Identifiability in representation learning refers to the issue of being able to learn a latent representation that uniquely corresponds to the true underlying latent representation used in the data generation process. Some recent works aim to reduce the space of indeterminacies of the learned representations, and thus achieve identifiability, by incorporating various assumptions into their models. Xi et al. xi2023identifiabilitycharacterization categorise these assumptions for generative models into constraints on the distribution of the latent variables and constraints on the generator function. Some of their categories include non-Gaussianity of the latent distribution shimizu2006lingam , dependence on an auxiliary variable hyvarinen2016nonlinearica ; hyvarinen_td_2017 , use of multiple views locatello2020multipleviews , use of interventions brehmer2022interventions ; lippe2022citris , use of mechanism sparsity lachapelle2022sparsity , and restrictions on the Jacobian of the generator gresele2021ima . In contrast, in our work we achieve identifiability by making our network equivariant to translations, imposing an interpretable latent space structure, and requiring the data to obey our theorem’s assumptions.


Figure 1: Network architecture. Encoder: (1) an image $x$ is passed through a CNN $\psi$ to obtain $n$ embedding maps $e_1, \dots, e_n$, (2) a maximum of each map is found using softargmax to obtain latent variables $[z_{1,x}, z_{1,y}, \dots, z_{n,x}, z_{n,y}]$. Decoder: (1) Gaussians $\hat{e}_1, \dots, \hat{e}_n$ are rendered at the positions given by the latent variables, (2) the Gaussian maps are concatenated with positional encodings and passed through a CNN $\phi$ to obtain the predicted image $\hat{x}$. Finally, $x$ and $\hat{x}$ are used to compute the reconstruction loss $\mathcal{L}(\hat{x}, x)$.

3 Method

In this section we describe the proposed method for unsupervised object detection with guarantees. On a high level, our architecture is based on an autoencoder that is fully equivariant to translations, which we achieve by making the encoder consist of a CNN followed by a soft argmax function to extract object positions, and making the decoder consist of a Gaussian rendering function followed by another CNN to reconstruct an image from the object positions (fig. 1). In the following sections we describe the different parts of the architecture in detail.

Autoencoder with CNN Encoder and Decoder.

We start with an autoencoder, a standard unsupervised representation learning model, consisting of an encoder network $\psi$ that maps an image $x$ to a low-dimensional latent variable $z$, followed by a decoder network $\phi$ that maps this variable back to an image $\hat{x}$, with the objective of minimising the difference between $x$ and $\hat{x}$. Typically, the encoder and decoder networks are parametrised by multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs) paired with fully-connected (FC) layers. However, neither of these parametrisations can by default guarantee that the learned latent variables will correspond to the true object positions (because of the universal approximation ability of MLPs and FC layers hornik1989multilayer ). To obtain such guarantees, we would thus like to modify the autoencoder to make it exactly translationally equivariant: a shift of an object in the input image $x$ should correspond to a proportional shift of the latent variable $z$, and a shift of the latent variable $z$ should correspond to a shift in the predicted image $\hat{x}$.

We start with an autoencoder where the encoder and decoder are both CNNs. CNNs consist of layers computing the convolution between a feature map $x$ and a filter $F$, defined in one dimension as

$$(x \star F)[i] = \sum\nolimits_{j} x[j]\, F[j-i] \tag{1}$$

Intuitively, this corresponds to sliding the filter $F$ across the feature map $x$ and, at each position $i$ of the filter, computing the dot product between the feature map $x$ and the filter $F$. We can prove that convolutional layers are equivariant to translations, since

$$((\tau \circ x) \star F)[i] = \sum_{j} x[j-t]\, F[j-i] = \sum_{j} x[j]\, F[j-(i-t)] = \tau \circ (x \star F)[i] \tag{2}$$

where $\tau$ is the translation operator that translates a feature map by $t$ pixels, and we have used the substitution $j \rightarrow j+t$ at the second equality. Therefore, the encoder and decoder are both equivariant to translations, but this property only holds for translations of feature maps (i.e. spatial tensors).
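The equivariance in eq. 2 can be checked numerically. The following is a minimal sketch rather than part of the method's implementation: the helper names `cross_correlate` and `translate` are hypothetical, and circular (wrap-around) indexing is assumed so that boundary effects do not break exact equality.

```python
def cross_correlate(x, f):
    """Eq. 1 in 1D: (x * F)[i] = sum_j x[j] F[j - i].

    The filter index k = j - i runs over 0..len(f)-1, and x is indexed
    circularly (an assumption made here to avoid boundary effects).
    """
    n = len(x)
    return [sum(x[(i + k) % n] * f[k] for k in range(len(f))) for i in range(n)]


def translate(x, t):
    """Eq. 3 (left): (tau x)[i] = x[i - t], again with circular indexing."""
    n = len(x)
    return [x[(i - t) % n] for i in range(n)]


# Illustrative feature map, filter, and shift (arbitrary values).
x = [0, 0, 1, 3, 1, 0, 0, 0]
f = [1, 2, 1]
t = 3

shift_then_filter = cross_correlate(translate(x, t), f)
filter_then_shift = translate(cross_correlate(x, f), t)

# Both orders of operations give the same feature map, as eq. 2 states.
assert shift_then_filter == filter_then_shift
```

Integer inputs are used so that the comparison is exact; with floating-point values the two sides agree up to rounding.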

From Encoder Feature Maps to Latent Variables.

So far we have only worked with images and feature maps, but the latter do not directly express positions of any detected objects. It would be preferable to convert these feature maps into scalar variables that can be interpreted as object positions that are equivariant to image translations. To do this, we first define a translation $\tau$ of a (1D) feature map $x$ and a translation $\tau'$ of a scalar $z$ as

$$\tau(x)[i] = x[i-t], \quad \tau'(z) = z + t \tag{3}$$

where $i$ is the position in the feature map $x$, $\tau$ shifts an image by $t$ pixels, and $\tau'$ shifts a scalar by $t$ units. To relate translations in feature maps to translations in latent variables, we can use a function that computes a scalar property of a feature map $x$, such as $\mathrm{argmax}$, defined as $\mathrm{argmax}(x) = \{i : x[j] \leq x[i]\ \forall j\}$. Using these definitions we can now prove the equivariance of $\mathrm{argmax}$, i.e. that shifting the feature map $x$ by $\tau$ corresponds to shifting the latent variable $\mathrm{argmax}(x)$ by $\tau'$:

$$\begin{aligned}
\mathrm{argmax}(\tau \circ x) &= \{i : \tau \circ x[j] \leq \tau \circ x[i]\ \forall j\} = \{i : x[j-t] \leq x[i-t]\ \forall j\} \\
&= \{i+t : x[j] \leq x[i]\ \forall j\} = \mathrm{argmax}(x) + t = \tau' \circ \mathrm{argmax}(x)
\end{aligned} \tag{4}$$

where at the first equality we use the definition of $\mathrm{argmax}$, at the second equality we use the definition of $\tau$ (eq. 3, left), at the third equality we use the substitution $i \rightarrow i+t$, at the fourth equality we use the definition of $\mathrm{argmax}$, and at the last equality we use the definition of $\tau'$ (eq. 3, right).
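As a small illustration of eq. 4, the sketch below shifts a 1D feature map and confirms that the index of its maximum shifts by the same amount. The helper names are hypothetical, and the example assumes a unique maximum and a shift that does not wrap the maximum around the boundary.

```python
def argmax(x):
    """Index of the maximum of a 1D feature map (first index if there are ties)."""
    return max(range(len(x)), key=lambda i: x[i])


def translate(x, t):
    """(tau x)[i] = x[i - t], using circular indexing for simplicity."""
    n = len(x)
    return [x[(i - t) % n] for i in range(n)]


# Illustrative feature map with a unique maximum at index 2.
x = [0.1, 0.5, 2.0, 0.3, 0.0, 0.0]
t = 2

# Shifting the feature map by t shifts the argmax by t (eq. 4).
assert argmax(translate(x, t)) == argmax(x) + t
```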

However, because the argmax operation is not differentiable, for neural network training we approximate it via a differentiable soft argmax function, defined in 2D as

$$\mathrm{softargmax}(x) = \Biggl( \frac{1}{I} \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \left(i + \frac{1}{2}\right) \sigma_1\!\left(\frac{x}{\Theta}\right)[i,j],\ \ \frac{1}{J} \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \left(j + \frac{1}{2}\right) \sigma_2\!\left(\frac{x}{\Theta}\right)[i,j] \Biggr) \tag{5}$$

where $\sigma$ is the softmax function defined in one dimension as $\sigma(x)[i] = \exp(x[i]) / \sum_j \exp(x[j])$, $\sigma_1(x)$ and $\sigma_2(x)$ are the softmax functions evaluated along the first and second dimensions of $x$, respectively, $\Theta$ is a temperature hyperparameter, $[i,j]$ is the image index, $I$ is the image width, $J$ is the image height, and the term $1/2$ ensures that the densities correspond to pixel centres. As the temperature $\Theta$ in eq. 5 approaches zero, $\mathrm{softargmax}$ reduces to the classical $\mathrm{argmax}$ function.
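The low-temperature limit can be illustrated with a simplified one-dimensional analogue of eq. 5. This is a sketch under stated assumptions, not the 2D function used in the method: `softargmax_1d` is a hypothetical name, a single softmax over all positions is used, and the output is normalised by the map length $I$ so that it lies in $(0, 1)$.

```python
import math


def softmax(x):
    """Numerically stable softmax of a 1D list."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def softargmax_1d(x, theta):
    """1D analogue of eq. 5: (1/I) * sum_i (i + 1/2) * softmax(x / theta)[i]."""
    size = len(x)
    p = softmax([v / theta for v in x])
    return sum((i + 0.5) * p[i] for i in range(size)) / size


# Illustrative feature map with its maximum at index 2.
x = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# Pixel-centre position of the hard argmax, in normalised coordinates.
hard = (2 + 0.5) / len(x)

# As theta decreases, the soft estimate approaches the hard argmax position.
for theta in (1.0, 0.1, 0.01):
    print(theta, softargmax_1d(x, theta), hard)
```

At higher temperatures the estimate is pulled towards the centre of mass of the whole map, which is why a small $\Theta$ is needed for the latent variable to behave like a position.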

From Latent Variables to Decoder Feature Maps.

Similar to mapping from encoder feature maps to latent variables, we would now like to relate shifts in latent variables $z$ to shifts of decoder feature maps $x$. To do this, we can invert the action of the $\mathrm{argmax}$ operation. Because $\mathrm{argmax}$ is a many-to-one function, finding an exact inverse is not possible, but we can obtain a pseudo-inverse using the Dirac delta function, defined as $\mathrm{delta}(z)[i] = \delta(i - z)$. We can show that $\mathrm{delta}$ is a pseudo-inverse of $\mathrm{argmax}$ because $\mathrm{argmax} \circ \mathrm{delta} \circ z = \{i : \delta(j-z) \leq \delta(i-z)\ \forall j\} = z$. Now, similar to the $\mathrm{argmax}$ function, we can prove that the $\mathrm{delta}$ function is equivariant to the latent variable shift $\tau'$ on the input and the feature map shift $\tau$ on the output, i.e.

$$\mathrm{delta}(\tau' \circ z)[i] = \delta(i - \tau' \circ z) = \delta(i - z - t) = \mathrm{delta}(z)[i-t] = \tau \circ \mathrm{delta}(z)[i] \tag{6}$$

where at the first equality we have used the definition of $\mathrm{delta}$, at the second equality we have used the definition of $\tau'$ (eq. 3, right), at the third equality we have used the definition of $\mathrm{delta}$, and at the last equality we have used the definition of $\tau$ (eq. 3, left).

Again, because the $\mathrm{delta}$ function is not differentiable, we can approximate it using a differentiable $\mathrm{render}$ function as

$$\mathrm{render}(z)[i] = \mathcal{N}(i - z, \sigma^2) \tag{7}$$

where $\mathcal{N}(i - z, \sigma^2)$ is a Gaussian evaluated at $i - z$ with variance given by the hyperparameter $\sigma^2$. As the variance $\sigma^2$ in eq. 7 approaches zero, the $\mathrm{render}$ function reduces to the hard $\mathrm{delta}$ function.
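The rendering step of eq. 7 and its equivariance (the Gaussian counterpart of eq. 6) can be sketched in one dimension as follows. The function names, the map size, and the use of integer pixel positions without the half-pixel offset are assumptions made for illustration only.

```python
import math


def render(z, size, sigma):
    """Eq. 7 in 1D: render(z)[i] = N(i - z; 0, sigma^2) at integer positions i."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [norm * math.exp(-0.5 * ((i - z) / sigma) ** 2) for i in range(size)]


def argmax(x):
    """Index of the maximum of a 1D feature map."""
    return max(range(len(x)), key=lambda i: x[i])


# Illustrative latent position, shift, Gaussian width, and map size.
z, t, sigma, size = 5, 3, 1.0, 16

shifted_latent = render(z + t, size, sigma)
original = render(z, size, sigma)

# Shifting the latent by t shifts the rendered map by t pixels:
# render(z + t)[i] == render(z)[i - t] wherever i - t is a valid index.
assert all(math.isclose(shifted_latent[i], original[i - t]) for i in range(t, size))

# The rendered map peaks at the latent position, so argmax recovers z.
assert argmax(original) == z
```

A smaller `sigma` concentrates the rendered map around `z`, approaching the hard delta function in the limit.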

Additionally, because the decoder is translationally equivariant, we also condition it on positional encodings of the size of the images to provide it with sufficient information to reconstruct different parts of the background, assuming the background is static. Alternatively, if the background varies, the decoder can be conditioned on a randomly-sampled nearby video frame instead, which provides information about the background but not the objects’ positions (following Jakab et al. jakab2018keypoints ).

We also note that since the latent variables z𝑧zitalic_z are ordered, this allows the encoder and decoder to learn to associate each variable with the semantics of each object and achieve successful reconstruction.

We thus now have all the elements we need to create an equivariant architecture where the encoder and decoder are defined, respectively, by

$$z = \mathrm{softargmax} \circ \psi \circ x, \quad \hat{x}_t = \phi \circ \mathrm{render} \circ z_t. \tag{8}$$

This is shown in fig. 1. Having designed an exactly translationally equivariant architecture, we can now obtain theoretical guarantees about the learned latent variables, which we discuss next. (See footnote 1.)

Footnote 1: We note that a similar architecture was proposed by Jakab et al. jakab2018keypoints , with empirical success in keypoint detection. However, we derive our architecture by enforcing strict translation equivariance properties, which makes our theoretical results possible.
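To make the encoder half of eq. 8 concrete, the sketch below composes a single fixed filter with a one-dimensional soft argmax and checks that moving an object by $t$ pixels moves the latent variable by $t/I$. This is a toy illustration under strong assumptions: the encoder $\psi$ is replaced by one hand-picked filter rather than a trained CNN, the setting is 1D, and all names and values are hypothetical.

```python
import math


def cross_correlate(x, f):
    """Stand-in for the encoder CNN psi: one circular cross-correlation layer."""
    n = len(x)
    return [sum(x[(i + k) % n] * f[k] for k in range(len(f))) for i in range(n)]


def softargmax_1d(x, theta):
    """1D soft argmax in normalised coordinates, with temperature theta."""
    size = len(x)
    m = max(x)
    w = [math.exp((v - m) / theta) for v in x]
    return sum((i + 0.5) * w[i] for i in range(size)) / (size * sum(w))


def encode(x, f, theta=0.05):
    """z = softargmax(psi(x)), the encoder part of eq. 8 (toy 1D version)."""
    return softargmax_1d(cross_correlate(x, f), theta)


def place_object(position, size=16, obj=(1.0, 2.0, 1.0)):
    """Image with a small object on an empty background (illustrative data)."""
    x = [0.0] * size
    for k, v in enumerate(obj):
        x[position + k] = v
    return x


f = [1.0, 2.0, 1.0]        # hand-picked filter matching the object
t = 4                      # object shift in pixels
z0 = encode(place_object(3), f)
z1 = encode(place_object(3 + t), f)

# The latent shifts by t / I when the object shifts by t pixels.
assert abs((z1 - z0) - t / 16) < 1e-9
```

In this toy example the latent tracks the left edge of the filter window rather than the object centre. This constant offset is an instance of the small shifts that the theoretical results quantify.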

4 Theoretical Results

In this section we present our main theorem stating the maximum bound on the position errors of the latent variables learned with our method in terms of the encoder and decoder receptive field sizes, the object size, and the Gaussian standard deviation (thm. 4.1). We continue by deriving specialised corollaries relating the maximum position error to the encoder receptive field size (cor. 4.2), decoder receptive field size (cor. 4.3), object size (cor. 4.4), and Gaussian standard deviation (cor. 4.5).

(a) Encoder Error
(b) Decoder Error
Figure 2: Position errors. (a) Maximum position error due to encoder, given by $s_\psi/2 + s_o/2 - 1$. The maximum error occurs when the encoder and the object are as far away from each other as possible while still overlapping by one pixel. (b) Maximum position error due to decoder, given by $s_\phi/2 - s_o/2 + \Delta_G$. The maximum error occurs when some part of the Gaussian at position $z + \Delta_G$ is within the decoder receptive field (RF) but is as far away from the rendered object as possible.
Theorem 4.1 (Error Bound).

Consider a set of images $x \sim X$ with objects of size $s_o$, CNN encoder $\psi$ with receptive field size $s_\psi$, CNN decoder $\phi$ with receptive field size $s_\phi$, soft argmax function $\mathrm{softargmax}$, rendering function $\mathrm{render}$ with Gaussian standard deviation $\sigma_G$ and $\Delta_G \sim \mathcal{N}(0, \sigma_G^2)$, and latent variables $z$, composed as $z = \mathrm{softargmax} \circ \psi \circ x$ and $\hat{x} = \phi \circ \mathrm{render} \circ z$ (fig. 1). Assuming (1) the objects are reconstructed at the same positions as in the original images, (2) each object appears in at least two different positions in the dataset, and (3) there are no two identical objects in any image, then the learned latent variables $z$ correspond to the true object positions up to object permutations and maximum position errors $\Delta$ of

$$\Delta = \min\left(\frac{s_\psi}{2} + \frac{s_o}{2} - 1,\ \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G\right). \tag{9}$$

For the proof, see appendix A. Intuitively, the assumptions ensure that each latent variable corresponds to the position of each object in the image. The error in the learned object positions then arises from both the encoding and decoding process. In the encoding process, the maximum error occurs when the encoder and the object are as far away from each other as possible while still overlapping, i.e. $s_\psi/2 + s_o/2 - 1$ (fig. 2(a)). Conversely, in the decoding process, the maximum error occurs when the rendered object and the latent variable are as far away from each other as possible while both still being inside the decoder receptive field, i.e. $s_\phi/2 - s_o/2$ (fig. 2(b)). Additionally, there is an extra error of $\Delta_G$ as the latent variable is rendered by a Gaussian and the decoder can capture any part of this Gaussian. Finally, because we assume each object is reconstructed at the same position as in the original image, the errors from the encoder and decoder must cancel each other out. Therefore, the overall maximum position error is given by the lower of the two expressions for the encoder and the decoder, leading to eq. 9. Next, we present corollaries relating this error bound to different factors.
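For design purposes, the bound of eq. 9 can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the sizes are arbitrary example values in pixels, and $\Delta_G$ is passed in as a fixed number (defaulting to zero) rather than sampled from $\mathcal{N}(0, \sigma_G^2)$.

```python
def encoder_bound(s_psi, s_o):
    """Encoder term of eq. 9: s_psi / 2 + s_o / 2 - 1."""
    return s_psi / 2 + s_o / 2 - 1


def decoder_bound(s_phi, s_o, delta_g=0.0):
    """Decoder term of eq. 9: s_phi / 2 - s_o / 2 + Delta_G.

    delta_g is one assumed realisation of the Gaussian term Delta_G.
    """
    return s_phi / 2 - s_o / 2 + delta_g


def max_position_error(s_psi, s_phi, s_o, delta_g=0.0):
    """Eq. 9: the smaller of the encoder and decoder terms."""
    return min(encoder_bound(s_psi, s_o), decoder_bound(s_phi, s_o, delta_g))


# Example values (hypothetical): 9 px objects and a 25 px decoder RF.
small_encoder = max_position_error(s_psi=5, s_phi=25, s_o=9)   # encoder term is smaller
large_encoder = max_position_error(s_psi=15, s_phi=25, s_o=9)  # decoder term is smaller
print(small_encoder, large_encoder)
```

With these example values the first call is limited by the encoder term and the second by the decoder term, which are the two regimes distinguished in the corollaries.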

Corollary 4.2 (Error Bound vs. Encoder RF Size).

The maximum position error as a function of the encoder receptive field (RF) size $s_\psi$ for a given $s_\phi$, $s_o$, $\sigma_G$, is

$$\Delta(s_\psi) = \begin{cases} \frac{s_\psi}{2} + \frac{s_o}{2} - 1 & \mathrm{if}\ \ 1 \leq s_\psi \leq s_\phi - 2s_o + 2, \\ \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G & \mathrm{if}\ \ s_\psi > s_\phi - 2s_o + 2. \end{cases}$$

For an illustration see fig. 3(a). There are two regions of the curve (separated by a dashed line). In the left-most region, $s_\psi < s_\phi - 2s_o + 2$, the error is dominated by the encoder error, and in the right-most region, $s_\psi \geq s_\phi - 2s_o + 2$, the error is dominated by the decoder error. Initially, for $s_\psi = 1$, the error is given by $s_o/2 - 1/2$, because the $1 \times 1$ px encoder can match any pixel that is part of the object and so can be at most half of the object size away from the true object position that is at the centre of the object. As the encoder RF size increases up to $s_\phi - 2s_o + 2$, the position error increases linearly with it as $s_\psi/2 + s_o/2 - 1$, because now any part of the encoder RF can match any part of the object (fig. 2(a)). This bound is deterministic due to the deterministic encoding process.

At $s_{\psi}=s_{\phi}-2s_{o}+2$ (vertical dashed line in fig. 3(a)), the maximum errors from encoder and decoder both become equal to $s_{\phi}/2-s_{o}/2$. For $s_{\psi}>s_{\phi}-2s_{o}+2$, the position error is dominated by the error from the decoder, which is constant at $s_{\phi}/2-s_{o}/2+\Delta_{G}$ with $\Delta_{G}\sim\mathcal{N}(0,\sigma_{G}^{2})$, and so even though the encoder RF size is increasing, this has no effect as the limiting factor is now the decoder. Due to the Gaussian rendering step in the decoding process, this bound is now probabilistic, and is distributed normally with variance $\sigma_{G}^{2}$. The results of corollary 4.2 can be extended to multiple objects with different sizes (see appendix B, cor. B.1).
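As a rough numerical illustration of corollary 4.2, the following Python sketch evaluates the piecewise bound. It is not the authors' code: the function name `bound_vs_encoder_rf` and the parameter `n_sigma` are hypothetical, and the random term $\Delta_{G}$ is summarised here by `n_sigma` standard deviations of the Gaussian, an assumption made only so that the bound becomes a single number. The example values $s_{\phi}=25$, $s_{o}=9$, $\sigma_{G}=0.8$ are those used in the synthetic experiments of sec. 5.1.

```python
def bound_vs_encoder_rf(s_psi, s_phi, s_o, sigma_g=0.0, n_sigma=0.0):
    """Piecewise bound of corollary 4.2 (all sizes in pixels).

    Assumption: the random term Delta_G ~ N(0, sigma_g^2) is replaced by
    n_sigma * sigma_g, so n_sigma = 0 returns the deterministic part only.
    """
    if s_psi < 1:
        raise ValueError("encoder RF size must be at least 1 px")
    crossover = s_phi - 2 * s_o + 2
    if s_psi <= crossover:
        # Encoder-limited region: grows linearly with the encoder RF size.
        return s_psi / 2 + s_o / 2 - 1
    # Decoder-limited region: constant in s_psi, plus the Gaussian term.
    return s_phi / 2 - s_o / 2 + n_sigma * sigma_g


# Sweep the encoder RF size with s_phi = 25, s_o = 9, sigma_G = 0.8.
for s_psi in range(1, 32, 2):
    deterministic = bound_vs_encoder_rf(s_psi, 25, 9)
    two_sd = bound_vs_encoder_rf(s_psi, 25, 9, sigma_g=0.8, n_sigma=2)
    print(f"s_psi={s_psi:2d}  bound={deterministic:4.1f}  bound+2sd={two_sd:4.1f}")
```

With these values the crossover lies at $s_{\psi}=25-18+2=9$, where the encoder and decoder terms both equal 8 px; beyond it only the Gaussian term changes the bound.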

Panels: (a) Error vs. Encoder RF. (b) Error vs. Decoder RF. (c) Error vs. Object Size. (d) Error vs. Gaussian S.D.
Figure 3: Theoretical bounds for the maximum position error as a function of the encoder receptive field size $s_{\psi}$, decoder receptive field size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, as the remaining factors are fixed. Each bound consists of a region due to the encoder error (solid line) and the decoder error (probabilistic bound). Standard deviations are represented by shades of blue.
Corollary 4.3 (Error Bound vs. Decoder RF Size).

The maximum position error as a function of the decoder receptive field (RF) size $s_{\phi}$ for a given $s_{\psi}$, $s_{o}$, $\sigma_{G}$, is

$$\Delta(s_{\phi})=\begin{cases}\frac{s_{\phi}}{2}-\frac{s_{o}}{2}+\Delta_{G}&\mathrm{if\ \ }s_{o}\leq s_{\phi}<s_{\psi}+2s_{o}-2,\\ \frac{s_{\psi}}{2}+\frac{s_{o}}{2}-1&\mathrm{if\ \ }s_{\phi}\geq s_{\psi}+2s_{o}-2.\end{cases}$$

For an illustration see fig. 3(b). Similar to corollary 4.2, there are two regions of the curve, one for $s_{\phi}<s_{\psi}+2s_{o}-2$ (left), where the error is dominated by the decoder error, and another for $s_{\phi}\geq s_{\psi}+2s_{o}-2$ (right), where the error is dominated by the encoder error. Note that this is opposite to cor. 4.2. Initially, for $s_{\phi}=s_{o}$, the decoder receptive field has the same size as the object, and so to achieve perfect reconstruction it needs to be at the same position as the object, resulting in $0$ position error plus any error $\Delta_{G}$ caused by the non-zero width of the Gaussian. As the decoder RF size increases up to $s_{\psi}+2s_{o}-2$, the position error increases linearly with it as $s_{\phi}/2-s_{o}/2+\Delta_{G}$, because now the object can be at an increasing number of positions within the decoder and still achieve perfect reconstructions (fig. 2(b)).
At $s_{\phi}=s_{\psi}+2s_{o}-2$, the maximum errors from encoder and decoder both become equal to $s_{\psi}/2+s_{o}/2-1$. For $s_{\phi}>s_{\psi}+2s_{o}-2$, the position error is dominated by the error from the encoder, which is constant at $s_{\psi}/2+s_{o}/2-1$, and so even though the decoder RF size is increasing, this has no effect as the limiting factor is now the encoder. Similar to corollary 4.2, the results of corollary 4.3 can be extended to multiple objects with different sizes (see appendix B, cor. B.2).
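The threshold separating the two cases of corollary 4.3 can be recovered by equating the decoder and encoder terms. The short derivation below is a worked step added for illustration; it uses only the deterministic parts of the two terms and ignores $\Delta_{G}$.

```latex
\begin{align*}
\frac{s_{\phi}}{2} - \frac{s_{o}}{2} &= \frac{s_{\psi}}{2} + \frac{s_{o}}{2} - 1
  && \text{(decoder term = encoder term)} \\
s_{\phi} &= s_{\psi} + 2 s_{o} - 2
  && \text{(multiply by 2, collect } s_{o}\text{)} \\
\left.\frac{s_{\phi}}{2} - \frac{s_{o}}{2}\right|_{s_{\phi} = s_{\psi} + 2 s_{o} - 2}
  &= \frac{s_{\psi}}{2} + \frac{s_{o}}{2} - 1
  && \text{(common value at the threshold)}
\end{align*}
```

For example, with $s_{\psi}=9$ and $s_{o}=9$ the threshold is $s_{\phi}=25$, and the common value of the two terms is 8 px.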

Corollary 4.4 (Error Bound vs. Object Size).

The maximum position error as a function of the object size $s_{o}$ for a given $s_{\psi}$, $s_{\phi}$, $\sigma_{G}$, is

$$\Delta(s_{o})=\begin{cases}\frac{s_{\psi}}{2}+\frac{s_{o}}{2}-1&\mathrm{if\ \ }1\leq s_{o}\leq\frac{s_{\phi}}{2}-\frac{s_{\psi}}{2}+1,\\ \frac{s_{\phi}}{2}-\frac{s_{o}}{2}+\Delta_{G}&\mathrm{if\ \ }\frac{s_{\phi}}{2}-\frac{s_{\psi}}{2}+1<s_{o}\leq s_{\phi}.\end{cases}$$

For an illustration see fig. 3(c). Again, there are two regions of the curve, one for $s_{o}<s_{\phi}/2-s_{\psi}/2+1$ (left), where the error is dominated by the encoder error, and one for $s_{o}\geq s_{\phi}/2-s_{\psi}/2+1$ (right), where the error is dominated by the decoder error. Initially, for $s_{o}=1$, the error is given by $s_{\psi}/2-1/2$, because any pixel of the encoder receptive field can match the $1\times 1$ px object and so the error can be at most half of the encoder receptive field size. As the object size increases up to $s_{\phi}/2-s_{\psi}/2+1$, the position error increases linearly with it as $s_{\psi}/2+s_{o}/2-1$, because now any part of the encoder RF can match any part of the object (fig. 2(a)).
At $s_{o}=s_{\phi}/2-s_{\psi}/2+1$, the maximum errors from encoder and decoder both become equal to $s_{\psi}/4+s_{\phi}/4-1/2$. For $s_{o}>s_{\phi}/2-s_{\psi}/2+1$, the position error is dominated by the error from the decoder and decreases linearly as $s_{\phi}/2-s_{o}/2+\Delta_{G}$, because now there is a decreasing number of positions where the object can still fit inside the decoder receptive field (fig. 2(b)). At $s_{o}=s_{\phi}$, the object reaches the same size as the decoder, and thus the position error decreases to $0$ with an additional error due to the width of the Gaussian, $\Delta_{G}$. Interestingly, the triangular shape of the error curve means that small and large objects will both incur small position errors, while medium sized objects will incur higher errors.
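The triangular shape described for corollary 4.4 can be checked numerically. The sketch below is illustrative only: the function name `bound_vs_object_size` and the parameter `n_sigma` are hypothetical, the Gaussian term $\Delta_{G}$ is again replaced by `n_sigma` standard deviations (an assumption), and the values $s_{\psi}=9$, $s_{\phi}=25$ are taken from the synthetic experiments in sec. 5.1.

```python
def bound_vs_object_size(s_o, s_psi, s_phi, sigma_g=0.0, n_sigma=0.0):
    """Piecewise bound of corollary 4.4 (all sizes in pixels).

    Assumption: Delta_G is summarised as n_sigma * sigma_g.
    """
    if not 1 <= s_o <= s_phi:
        raise ValueError("corollary 4.4 covers 1 <= s_o <= s_phi only")
    crossover = s_phi / 2 - s_psi / 2 + 1
    if s_o <= crossover:
        # Encoder-limited: the bound grows with the object size.
        return s_psi / 2 + s_o / 2 - 1
    # Decoder-limited: the bound shrinks as the object fills the decoder RF.
    return s_phi / 2 - s_o / 2 + n_sigma * sigma_g


sizes = list(range(1, 26, 2))
bounds = [bound_vs_object_size(s, s_psi=9, s_phi=25) for s in sizes]
peak_size = sizes[bounds.index(max(bounds))]
print("object sizes:", sizes)
print("bounds      :", bounds)
print("largest bound", max(bounds), "px at object size", peak_size, "px")
```

For these settings the deterministic bound rises from 4 px at $s_{o}=1$ to 8 px at $s_{o}=9$, which matches $s_{\psi}/4+s_{\phi}/4-1/2$, and then falls to 0 px at $s_{o}=s_{\phi}=25$.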

Corollary 4.5 (Error Bound vs. Gaussian Size).

The maximum position error as a function of the Gaussian standard deviation $\sigma_{G}$ for a given $s_{\psi}$, $s_{\phi}$, $s_{o}$, is

$$\Delta(\sigma_{G})=\begin{cases}\frac{s_{\phi}}{2}-\frac{s_{o}}{2}+\Delta_{G}&\mathrm{if\ \ }\sigma_{G}<\frac{s_{\psi}}{2}-\frac{s_{\phi}}{2}+s_{o}-1,\\ \frac{s_{\psi}}{2}+\frac{s_{o}}{2}-1&\mathrm{if\ \ }\sigma_{G}\geq\frac{s_{\psi}}{2}-\frac{s_{\phi}}{2}+s_{o}-1.\end{cases}$$

For an illustration see fig. 3(d). Firstly, there is an overall maximum bound for the position error due to the encoder, given by the constant $s_{\psi}/2+s_{o}/2-1$, which is independent of the Gaussian standard deviation. Then, initially for $\sigma_{G}=0$, the rendered Gaussian is effectively a delta function, and so the position error is dominated by the decoder error given by $s_{\phi}/2-s_{o}/2$, which describes the maximum distance between the object and the delta function with both of them fitting inside the decoder receptive field (fig. 3(b)). As the Gaussian standard deviation increases, the position error increases linearly as $s_{\phi}/2-s_{o}/2+\Delta_{G}$ with $\Delta_{G}\sim\mathcal{N}(0,\sigma_{G}^{2})$. Then, depending on what part of the Gaussian the decoder is convolved with, there are different bounds for the maximum position error.
If the decoder is convolved with a part of the Gaussian that is within $n$ standard deviations of its centre, the maximum position error increases linearly as $s_{\phi}/2-s_{o}/2+n\sigma_{G}$ up to $\sigma_{G}=(s_{\psi}/2-s_{\phi}/2+s_{o}-1)/n$, after which point the position error becomes dominated by the encoder value of $s_{\psi}/2+s_{o}/2-1$. In fig. 3(d), the maximum position error bound when the decoder is convolved with a part of the Gaussian within its first and second standard deviations is denoted by darker and lighter shades of blue, respectively.
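The $n$-standard-deviation reading of corollary 4.5 can be sketched in a few lines of Python. The function names `bound_vs_gaussian_sd` and `saturation_sd` are hypothetical and introduced only for illustration; the values $s_{\psi}=9$, $s_{\phi}=11$, $s_{o}=7$ are the ones used for the Gaussian experiment in sec. 5.1.

```python
def saturation_sd(s_psi, s_phi, s_o, n=1):
    """Gaussian s.d. at which the n-sigma decoder bound meets the encoder bound."""
    return (s_psi / 2 - s_phi / 2 + s_o - 1) / n


def bound_vs_gaussian_sd(sigma_g, s_psi, s_phi, s_o, n=1):
    """Bound of corollary 4.5 with Delta_G replaced by n * sigma_g."""
    if sigma_g < 0:
        raise ValueError("standard deviation must be non-negative")
    if sigma_g < saturation_sd(s_psi, s_phi, s_o, n):
        # Decoder-limited: linear growth with the Gaussian width.
        return s_phi / 2 - s_o / 2 + n * sigma_g
    # Encoder-limited: constant, independent of the Gaussian width.
    return s_psi / 2 + s_o / 2 - 1


for n in (1, 2, 3, 4):
    threshold = saturation_sd(9, 11, 7, n)
    print(f"n={n}: encoder bound takes over at sigma_G = {threshold:.3f}")
```

For these settings the decoder term starts at 2 px for $\sigma_{G}=0$ and the encoder bound is 7 px, so the one-standard-deviation bound saturates at $\sigma_{G}=5$ and the four-standard-deviation bound at $\sigma_{G}=1.25$.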

5 Experimental Results

In this section we validate our theoretical results on synthetic experiments (sec. 5.1) and CLEVR data (sec. 5.2). Additional real video experiments are in app. E. We first validate corollaries 4.2-4.5 via synthetic experiments in sec. 5.1, demonstrating very high agreement up to sizes of individual pixels. We then apply our method to CLEVR-based johnson2017clevr data containing multiple objects of different sizes in varying scenes (sec. 5.2) and show that, compared to current SOTA object detection methods (SAM kirillov2023sam , CutLER wang2023cutler ), only our method predicts positions within theoretical bounds.

5.1 Synthetic Experiments

Panels: (a) Error vs. Encoder RF. (b) Error vs. Decoder RF. (c) Error vs. Object Size. (d) Error vs. Gaussian S.D.
Figure 4: Synthetic experiment results showing position error as a function of the encoder receptive field size $s_{\psi}$, decoder receptive field size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, as the remaining factors are fixed to $s_{\psi}=9,s_{\phi}=25,s_{o}=9,\sigma_{G}=0.8$ (in a,b,c) or to $s_{\psi}=9,s_{\phi}=11,s_{o}=7$ (in d). Theoretical bounds are denoted by a blue line (with 4 shaded regions denoting 1 to 4 standard deviations of the probabilistic bound) and experimental results by red dots.

In this section we validate corollaries 4.2-4.5 via synthetic experiments. Our dataset consists of a small white square on a black background. In each experiment we fix all but one of the encoder RF size $s_{\psi}$, decoder RF size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, and vary the remaining variable. We perform each experiment 20 times, corresponding to 20 random initializations of the trained parameters, and record the position error $\Delta$ as the difference between the predicted object position $z$ and the ground truth object position $z_{GT}$. For more details see appendix C.1.
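A minimal sketch of how such a sample and its position error could be produced is given below. It is not the setup of appendix C.1: the 32 x 32 image size, the pixel-index convention for the square's centre, the per-axis maximum used as the error norm, and the names `make_square_image` and `position_error` are all assumptions made for illustration.

```python
import random


def make_square_image(height, width, top, left, s_o):
    """Return (image, centre) for a black image with one white s_o x s_o square.

    The image is a list of rows with values 0.0 (black) or 1.0 (white).
    The centre is given in (row, column) pixel-index coordinates (assumed).
    """
    if top < 0 or left < 0 or top + s_o > height or left + s_o > width:
        raise ValueError("square does not fit inside the image")
    image = [[0.0] * width for _ in range(height)]
    for y in range(top, top + s_o):
        for x in range(left, left + s_o):
            image[y][x] = 1.0
    centre = (top + (s_o - 1) / 2, left + (s_o - 1) / 2)
    return image, centre


def position_error(z, z_gt):
    """Largest per-axis absolute difference (a hypothetical choice of norm)."""
    return max(abs(a - b) for a, b in zip(z, z_gt))


rng = random.Random(0)  # fixed seed so the sample is reproducible
s_o = 9
top = rng.randint(0, 32 - s_o)   # randint is inclusive at both ends
left = rng.randint(0, 32 - s_o)
image, z_gt = make_square_image(32, 32, top, left, s_o)
print("ground-truth centre:", z_gt)
```

A trained model's predicted position `z` would then be compared with `z_gt` through `position_error(z, z_gt)` and plotted against the corresponding theoretical bound.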

Position Error vs. Encoder RF Size.

In this experiment we aim to empirically validate corollary 4.2, by measuring the experimental position errors as a function of the encoder receptive field size. We vary the encoder RF sizes $s_{\psi}\in\{1,3,\ldots,31\}$ and fix $s_{\phi}=25,s_{o}=9,\sigma_{G}=0.8$ and record position errors $\Delta$. We visualise the data points (red) and the theoretical bounds (blue) in fig. 4(a). We can observe that all the data points lie at or below the theoretical boundary, which validates corollary 4.2. In particular, we observe that the deterministic boundary in the region to the left of the dashed line (corresponding to the encoder bound) is well respected, with some of the trained networks achieving exactly the maximum error predicted by theory.

Position Error vs. Decoder RF Size.

In this experiment we aim to validate corollary 4.3 by measuring the experimental position errors as a function of the decoder receptive field size. We vary the decoder RF sizes $s_{\phi}\in\{1,3,\ldots,31\}$ and fix $s_{\psi}=9,s_{o}=9,\sigma_{G}=0.8$ and record position errors $\Delta$. We visualise the results in fig. 4(b). The figure shows the theory to be a strong fit to the data, validating corollary 4.3. In particular, we note that the data points fit the Gaussian distribution in the decoder part of the curve (left of the dashed line) and are very close to (1 px below) the deterministic upper bound in the encoder part of the curve (right).

Position Error vs. Object Size.

In this experiment we aim to validate corollary 4.4 by measuring the experimental position errors as a function of the object size. We vary the object sizes $s_{o}\in\{1,3,\ldots,25\}$ and fix $s_{\psi}=9,s_{\phi}=25,\sigma_{G}=0.8$ and record position errors $\Delta$. We visualise the results in fig. 4(c). As all the data points lie at or below the theoretical boundary, this validates corollary 4.4. We note that the empirical distribution of errors follows very closely the shape of the theoretical bound, very strictly on the left side of the dashed line (encoder bound) and according to the distribution predicted on the right side (decoder bound).

Position Error vs. Gaussian Size.

In this experiment we aim to validate corollary 4.5 by measuring the experimental position errors as a function of the Gaussian standard deviation. We vary the Gaussian standard deviations $\sigma_{G}\in\{0.1,0.2,\ldots,2.1,2.25,2.5,\ldots,5\}$, fixing $s_{\psi}=9,s_{\phi}=11,s_{o}=7$ and record position errors $\Delta$. We visualise the data points (red) and the theoretical bounds (blue) in fig. 4(d). As all the data points lie at or below the theoretical boundary, this validates corollary 4.5. In particular, we note that all the data points lie below the encoder bound (solid blue line), and all the data points lie within the bound denoted by four standard deviations away from the Gaussian. This means that in practice, the decoder can be convolved with any part of the Gaussian that lies within 4 standard deviations (corresponding to 3.2 px) from its centre. We also note that as the Gaussian standard deviation increases, the position error increases as expected, denoted by the positive slope of the data points between the third and fourth standard deviations (lightest shade of blue).

Panels: (a) Error vs. Encoder RF. (b) Error vs. Decoder RF. (c) Error vs. Object Size. (d) Error vs. Gaussian S.D.
Figure 5: CLEVR experiment results showing position error as a function of the encoder receptive field size $s_{\psi}$, decoder receptive field size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, as the remaining factors are fixed to $s_{\psi}=9,s_{\phi}=25,s_{o}\in[6,10],\sigma_{G}=0.8$ for (a)-(c) and to $s_{\psi}=5,s_{\phi}=13,s_{o}\in[6,10]$ for (d). Theoretical bounds are shown in blue, experimental results in red, the SAM baseline in green, and the CutLER baseline in orange.

5.2 CLEVR Experiments

In this section we validate our theory on CLEVR-based johnson2017clevr image data of 3D scenes. Our dataset consists of 3 spheres of different colours at random positions, with a range of sizes due to perspective distortion. We train and evaluate each experiment similarly to those in sec. 5.1, recording position errors for the learned objects, and compare our results with SAM kirillov2023sam and CutLER wang2023cutler baselines. We compute the theoretical bounds according to our theory in sec. 4 and app. B, and visualise the results in figs. 5(a)-5(c). For details see app. C.2. For experiments with different shapes see app. D.

Once again, the experimental results demonstrate high agreement with our theory, now even for more complex images with multiple objects and a range of object sizes (fig. 5, red, blue). Furthermore, while the SAM and CutLER baselines generally achieve low position errors, this is not guaranteed, and in some cases their errors are much higher than our bound (fig. 5, green, orange). We report the proportion of position errors from fig. 5(c) that lie within 2 standard deviations of our theoretical bound in table 6(b) and fig. 6(a), showing that compared to SOTA object detection methods, only for our method are the position errors always guaranteed to be within our theoretical bound.
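The "errors within bound" statistic can be computed as in the sketch below. The function name `percent_within_bound` is hypothetical, and the numbers in the example are made-up placeholders rather than results from the paper; each entry of `bounds` is assumed to already include the 2-standard-deviation allowance for the corresponding object.

```python
def percent_within_bound(errors, bounds):
    """Percentage of position errors whose magnitude does not exceed its bound.

    errors: measured position errors in px, one per detected object.
    bounds: theoretical bound in px for each object (assumed to include
            the 2-standard-deviation Gaussian term where applicable).
    """
    if len(errors) != len(bounds):
        raise ValueError("errors and bounds must have the same length")
    if not errors:
        raise ValueError("at least one error is required")
    hits = sum(1 for e, b in zip(errors, bounds) if abs(e) <= b)
    return 100.0 * hits / len(errors)


# Placeholder values for illustration only (not taken from the experiments).
example_errors = [1.0, 3.5, 9.0, 0.2]
example_bounds = [4.0, 4.0, 4.0, 4.0]
print(f"{percent_within_bound(example_errors, example_bounds):.1f}% within bound")
```

Grouping the objects by size before calling the function would give one percentage per size bin.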

(a) Results visualisation.

Errors Within Bound (%), by Object Size ± 1.5 (px)

Method    All     9       12      15      18      21      24
Ours      100.0   100.0   100.0   100.0   100.0   100.0   100.0
CutLER    78.4    100.0   100.0   75.0    57.1    83.3    37.5
SAM       43.5    83.3    60.0    54.5    23.1    0.0     16.7

(b) Results table.
Figure 6: Proportion of position errors within 2 standard deviations of the theoretical bound (%), reported for different object sizes and methods. Results from table (b) are visualised in plot (a).

6 Discussion

In light of our theoretical results, in this section we present some conclusions that can be drawn when designing new unsupervised object detection methods:

  1. If the size of the objects that will be detected is known, to minimise the error on the learned object positions, one should aim to design the decoder receptive field size to be as small as possible while still encompassing the object. As the decoder RF grows beyond the object size, the error bound increases linearly with it up to a certain point (fig. 3(b)).

  2. To minimise the error stemming from the encoder for a given object size, the encoder RF size should be kept as small as possible while still detecting the object (the RF size may be smaller than the object size), as again the error bound grows linearly with it up to a certain point (fig. 3(a)).

  3. To minimise the error, the width of the rendering Gaussian should be kept as small as possible while still permitting gradient flow, as increasing it even slightly may result in a dramatic increase to the decoder term of the position error (fig. 3(d)). This is because, in practice, the decoder is able to detect parts of the Gaussian that are even 4 standard deviations away from its centre (fig. 4(d)).

  4. In the case that one does not know a priori the exact size of the objects to be detected, one can still design a network that minimises the position errors for a given range of sizes. In that case, one should set up the decoder RF size to be as close as possible to the size of the largest object, and keep the encoder RF size as small as possible while still detecting all objects. The position errors for different object sizes will then be distributed according to the curve in fig. 3(c), where the smallest and largest objects will achieve lowest errors and medium-size objects will achieve the greatest error, approximately given by a half of the average of the encoder and decoder RF sizes.
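The design guidelines above can be explored numerically. The following is a minimal sketch, not the authors' code, that evaluates the bound of theorem 4.1 over a range of object sizes for one candidate design. The function names and the values s_psi = 5, s_phi = 25 and object sizes 9 to 24 px are hypothetical choices made for illustration, and delta_g stands for a fixed realisation of the Gaussian term (set to 0 here).

```python
def position_error_bound(s_psi, s_phi, s_o, delta_g=0.0):
    """Bound of theorem 4.1: the minimum of the encoder and decoder terms."""
    encoder_term = s_psi / 2 + s_o / 2 - 1
    decoder_term = s_phi / 2 - s_o / 2 + delta_g
    return min(encoder_term, decoder_term)


def worst_case_over_sizes(s_psi, s_phi, object_sizes, delta_g=0.0):
    """Return the object size with the largest bound, that bound, and all bounds."""
    bounds = {
        s_o: position_error_bound(s_psi, s_phi, s_o, delta_g)
        for s_o in object_sizes
    }
    worst_size = max(bounds, key=bounds.get)
    return worst_size, bounds[worst_size], bounds


# Hypothetical design: small encoder RF, decoder RF just above the largest object.
worst_size, worst_bound, per_size = worst_case_over_sizes(
    s_psi=5, s_phi=25, object_sizes=range(9, 25)
)
print(worst_size, worst_bound)      # medium-size object with the largest bound
print(per_size[9], per_size[24])    # bounds for the smallest and largest objects
```

For this design the largest bound occurs at a medium object size, while the smallest and largest objects receive lower bounds, in line with guideline 4.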

Finally, we discuss some limitations of our method. Firstly, the method can only detect dynamic objects, for example if they move in a video or if they appear at multiple locations in images. Secondly, in its current form the method learns representations that cannot be used for videos with backgrounds different from the one used at training time; however, this can be overcome by conditioning the decoder on an unrelated video frame instead of the positional encodings, as in Jakab et al. jakab2018keypoints . Thirdly, the guarantees of our method are conditional on the images being successfully reconstructed, which depends on the network architecture and optimisation method.

7 Conclusion

We have presented the first unsupervised object detection method that is provably guaranteed to recover the true object positions up to small shifts. We proved that the object positions are learned up to a maximum error related to the encoder and decoder receptive field sizes, the object sizes, and the bandwidth of the Gaussians used to render the objects. We derived expressions for how the position error depends on each of these factors and performed synthetic experiments that validated our theory up to sizes of individual pixels. We then performed experiments on CLEVR-based data, showing that unlike current SOTA methods, the position errors of our method are always guaranteed to be within our theoretical bounds. We hope our work will provide a starting point for more research into object detection methods that possess theoretical guarantees, which are lacking in current practice.

Acknowledgements.

The authors acknowledge the generous support of the Royal Academy of Engineering (RF\201819\18\163), the Royal Society (RG\R1\241385) and EPSRC (VisualAI, EP/T028572/1).

References

  • [1] Johann Brehmer, Pim De Haan, Phillip Lippe, and Taco Cohen. Weakly supervised causal representation learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [2] Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation, 2019.
  • [3] Abdelkader Dairi, Fouzi Harrou, Mohamed Senouci, and Ying Sun. Unsupervised obstacle detection in driving environments using deep-learning-based stereovision. Robotics and Autonomous Systems, 100:287–301, 2018.
  • [4] Eran Goldman, Roei Herzig, Aviv Eisenschtat, Jacob Goldberger, and Tal Hassner. Precise detection in densely packed scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5227–5236, 2019.
  • [5] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International conference on machine learning, pages 2424–2433. PMLR, 2019.
  • [6] Luigi Gresele, Julius von Kügelgen, Vincent Stimper, Bernhard Schölkopf, and Michel Besserve. Independent mechanism analysis, a new concept? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 28233–28248. Curran Associates, Inc., 2021.
  • [7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [8] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [9] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [10] Aapo Hyvarinen and Hiroshi Morioka. Nonlinear ICA of Temporally Dependent Stationary Sources. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 460–469. PMLR, 20–22 Apr 2017.
  • [11] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [12] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
  • [13] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [14] Sebastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, Rémi LE PRIOL, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In First Conference on Causal Learning and Reasoning, 2022.
  • [15] Phillip Lippe, Sara Magliacane, Sindy Löwe, Yuki M Asano, Taco Cohen, and Stratis Gavves. CITRIS: Causal identifiability from temporal intervened sequences. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13557–13603. PMLR, 17–23 Jul 2022.
  • [16] Francesco Locatello, Ben Poole, Gunnar Raetsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6348–6359. PMLR, 13–18 Jul 2020.
  • [17] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. Advances in neural information processing systems, 33:11525–11538, 2020.
  • [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • [19] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pages 146–157. Springer, 2017.
  • [20] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(72):2003–2030, 2006.
  • [21] Oriane Siméoni, Chloé Sekkat, Gilles Puy, Antonín Vobeckỳ, Éloi Zablocki, and Patrick Pérez. Unsupervised object localization: Observing the background to discover objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3176–3186, 2023.
  • [22] Oriane Siméoni, Éloi Zablocki, Spyros Gidaris, Gilles Puy, and Patrick Pérez. Unsupervised object localization in the era of self-supervised vits: A survey. International Journal of Computer Vision, pages 1–28, 2024.
  • [23] Xudong Wang, Rohit Girdhar, Stella X Yu, and Ishan Misra. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3124–3134, 2023.
  • [24] Quanhan Xi and Benjamin Bloem-Reddy. Indeterminacy in generative models: Characterization and strong identifiability. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6912–6939. PMLR, 25–27 Apr 2023.

Appendix A Proof of Theorem 4.1

Theorem (Maximum Position Error).

Consider a set of images $x \sim X$ with objects of size $s_o$, CNN encoder $\psi$ with receptive field size $s_\psi$, CNN decoder $\phi$ with receptive field size $s_\phi$, soft argmax function $\mathrm{softargmax}$, rendering function $\mathrm{render}$ with Gaussian standard deviation $\sigma_G$ and $\Delta_G \sim \mathcal{N}(0, \sigma_G^2)$, and latent variables $z$, composed as $z = \mathrm{softargmax} \circ \psi \circ x$ and $\hat{x} = \phi \circ \mathrm{render} \circ z$ (fig. 1). Assuming (1) the objects are reconstructed at the same positions as in the original images, (2) each object appears in at least two different positions in the dataset, and (3) there are no two identical objects in any image, then the learned latent variables $z$ correspond to the true object positions up to object permutations and maximum position errors $\Delta$ of

$$\Delta = \min\left(\frac{s_\psi}{2} + \frac{s_o}{2} - 1,\ \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G\right). \tag{10}$$
Proof.

By assumption (1), the positions $(z_1, z_2)$ of the objects in the original image $x$ have to be the same as the positions $(\hat{z}_1, \hat{z}_2)$ of the objects in the reconstructed image $\hat{x}$. In practice, this occurs whenever the reconstruction loss is minimised.

By assumption (2) (each object appears at a minimum of 2 different positions), the latent variables used by the decoder have to contain some information about each object, and thus the encoder has to learn to match all the objects. This is because the decoder CNN $\phi$ takes as its input the rendered Gaussians $\hat{e} = \mathrm{render} \circ z$ concatenated with positional encodings or a randomly sampled nearby frame (fig. 1), and if some object in the dataset only appeared at a single position the model could achieve perfect reconstruction solely by using the positional encodings or the conditioned image without having to use the Gaussian maps $\hat{e}$. However, because the dataset contains each object at a minimum of 2 positions, relying purely on positional encodings or on the conditioned image is now not sufficient, as without any information about the object passed to the decoder, it would be impossible for it to predict where to render the object. More formally, because $\hat{x} = \phi \circ \mathrm{render} \circ \mathrm{softargmax} \circ \psi \circ x$, this means that for objects in $x$ to be reconstructed at the same positions as in $\hat{x}$ (assumption 1), the encoder $\psi$ needs to match some part of each object in $x$.

Because $z = \mathrm{softargmax} \circ \psi \circ x$ is equivariant to translations of $x$ (sec. 3), and because the encoder $\psi$ has to match some part of each object in $x$ (as shown previously), and also each image consists of distinct objects (assumption 3) on a known background, the image $x$ with an object at the position $(u_1, u_2)$ is encoded by $\mathrm{softargmax} \circ \psi$ to the latent variables

$$(z_1, z_2) = (u_1 + \Delta_{\psi 1},\ u_2 + \Delta_{\psi 2}), \quad |\Delta_{\psi 1}|, |\Delta_{\psi 2}| \leq \frac{s_\psi}{2} + \frac{s_o}{2} - 1 \tag{11}$$

where the shifts $\Delta_{\psi 1}$ and $\Delta_{\psi 2}$ arise because any part of the encoder filter (of size $s_\psi$) can match any part of the object (of size $s_o$). See fig. 2(a) for an illustration.

Next, because $\hat{x} = \phi \circ \mathrm{render} \circ z$ is equivariant to translations of $z$ (sec. 3), the latent variables $z = (u_1 + \Delta_{\psi 1}, u_2 + \Delta_{\psi 2})$ are mapped to a predicted image $\hat{x} = \phi \circ \mathrm{render} \circ z$ with an object at position

$$(\hat{z}_1, \hat{z}_2) = (u_1 + \Delta_{\psi 1} + \Delta_{\phi 1},\ u_2 + \Delta_{\psi 2} + \Delta_{\phi 2}), \quad |\Delta_{\phi 1}|, |\Delta_{\phi 2}| \leq \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G \tag{12}$$

where the shifts $\Delta_{\phi 1}$ and $\Delta_{\phi 2}$ arise because any part of the decoder filter (of size $s_\phi$) can match any part of the rendered Gaussian $\hat{e}_t = \mathrm{render} \circ z$, where $\Delta_G \sim \mathcal{N}(0, \sigma_G^2)$. See fig. 2(b) for an illustration.

Finally, by assumption (1), the position of each object in the original image $x$ has to be equal to the position of the object in the reconstructed image, i.e.

$$(u_1, u_2) = (u_1 + \Delta_{\psi 1} + \Delta_{\phi 1},\ u_2 + \Delta_{\psi 2} + \Delta_{\phi 2}) \tag{13}$$

This results in the conditions

$$\Delta_{\psi 1} + \Delta_{\phi 1} = 0, \quad \Delta_{\psi 2} + \Delta_{\phi 2} = 0 \tag{14}$$

and therefore

$$|\Delta_{\psi 1}| = |\Delta_{\phi 1}|, \quad |\Delta_{\psi 2}| = |\Delta_{\phi 2}| \tag{15}$$

In words, the shift in the latent variables acquired from the encoder $\Delta_\psi$ has to be balanced by an opposite shift of the same magnitude in the decoder $\Delta_\phi$ in order to reconstruct the object at the same position. Because the shift due to the encoder has a maximum magnitude of $s_\psi/2 + s_o/2 - 1$ and the shift due to the decoder has a maximum magnitude of $s_\phi/2 - s_o/2 + \Delta_G$, this means that the maximum magnitude of the shift of the latent variables has to be the minimum of these two expressions, i.e. the learned latent variables $(z_1, z_2)$ correspond to the ground truth latent variables $(u_1, u_2)$ up to

$$(z_1, z_2) = (u_1 + \Delta_1,\ u_2 + \Delta_2), \quad |\Delta_1|, |\Delta_2| \leq \min\left(\frac{s_\psi}{2} + \frac{s_o}{2} - 1,\ \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G\right). \tag{16}$$

Additionally, because the order in which the objects get mapped to each latent variable is arbitrary, there is an additional indeterminacy arising due to variable permutations. ∎
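
As a concrete illustration of eq. (16), consider hypothetical values that are not taken from the experiments: $s_\psi = 5$, $s_\phi = 25$, $s_o = 9$, and a realisation $\Delta_G = 0$ of the Gaussian term. Then

$$\frac{s_\psi}{2} + \frac{s_o}{2} - 1 = 6, \qquad \frac{s_\phi}{2} - \frac{s_o}{2} + \Delta_G = 8, \qquad |\Delta_1|, |\Delta_2| \leq \min(6, 8) = 6 \text{ px}.$$

In this example the encoder term is the binding one: although the decoder could absorb a shift of up to 8 px, the encoder can only introduce a shift of up to 6 px, and by eq. (14) the two shifts must cancel.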

Appendix B Theoretical Results for Multiple Object Sizes

The results of corollary 4.2 can be extended to objects with a range of different sizes. The bound can be obtained by taking the maximum over all the bounds for objects with sizes $s_o \in [s_o^{min}, s_o^{max}]$, leading to the following corollary.

Corollary B.1 (Error vs. Encoder RF Size for Multiple Object Sizes).

The maximum position error as a function of the encoder receptive field (RF) size $s_\psi$ for a given $s_\phi$, $s_o \in [s_o^{min}, s_o^{max}]$, $\sigma_G$, is

$$\Delta(s_\psi) = \begin{cases} \frac{s_\psi}{2} + \frac{s_o^{max}}{2} - 1 & \mathrm{if\ \ } 1 \leq s_\psi \leq s_\phi - 2 s_o^{max} + 2, \\ \frac{s_\psi}{4} + \frac{s_\phi}{4} - \frac{1}{2} + \Delta_G & \mathrm{if\ \ } s_\phi - 2 s_o^{max} + 2 < s_\psi \leq s_\phi - 2 s_o^{min} + 2, \\ \frac{s_\phi}{2} - \frac{s_o^{min}}{2} + \Delta_G & \mathrm{if\ \ } s_\psi > s_\phi - 2 s_o^{min} + 2. \end{cases}$$

For an illustration see fig. 7(a).
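
The piecewise expression of corollary B.1 can be checked against a direct maximisation of the theorem 4.1 bound over the object-size range. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names and the numeric values are hypothetical, the Gaussian term is set to 0 so that only the deterministic part is compared, and the object size is treated as continuous.

```python
import math


def bound_single(s_psi, s_phi, s_o, delta_g=0.0):
    """Theorem 4.1 bound for a single object size."""
    encoder_term = s_psi / 2 + s_o / 2 - 1
    decoder_term = s_phi / 2 - s_o / 2 + delta_g
    return min(encoder_term, decoder_term)


def bound_b1(s_psi, s_phi, s_o_min, s_o_max, delta_g=0.0):
    """Piecewise bound of corollary B.1 as a function of the encoder RF size."""
    if s_psi <= s_phi - 2 * s_o_max + 2:
        return s_psi / 2 + s_o_max / 2 - 1
    if s_psi <= s_phi - 2 * s_o_min + 2:
        return s_psi / 4 + s_phi / 4 - 0.5 + delta_g
    return s_phi / 2 - s_o_min / 2 + delta_g


def bound_by_maximisation(s_psi, s_phi, s_o_min, s_o_max):
    """Maximum over s_o in [s_o_min, s_o_max] of the single-size bound (delta_g = 0).

    The encoder term increases with s_o and the decoder term decreases, so the
    maximum of their minimum lies at their crossing point, clamped to the range.
    """
    s_cross = (s_phi - s_psi) / 2 + 1
    s_best = min(max(s_cross, s_o_min), s_o_max)
    return bound_single(s_psi, s_phi, s_best)


# Hypothetical setting: decoder RF of 41 px, object sizes between 9 and 15 px.
for s_psi in (1, 20, 30):
    print(s_psi, bound_b1(s_psi, 41, 9, 15), bound_by_maximisation(s_psi, 41, 9, 15))
```

Under these assumptions the two computations agree in all three regimes, which is what taking the maximum over object sizes should give.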

Similar to corollary 4.2, the results of corollary 4.3 can be extended to objects with a range of different sizes by taking the maximum over the bounds for objects with sizes $s_o \in [s_o^{min}, s_o^{max}]$, leading to the following corollary.

Corollary B.2 (Error vs. Decoder RF Size for Multiple Object Sizes).

The maximum position error as a function of the decoder receptive field (RF) size $s_\phi$ for a given $s_\psi$, $s_o \in [s_o^{min}, s_o^{max}]$, $\sigma_G$, is

$$\Delta(s_\phi) = \begin{cases} \frac{s_\phi}{2} - \frac{s_o^{min}}{2} + \Delta_G & \mathrm{if\ \ } s_o^{min} \leq s_\phi \leq s_\psi + 2 s_o^{min} - 2, \\ \frac{s_\psi}{4} + \frac{s_\phi}{4} - \frac{1}{2} + \Delta_G & \mathrm{if\ \ } s_\psi + 2 s_o^{min} - 2 < s_\phi \leq s_\psi + 2 s_o^{max} - 2, \\ \frac{s_\psi}{2} + \frac{s_o^{max}}{2} - 1 & \mathrm{if\ \ } s_\phi > s_\psi + 2 s_o^{max} - 2. \end{cases}$$

For an illustration see fig. 7(b).
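The piecewise bound above can be transcribed into a few lines of Python for sanity checks. This is a minimal sketch, not the authors' code: the function and argument names are hypothetical, and the Gaussian term $\Delta_G$ is passed in as a number because it is defined elsewhere in the paper.

```python
def decoder_position_bound(s_phi, s_psi, s_o_min, s_o_max, delta_g):
    """Evaluate the piecewise bound Delta(s_phi) from the equation above.

    All sizes are in pixels. `delta_g` is the Gaussian contribution Delta_G,
    supplied by the caller (assumption: it has been computed separately).
    """
    if s_phi < s_o_min:
        # The equation is only stated for s_phi >= s_o_min.
        raise ValueError("bound is not defined for s_phi < s_o_min")
    if s_phi <= s_psi + 2 * s_o_min - 2:
        # First case: error grows with s_phi at slope 1/2.
        return s_phi / 2 - s_o_min / 2 + delta_g
    if s_phi <= s_psi + 2 * s_o_max - 2:
        # Second case: error grows with s_phi at slope 1/4.
        return s_psi / 4 + s_phi / 4 - 0.5 + delta_g
    # Third case: the bound no longer depends on s_phi (and has no Delta_G term).
    return s_psi / 2 + s_o_max / 2 - 1
```

With this transcription it is easy to check that the first two cases agree at the boundary $s_{\phi}=s_{\psi}+2s_{o}^{min}-2$, where both evaluate to $\frac{s_{\psi}}{2}+\frac{s_{o}^{min}}{2}-1+\Delta_{G}$.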

(a) Position error vs. encoder RF size for a range of object sizes.
(b) Position error vs. decoder RF size for a range of object sizes.
Figure 7: Theoretical bounds for the maximum position error as a function of the encoder receptive field size $s_{\psi}$ and the decoder receptive field size $s_{\phi}$ for objects of sizes ranging from $s_{o}^{min}$ to $s_{o}^{max}$, with the remaining factors kept constant. Each theoretical bound consists of two regions separated by a dashed line, one where the maximum error is due to the encoder (deterministic, represented by a solid line), and one where the maximum error is due to the decoder (probabilistic, represented by a solid line and shaded regions). Areas within one and two standard deviations of the mean are represented by darker and lighter shades of blue respectively.

Appendix C Experiment Training Details

C.1 Synthetic Experiments

Dataset.

In all of the synthetic experiments, we use the following setup. The dataset consists of black images of size $s_{img}\times s_{img}=80\times 80$ px, each with a white square with dimensions $s_{o}\times s_{o}$ px, centered at positions $(x,y)$ where $x,y\in\{s_{pad}+s_{o}/2,\ s_{pad}+s_{o}/2+1,\ \ldots,\ s_{img}-s_{pad}-s_{o}/2\}$, where $s_{pad}\geq\max(s_{\psi},s_{\phi})-1$ (to prevent unwanted edge effects). We divide the dataset into 4 quadrants and assign images from 3 quadrants to the training set and the remaining quadrant to the test set.
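The dataset construction described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' script: the function names are hypothetical, the object size is assumed odd so that the centre falls on a pixel (with $s_o/2$ rounded down), and the quadrants are assumed to be defined by the position of the object centre, with the bottom-right quadrant held out by default.

```python
import numpy as np

def make_square_dataset(s_img=80, s_o=7, s_pad=10):
    """Render one white s_o x s_o square per image at every allowed centre."""
    if s_o % 2 == 0:
        raise ValueError("this sketch assumes an odd object size")
    half = s_o // 2                      # assumption: s_o / 2 rounded down
    lo = s_pad + half                    # smallest allowed centre coordinate
    hi = s_img - s_pad - half            # largest allowed centre coordinate
    centres = [(x, y) for y in range(lo, hi + 1) for x in range(lo, hi + 1)]
    images = np.zeros((len(centres), s_img, s_img), dtype=np.float32)
    for i, (x, y) in enumerate(centres):
        images[i, y - half:y + half + 1, x - half:x + half + 1] = 1.0
    return images, np.array(centres)

def quadrant_split(centres, s_img=80, test_quadrant=3):
    """Return boolean (train, test) masks; one quadrant is held out for testing.

    Quadrants are indexed 0..3 by (x >= mid) + 2 * (y >= mid), which is an
    assumption about how the split is defined.
    """
    mid = s_img / 2
    quadrant = (centres[:, 0] >= mid).astype(int) + 2 * (centres[:, 1] >= mid).astype(int)
    test_mask = quadrant == test_quadrant
    return ~test_mask, test_mask
```

Calling `make_square_dataset()` followed by `quadrant_split(centres)` yields training images whose object centres never fall in the held-out quadrant, so the test set probes positions unseen during training.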

Evaluation.

For each experiment, we fix all but one variable from the following set: encoder RF size $s_{\psi}$, decoder RF size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, and vary the remaining variable. For each value of the investigated variable we perform 20 experiments with different random seeds, recording the learned position error $\Delta$ as the absolute difference between the learned position $z$ and the centre of the object $z_{GT}$ (maximum over horizontal and vertical differences, and over the test set). We discard a run if its reconstruction accuracy is below $99.9\%$, so that we only consider runs where the object has been detected successfully (the threshold is this high because the square is on the order of $7\times 7$ px and thus comprises only $0.8\%$ of the image).
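The error measure and the discard rule described above amount to a maximum of absolute coordinate differences followed by a threshold on reconstruction accuracy. A minimal sketch, assuming positions are stored as arrays of shape `(n_images, 2)` in pixels; the function names are hypothetical.

```python
import numpy as np

def max_position_error(z_pred, z_gt):
    """Largest absolute difference over both axes and over all test images."""
    z_pred = np.asarray(z_pred, dtype=float)
    z_gt = np.asarray(z_gt, dtype=float)
    return float(np.max(np.abs(z_pred - z_gt)))

def keep_successful_runs(errors, accuracies, min_accuracy=0.999):
    """Keep the error of each run whose reconstruction accuracy is at least 99.9%."""
    return [e for e, a in zip(errors, accuracies) if a >= min_accuracy]

# A 7x7 square covers 49 of the 80x80 = 6400 pixels, i.e. roughly 0.8% of the
# image, so an all-black reconstruction would already be about 99.2% accurate.
square_fraction = 49 / 6400
```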

Architecture.

We parametrise both the encoder $\psi$ and the decoder $\phi$ as 5-layer CNNs with Batch Normalisation and ReLU activations, 32 channels, and filter sizes in $\{1\times 1, 3\times 3, 5\times 5, 7\times 7\}$, such that their receptive field sizes equal $s_{\psi}$ and $s_{\phi}$ respectively. We train each network for 500 epochs using the Adam optimiser with learning rate $10^{-3}$ and batch size 128. We train each experiment on a single GPU for around 6 hours with <6GB memory on an internal cluster.
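To illustrate how filter sizes can be chosen to hit a target receptive field, the sketch below uses the standard rule that a stack of stride-1, undilated convolutions with kernel sizes $k_1,\ldots,k_n$ has receptive field $1+\sum_i (k_i-1)$. Stride 1 and no dilation are assumptions, as the text does not state them; the function names are hypothetical and the search simply returns the first matching combination.

```python
from itertools import combinations_with_replacement

ALLOWED_KERNELS = (1, 3, 5, 7)

def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def kernels_for_receptive_field(target_rf, n_layers=5, allowed=ALLOWED_KERNELS):
    """Return one list of n_layers kernel sizes with the target RF, or None."""
    ordered = sorted(allowed, reverse=True)
    for combo in combinations_with_replacement(ordered, n_layers):
        if receptive_field(combo) == target_rf:
            return list(combo)
    return None  # e.g. even targets or targets above 31 are unreachable
```

Under these assumptions, five layers with kernels from the stated set can realise every odd receptive field size from 1 to 31 px.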

C.2 CLEVR Experiments

Dataset.

Our CLEVR experiments use data generated with the CLEVR [12] image generation script. Our training and test sets consist of 150 and 50 images respectively, containing red, green and blue metallic spheres on a random background, at random positions, and with different ranges of sizes (fig. 8). For experiments measuring position error as a function of encoder and decoder RF sizes and Gaussian s.d., we use a dataset with object sizes between 6-10 px after perspective distortion (fig. 8(a)). For the experiment measuring position error as a function of object size, we use 5 datasets with object sizes 9-14 px, 11-17 px, 13-19 px, 15-24 px, and 17-27 px after perspective distortion (figs. 8(b), 8(c)).

(a) Object sizes 6-10 px.
(b) Object sizes 9-14 px.
(c) Object sizes 17-27 px.
Figure 8: Samples from our CLEVR datasets, with object sizes (a) 6-10 px, (b) 9-14 px, (c) 17-27 px.

Evaluation.

For each experiment, we fix all but one variable from the following set: encoder RF size $s_{\psi}$, decoder RF size $s_{\phi}$, object size $s_{o}$, and Gaussian standard deviation $\sigma_{G}$, and vary the remaining variable. For each value of the investigated variable we perform experiments with different random seeds, recording the learned position errors $\Delta$ as the absolute difference between the learned position $z$ of each object and the centre of the object $z_{GT}$ (maximum over horizontal and vertical differences and over the test set, for each object). We only consider results for objects that have been learned successfully. For experiments measuring position error as a function of encoder and decoder RF sizes and Gaussian s.d., we consider an object to be learned if the position error is less than 35 px, only a single variable corresponds to the object, and the position error is stable over consecutive training iterations. For the experiment measuring position error as a function of object size, we only consider runs where all 3 objects have been learned, i.e. where the reconstruction accuracy is higher than 98.0% for the dataset with object sizes 6-10 px, 98.8% for the dataset with sizes 11-17 px, 98.0% for the dataset with sizes 13-19 px, 97.5% for the dataset with sizes 15-24 px, and 98.0% for the dataset with sizes 17-27 px.

Architecture.

We parametrise both the encoder $\psi$ and the decoder $\phi$ as 5-layer CNNs with Batch Normalisation and ReLU activations, 32 channels, and filter sizes in $\{1\times 1, 3\times 3, 5\times 5, 7\times 7\}$, such that their receptive field sizes equal $s_{\psi}$ and $s_{\phi}$ respectively. We train each network until convergence using the Adam optimiser with learning rate $10^{-2}$ and batch size 128 for the experiments measuring position error as a function of encoder and decoder RF size and Gaussian s.d., and with learning rate $10^{-3}$ and batch size 150 for the experiment measuring position error as a function of object size. We train each model on a single GPU for less than a day with <8GB memory on an internal cluster.

Baselines.

We also evaluate the results for two state-of-the-art baselines, SAM [13] and CutLER [23]. First, for each of our 6 datasets (containing objects of sizes 6-10 px, …, 17-27 px), we combine its training and test set to create 4 sets of 50 images each. We then apply SAM and CutLER to all images in each 50-image set, recording the position errors $\Delta$ as the absolute difference between the predicted position $z$ of each object and the centre of the object $z_{GT}$ (maximum over horizontal and vertical differences and over the 50-image set, for each object). For SAM, we take the predicted object positions to be the centres of the bounding boxes corresponding to the second, third and fourth predicted masks (the first mask corresponds to the background). For CutLER, we take the predicted object positions to be the centres of the predicted bounding boxes if the method predicts 3 bounding boxes (one for each object); otherwise we discard the prediction. For both methods, we discard any result where the position error is greater than 35 px, to be consistent with the evaluation for our method. This results in 12 position error values (3 objects $\times$ 4 data splits) for each method and each of our 6 datasets.
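One possible reading of the baseline evaluation above is sketched below. Several details are assumptions not stated in the text: boxes are given as `(x0, y0, x1, y1)` in pixels, the predicted boxes are already ordered to match the ground-truth objects, and errors above 35 px are dropped before the maximum over the image set is taken. The function names are hypothetical and no SAM or CutLER API is used.

```python
import numpy as np

MAX_ERROR_PX = 35.0  # discard threshold quoted in the text

def box_centres(boxes):
    """Centres of boxes given as rows (x0, y0, x1, y1)."""
    boxes = np.asarray(boxes, dtype=float)
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

def per_object_errors(boxes, gt_centres, n_objects=3):
    """Per-object max of horizontal/vertical error, or None if the number
    of predicted boxes differs from n_objects (prediction discarded)."""
    if len(boxes) != n_objects:
        return None
    diff = np.abs(box_centres(boxes) - np.asarray(gt_centres, dtype=float))
    return diff.max(axis=1)

def max_error_over_set(per_image_errors):
    """Per-object maximum over an image set, ignoring discarded entries."""
    kept = [e for e in per_image_errors if e is not None]
    if not kept:
        return None
    errs = np.stack(kept)                                # (n_images, n_objects)
    errs = np.where(errs > MAX_ERROR_PX, np.nan, errs)   # drop errors > 35 px
    return np.nanmax(errs, axis=0)
```

Applying `max_error_over_set` to each of the 4 image sets gives 3 values per set, matching the 12 values per method and dataset mentioned above.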

Appendix D CLEVR Experiments with Different Shapes

To demonstrate that our method applies to objects of any shape, in this section we include experiments on our CLEVR dataset with three distinct objects – a red metallic sphere, a blue rubber cylinder and a green rubber cube. The dataset has objects of size 9-19 px after perspective distortion, and a sample from the dataset is shown in fig. 9(a). We perform training and evaluation in the same way as in app. C.2. We plot the position error as a function of the encoder and decoder receptive field sizes in figs. 9(b) and 9(c), respectively. We can observe that all our experimental data (red) lies within the bounds predicted by our theory (blue), successfully validating our theory for objects with different shapes.

(a) Dataset Sample.
(b) Error vs. Encoder RF.
(c) Error vs. Decoder RF.
Figure 9: CLEVR experiment results using (a) a dataset with 3 objects of different shapes (a sphere, a cube, and a cylinder), showing position error as a function of (b) the encoder receptive field size $s_{\psi}$ and (c) the decoder receptive field size $s_{\phi}$, with the remaining factors fixed to $s_{\psi}=9$, $s_{\phi}=25$, object size $s_{o}\in[9,19]$ and Gaussian s.d. $\sigma_{G}=0.8$. Theoretical bounds are shown in blue, experimental results in red, the SAM baseline in green, and the CutLER baseline in orange.

Appendix E Experiments with Real Videos

In this section we present experimental results of applying our method to real YouTube videos, including overhead traffic footage and mini pool game footage.

E.1 Traffic Data

In this experiment we aim to learn the position of a car from an overhead traffic video. The training and test sets for this experiment consist of 25 frames each, from a video of an overhead view of a car moving for a short distance in a single lane (fig. 10(a)). We train the architecture from fig. 1 on this training set and validate it on the test set. It achieves a mean squared error between the ground truth object positions and the learned object positions of $7.2\cdot 10^{-5}$ (in units normalised by the image size), demonstrating that the object position has been learned successfully with a very low error. We then modify the learned position variable and decode it to generate videos of the car at novel positions (figs. 10(b), 10(c)).

(a) Training data.
(b) Generated data (steady speed in a different lane).
(c) Generated data (lane change and acceleration).
Figure 10: Road traffic video used for training, together with two videos generated after training by modifying and decoding the learned latent variables (video frames are superimposed). The car object is detected successfully and is used to generate realistic videos with objects at unseen positions.

E.2 Mini Pool

In this experiment we aim to learn the positions of balls from a video of a game of mini pool. The training and test sets for this experiment consist of 15 and 11 frames respectively from a video of a game of mini pool, cropped to a portion where two balls are moving at the same time (fig. 11(a)). We train the architecture from fig. 1 on this training set and validate it on the test set, where it achieves a mean squared error between the ground truth object positions and the learned object positions of $8.2\cdot 10^{-3}$ (in normalised units). This demonstrates that the object positions were learned successfully with a very low error. We then modify the learned position variables and decode them to generate videos of the balls at novel positions (figs. 11(b), 11(c)).

In practice, it was important to set the encoder and decoder receptive field sizes to be greater than but close to the size of the objects, as for larger RF sizes the position error increased unnecessarily and the images were rendered further away from the position given by the latent variables. Also, for large receptive field sizes, the decoder filter contained too much background which caused low quality reconstructions when rendering the balls near the mini pool table edges.

(a) Training data.
(b) Generated data (linear motion at unseen positions).
(c) Generated data (collision and slowing down).
Figure 11: Mini pool video used for training, together with two videos generated after training by modifying and decoding the learned latent variables (video frames are superimposed). Both ball objects are detected successfully and are used to generate realistic videos.