Abstract
Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.
Notes
- 1.
Although failing to satisfy the separation property (2.i) can have serious practical consequences, recall that a pseudodistance always becomes a full-fledged distance on the quotient space \(\mathcal {X}/\mathcal {R}\), where \(\mathcal {R}\) denotes the equivalence relation \(x\mathcal {R}y\Leftrightarrow {d(x,y)} = 0\). All the theory applies as long as one never distinguishes two points separated by a zero distance.
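As a concrete illustration (ours, not from the original note), take \(\mathcal {X}=\mathbb {R}^2\) with the pseudodistance that compares only first coordinates,
\[ d\big((x_1,x_2),\,(y_1,y_2)\big)=|x_1-y_1|. \]
Distinct points sharing a first coordinate are at distance zero, so (2.i) fails; on the quotient \(\mathcal {X}/\mathcal {R}\), whose classes are the vertical lines \(\{x_1=c\}\), \(d\) is a genuine distance.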
- 2.
We use the pushforward notation \(f_{\#}\mu \) to denote the probability distribution obtained by applying function f or expression f(x) to samples of the distribution \(\mu \).
- 3.
Stochastic gradient descent often relies on unbiased gradient estimates (for a more general condition, see [10, Assumption 4.3]). This is not a given: estimating the Wasserstein distance (14) and its gradients on small minibatches gives severely biased estimates [7]. This is in fact very obvious for minibatches of size one. Theorem 2.1 therefore provides an imperfect but useful alternative.
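The bias is easy to exhibit numerically. A minimal sketch (our illustration, using SciPy's one-dimensional Wasserstein distance; the Gaussian samples and batch sizes are arbitrary choices): both minibatches are drawn from the same distribution, so the true distance is zero, yet the expected estimate stays strictly positive and decays only slowly with the batch size.

```python
# Minibatch estimates of W1(mu, mu): the true distance is 0, the estimates are not.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
for batch_size in (1, 10, 100, 1000):
    estimates = [
        wasserstein_distance(rng.normal(size=batch_size),
                             rng.normal(size=batch_size))
        for _ in range(200)
    ]
    # For batch size one the estimate is |x - y|, whose mean is 2/sqrt(pi) > 0.
    print(f"batch {batch_size:4d}: mean estimate {np.mean(estimates):.3f}")
```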
- 4.
The statement holds when there is an \(M > 0\) such that \(\mu \{x:|\textit{f}\,(q(x)/p(x))| > M\} = 0\). Restricting \(\mu \) to exclude such subsets and taking the limit \(M\rightarrow \infty \) may not work because, in general, \(\lim \sup \ne \sup \lim \). Yet, in practice, the result can be verified by elementary calculus for the usual choices of \(\textit{f}\), such as those shown in Table 1.
- 5.
We take the square root because this is the quantity that behaves like a distance.
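A one-line reminder (ours) of why the squared quantity cannot itself be a distance: squaring breaks the triangle inequality. On the real line,
\[ |0-2|^2 = 4 \;>\; |0-1|^2 + |1-2|^2 = 2, \]
whereas the square root \(|x-y|\) satisfies it.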
- 6.
The curious reader can pick an expression of \(F_d(t)=P\{\Vert x-x_i\Vert <t\}\) in [23], then derive an asymptotic bound for \(P\{\min _i\Vert x-x_i\Vert <t\}=1-(1-F_d(t))^n\).
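A Monte Carlo sketch of the phenomenon this bound quantifies (our illustration, on the unit cube rather than the hypersphere of [23]; the sample sizes are arbitrary choices): in high dimension, the distance to the nearest of \(n\) samples concentrates at large values.

```python
# Nearest-neighbor distance min_i ||x - x_i|| among n uniform samples
# grows with the dimension d (unit cube here, not the hypersphere of [23]).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100
for d in (2, 10, 100):
    x = rng.random((trials, 1, d))      # one query point per trial
    xi = rng.random((trials, n, d))     # n reference samples per trial
    dmin = np.linalg.norm(xi - x, axis=-1).min(axis=1)
    print(f"d={d:3d}: mean nearest-neighbor distance {dmin.mean():.3f}")
```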
- 7.
Note that it is then important to use the \(\log (D)\) trick succinctly discussed in the original GAN paper [20].
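To recall what the trick buys (our sketch, expressed through the discriminator logit \(a\) with \(D=\sigma (a)\); the probe values are arbitrary): the saturating generator loss \(\log (1-D(G(z)))\) has a vanishing gradient exactly when the discriminator confidently rejects generated samples, while the non-saturating loss \(-\log D(G(z))\) keeps a gradient of order one there.

```python
# Generator loss gradients w.r.t. the discriminator logit a, with D = sigmoid(a).
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):      # d/da of log(1 - sigmoid(a)) = -sigmoid(a): vanishes as a -> -inf
    return -sigmoid(a)

def grad_nonsaturating(a):   # d/da of -log(sigmoid(a)) = sigmoid(a) - 1: stays near -1
    return sigmoid(a) - 1.0

for a in (-8.0, -2.0, 0.0):  # from confident rejection to an undecided discriminator
    print(f"logit {a:+.0f} (D={sigmoid(a):.4f}): "
          f"saturating {grad_saturating(a):+.4f}, "
          f"non-saturating {grad_nonsaturating(a):+.4f}")
```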
- 8.
See [54] for the relation between Energy Distance and Cramér distance.
- 9.
For instance, the set of probability measures on \(\mathbb {R}\) equipped with the total variation distance (6) is not separable, because any dense subset needs one element in each of the uncountably many disjoint balls \(B_x=\{\,P{\in }\mathcal {P}_{\!{\mathbb {R}}}:D_{TV}(P,\delta _x) < 1/2\,\}\).
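Spelling out (our added step) why these balls are disjoint: \(D_{TV}(\delta _x,\delta _y)=1\) whenever \(x\ne y\), so a measure \(P\) lying in both \(B_x\) and \(B_y\) would violate the triangle inequality,
\[ 1 = D_{TV}(\delta _x,\delta _y) \le D_{TV}(\delta _x,P) + D_{TV}(P,\delta _y) < \tfrac{1}{2}+\tfrac{1}{2} = 1. \]
Since there is one such ball per real number \(x\), no countable subset can be dense.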
- 10.
References
Aizerman, M.A., Braverman, É.M., Rozonoér, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837 (1964)
Amari, S.I., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Society (2007)
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 7–9 August 2017
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)
Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017)
Auffinger, A., Ben Arous, G.: Complexity of random smooth functions of many variables. Ann. Probab. 41(6), 4214–4247 (2013)
Bellemare, M.G., et al.: The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017)
Berti, P., Pratelli, L., Rigo, P., et al.: Gluing lemmas and Skorokhod representations. Electron. Commun. Probab. 20 (2015)
Borkar, V.S.: Stochastic approximation with two time scales. Syst. Control Lett. 29(5), 291–294 (1997)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. CoRR abs/1606.04838 (2016)
Bouchacourt, D., Mudigonda, P.K., Nowozin, S.: DISCO nets: DISsimilarity cOefficients networks. In: Advances in Neural Information Processing Systems, vol. 29, pp. 352–360 (2016)
Burago, D., Burago, Y., Ivanov, S.: A Course in Metric Geometry. AMS Graduate Studies in Mathematics, vol. 33. American Mathematical Society (2001)
Challis, E., Barber, D.: Affine independent variational inference. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 2186–2194. Curran Associates, Inc. (2012)
Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
Denton, E., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1486–1494. Curran Associates, Inc. (2015)
Dereich, S., Scheutzow, M., Schottstedt, R.: Constructive quantization: approximation by empirical measures. Annales de l’I.H.P. Probabilités et statistiques 49(4), 1183–1203 (2013)
Dziugaite, G.K., Roy, D.M., Ghahramani, Z.: Training generative neural networks via maximum mean discrepancy optimization. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pp. 258–267 (2015)
Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162(3), 707–738 (2015)
Freeman, C.D., Bruna, J.: Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540 (2016)
Goodfellow, I.J., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017)
Hammersley, J.M.: The distribution of distance in a hypersphere. Ann. Math. Stat. 21(3), 447–452 (1950)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2009)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
Khinchin, A.Y.: Sur la loi des grands nombres. Comptes Rendus de l’Académie des Sciences (1929)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR abs/1312.6114 (2013)
Kocaoglu, M., Snyder, C., Dimakis, A.G., Vishwanath, S.: CausalGAN: learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023 (2017)
Konda, V.R., Tsitsiklis, J.N.: Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab. 14(2), 796–819 (2004)
Kulkarni, T.D., Kohli, P., Tenenbaum, J.B., Mansinghka, V.: Picture: A probabilistic programming language for scene perception. In: Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, CVPR 2015, pp. 4390–4399 (2015)
Lee, M.W., Nevatia, R.: Dynamic human pose estimation using Markov Chain Monte Carlo approach. In: 7th IEEE Workshop on Applications of Computer Vision/IEEE Workshop on Motion and Video Computing (WACV/MOTION 2005), pp. 168–175 (2005)
Li, C.L., Chang, W.C., Cheng, Y., Yang, Y., Póczos, B.: MMD GAN: towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584 (2017)
Li, Y., Swersky, K., Zemel, R.: Generative moment matching networks. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML 2015, vol. 37, pp. 1718–1727 (2015)
Liu, S., Bousquet, O., Chaudhuri, K.: Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991 (2017). To appear in NIPS 2017
Milgrom, P., Segal, I.: Envelope theorems for arbitrary choice sets. Econometrica 70(2), 583–601 (2002)
von Mises, R.: On the asymptotic distribution of differentiable statistical functions. Ann. Math. Stat. 18(3), 309–348 (1947)
Müller, A.: Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 29(2), 429–443 (1997)
Neal, R.M.: Annealed importance sampling. Stat. Comput. 11(2), 125–139 (2001)
Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor. 56(11), 5847–5861 (2010)
Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems, vol. 29, pp. 271–279 (2016)
Rachev, S.T., Klebanov, L., Stoyanov, S.V., Fabozzi, F.: The Methods of Distances in the Theory of Probability and Statistics. Springer, New York (2013)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pp. 1278–1286 (2014)
Romaszko, L., Williams, C.K., Moreno, P., Kohli, P.: Vision-as-inverse-graphics: obtaining a rich 3D explanation of a scene from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 851–859 (2017)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242 (2016)
Schoenberg, I.J.: Metric spaces and positive definite functions. Trans. Am. Math. Soc. 44, 522–536 (1938)
Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)
Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York; Chichester (1980)
Sriperumbudur, B.: On the optimal estimation of probability measures in weak and strong topologies. Bernoulli 22(3), 1839–1893 (2016)
Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.: On the empirical estimation of integral probability metrics. Electron. J. Stat. 6, 1550–1599 (2012)
Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)
Székely, G.J., Rizzo, M.L.: Energy statistics: a class of statistics based on distances. J. Stat. Plan. Infer. 143(8), 1249–1272 (2013)
Székely, G.J.: E-statistics: the energy of statistical samples. Technical report 02–16, Bowling Green State University, Department of Mathematics and Statistics (2002)
Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: International Conference on Learning Representations (2016)
Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin (2009)
Zinger, A.A., Kakosyan, A.V., Klebanov, L.B.: A characterization of distributions by mean values of statistics and certain probabilistic metrics. J. Sov. Math. 4(59), 914–920 (1992). Translated from Problemy Ustoichivosti Stokhasticheskikh Modelei-Trudi seminara, pp. 47–55 (1989)
Acknowledgements
We would like to thank Joan Bruna, Marco Cuturi, Arthur Gretton, Yann Ollivier, and Arthur Szlam for stimulating discussions and also for pointing out numerous related works.
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Bottou, L., Arjovsky, M., Lopez-Paz, D., Oquab, M. (2018). Geometrical Insights for Implicit Generative Modeling. In: Rozonoer, L., Mirkin, B., Muchnik, I. (eds.) Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Lecture Notes in Computer Science, vol. 11100. Springer, Cham. https://doi.org/10.1007/978-3-319-99492-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99491-8
Online ISBN: 978-3-319-99492-5