Chain rule optimal transport. (English) Zbl 1473.49052

Nielsen, Frank (ed.), Progress in information geometry. Theory and applications. Based on discussions of the Fourth International Conference on Geometric Science of Information, GSI'2019, Toulouse, France, August 27–29, 2019. Cham: Springer. Signals Commun. Technol., 191-217 (2021).
Summary: We define a novel class of distances between statistical multivariate distributions by modeling an optimal transport problem on their marginals with respect to a ground distance defined on their conditionals. These new distances are metrics whenever the ground distance between the marginals is a metric, generalize both the Wasserstein distances between discrete measures and a recently introduced metric distance between statistical mixtures, and provide an upper bound for jointly convex distances between statistical mixtures. By entropic regularization of the optimal transport, we obtain a fast differentiable Sinkhorn-type distance. We experimentally evaluate our new family of distances by quantifying the upper bounds of several jointly convex distances between statistical mixtures, and by proposing a novel efficient method to learn Gaussian mixture models (GMMs) by simplifying kernel density estimators with respect to our distance. Our GMM learning technique experimentally improves significantly over the EM implementation of sklearn on the MNIST and Fashion MNIST datasets.
For the entire collection see [Zbl 1473.53007].
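The construction the summary describes can be sketched in a few lines: build a ground cost matrix from pairwise distances between the component (conditional) distributions, then run entropic-regularized (Sinkhorn) optimal transport on the mixture weights (the marginals). The sketch below is illustrative only and is not the authors' code; it assumes univariate Gaussian components with the closed-form 2-Wasserstein ground distance, and the names `gaussian_w2`, `sinkhorn`, and the parameter `eps` are ours.

```python
import numpy as np

def gaussian_w2(m0, s0, m1, s1):
    # Closed-form 2-Wasserstein distance between univariate Gaussians:
    # W2^2 = (m0 - m1)^2 + (s0 - s1)^2
    return np.sqrt((m0 - m1) ** 2 + (s0 - s1) ** 2)

def sinkhorn(a, b, C, eps=0.1, n_iter=200):
    # Entropic-regularized OT between weight vectors a, b with ground
    # cost matrix C, solved by Sinkhorn's matrix-scaling iterations.
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # (approximately) optimal plan
    return np.sum(P * C)              # regularized transport cost

# Two Gaussian mixtures: (weights, means, standard deviations)
a = np.array([0.7, 0.3]); mu_a = [0.0, 4.0]; sd_a = [1.0, 0.5]
b = np.array([0.4, 0.6]); mu_b = [0.5, 3.0]; sd_b = [1.0, 1.0]

# Ground cost: pairwise W2 distances between component Gaussians.
C = np.array([[gaussian_w2(mu_a[i], sd_a[i], mu_b[j], sd_b[j])
               for j in range(2)] for i in range(2)])

d = sinkhorn(a, b, C)  # a fast, differentiable Sinkhorn-type distance
```

Because the entries of the transport plan sum to one, the resulting cost is a convex combination of ground distances, which is the mechanism behind the upper bounds on jointly convex distances mentioned in the summary.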

MSC:

49Q22 Optimal transportation

References:

[1] Xie, L., Ugrinovskii, V.A., Petersen, I.R.: Probabilistic distances between finite-state finite-alphabet hidden Markov models. IEEE Trans. Autom. Control 50(4), 505 (2005) · Zbl 1365.60066 · doi:10.1109/TAC.2005.844896
[2] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Technical report, Zalando Research, Berlin, Germany (2017). arXiv:cs.LG/1708.07747
[3] Vigelis, R.F., De Andrade, L.H., Cavalcante, C.C.: Properties of a generalized divergence related to Tsallis generalized divergence. IEEE Trans. Inf. Theory 66(5), 2891-2897 (2019) · Zbl 1448.94120 · doi:10.1109/TIT.2019.2953029
[4] Vaserstein, L.N.: Markov processes over denumerable products of spaces, describing large systems of automata. Probl. Peredachi Informatsii 5(3), 64-72 (1969) · Zbl 0273.60054
[5] Van Erven, T., Harremos, P.: Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797-3820 (2014) · Zbl 1360.94180 · doi:10.1109/TIT.2014.2320500
[6] Takatsu, A., et al.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005-1026 (2011) · Zbl 1245.60013
[7] Singer, Y., Warmuth, M.K.: Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy. In: NIPS, pp. 578-584 (1999)
[8] Silva, J., Narayanan, S.: Upper bound Kullback-Leibler divergence for hidden Markov models with application as discrimination measure for speech recognition. In: IEEE International Symposium on Information Theory (ISIT), pp. 2299-2303. IEEE (2006)
[9] Schwander, O., Nielsen, F.: Learning mixtures by simplifying kernel density estimators. In: Matrix Information Geometry, pp. 403-426. Springer (2013) · Zbl 1269.94015
[10] Santambrogio, F.: Optimal Transport for Applied Mathematicians, pp. 99-102. Birkhäuser, NY (2015) · Zbl 1401.49002 · doi:10.1007/978-3-319-20828-2
[11] Rüschendorf, L.: The Wasserstein distance and approximation theorems. Probab. Theory Relat. Fields 70, 117-129 (1985) · Zbl 0554.60024 · doi:10.1007/BF00532240
[12] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 40(2), 99-121 (2000) · Zbl 1012.68705 · doi:10.1023/A:1026543900054
[13] Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10(1-3), 19-41 (2000) · doi:10.1006/dspr.1999.0361
[14] Pitrik, J., Virosztek, D.: On the joint convexity of the Bregman divergence of matrices. Lett. Math. Phys. 105(5), 675-692 (2015) · Zbl 1330.47026 · doi:10.1007/s11005-015-0757-y
[15] Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends® in Mach. Learn. 11(5-6), 355-607 (2019)
[16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825-2830 (2011) · Zbl 1280.68189
[17] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825-2830 (2011) · Zbl 1280.68189
[18] Ozawa, R., Yokota, T.: Stability of RCD condition under concentration topology. J. Phys. A: Math. Theor. 45(3), 032003 (2011) · Zbl 1420.53052
[19] Österreicher, F., Vajda, I.: A new class of metric divergences on probability spaces and its applicability in statistics. Ann. Inst. Stat. Math. 55(3), 639-653 (2003) · Zbl 1052.62002 · doi:10.1007/BF02517812
[20] Nielsen, M.A., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2002) · Zbl 1049.81015
[21] Nielsen, F., Sun, K.: Guaranteed deterministic bounds on the total variation distance between univariate mixtures. IEEE Mach. Learn. Signal Process. (MLSP) 1-6 (2018)
[22] Nielsen, F., Sun, K.: Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 18(12), 442 (2016) · doi:10.3390/e18120442
[23] Nielsen, F., Nock, R.: On \(w\)-mixtures: finite convex combinations of prescribed component distributions (2017). arXiv:1708.00568
[24] Nielsen, F., Nock, R.: On the chi square and higher-order chi distances for approximating \(f\)-divergences. IEEE Signal Process. Lett. 21(1), 10-13 (2014) · doi:10.1109/LSP.2013.2288355
[25] Nielsen, F., Nock, R.: On Rényi and Tsallis entropies and divergences for exponential families (2011). arXiv:1105.3259
[26] Nielsen, F., Nock, R.: A closed-form expression for the Sharma-Mittal entropy of exponential families. J. Phys. A: Math. Theor. 45(3), 032003 (2011) · Zbl 1235.81039
[27] Nielsen, F., Garcia, V.: Statistical exponential families: a digest with flash cards (2009). arXiv:0911.4863
[28] Nielsen, F.: The statistical Minkowski distances: closed-form formula for Gaussian mixture models (2019). arXiv:1901.03732 · Zbl 1458.62012
[29] Nielsen, F.: Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 42, 25-34 (2014) · doi:10.1016/j.patrec.2014.01.002
[30] Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 1723-1726. IEEE (2012)
[31] Nielsen, F.: A family of statistical symmetric divergences based on Jensen’s inequality (2010). arXiv:1009.4004
[32] Monge, G.: Mémoire sur la théorie des déblais et des remblais. Imprimerie Royale (1781)
[33] Liu, Z., Huang, Q.: A new distance measure for probability distribution function of mixture type. In: ICASSP, vol. 1, pp. 616-619. IEEE (2000)
[34] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278-2324 (1998) · doi:10.1109/5.726791
[35] Korte, B., Vygen, J.: Linear programming algorithms. In: Combinatorial Optimization, pp. 75-102. Springer (2018) · Zbl 1390.90001
[36] Komaki, F.: Bayesian prediction based on a class of shrinkage priors for location-scale models. Ann. Inst. Stat. Math. 59(1), 135-146 (2007) · Zbl 1108.62026 · doi:10.1007/s10463-006-0102-4
[37] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
[38] Khosravifard, M., Fooladivanda, D., Gulliver, T.A.: Confliction of the convexity and metric properties in \(f\)-divergences. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 90(9), 1848-1853 (2007) · doi:10.1093/ietfec/e90-a.9.1848
[39] Kantorovitch, L.: On the translocation of masses. Manag. Sci. 5(1), 1-4 (1958) · Zbl 0995.90585 · doi:10.1287/mnsc.5.1.1
[40] Kantorovich, L.: On the transfer of masses. Doklady Akademii Nauk 37(2), 227-229 (1942). (in Russian)
[41] Hershey, J.R., Olsen, P.A.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: ICASSP, vol. 4, pp. IV-317. IEEE (2007)
[42] Goldberger, J., Gordon, S., Greenspan, H.: An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: IEEE International Conference on Computer Vision (ICCV), p. 487. IEEE (2003)
[43] Goldberger, J., Aronowitz, H.: A distance measure between GMMs based on the unscented transform and its application to speaker recognition. In: INTERSPEECH European Conference on Speech Communication and Technology, pp. 1985-1988 (2005)
[44] Ghaffari, N., Walker, S.: On multivariate optimal transportation (2018)
[45] Gelbrich, M.: On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147(1), 185-203 (1990) · Zbl 0711.60003 · doi:10.1002/mana.19901470121
[46] Gangbo, W., McCann, R.J.: The geometry of optimal transportation. Acta Math. 177(2), 113-161 (1996) · Zbl 0887.49017 · doi:10.1007/BF02392620
[47] Fuglede, B., Topsoe, F.: Jensen-Shannon divergence and Hilbert space embedding. In: International Symposium on Information Theory (ISIT 2004), p. 31. IEEE (2004)
[48] Flamary, R., Courty, N.: POT: Python optimal transport library (2017)
[49] Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., Peyré, G.: Interpolating between optimal transport and MMD using Sinkhorn divergences (2018). arXiv:1810.08278
[50] Everitt, B.: An Introduction to Latent Variable Models. Springer Science & Business Media (2013)
[51] Durrieu, J.-L., Thiran, J.-P., Kelly, F.: Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4833-4836. IEEE (2012)
[52] Dragomir, S.S.: Inequalities for Csiszár f-divergence in information theory. Victoria University, Melbourne, Australia (2000)
[53] Dowson, D.C., Landau, B.: The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 12(3), 450-455 (1982) · Zbl 0501.62038 · doi:10.1016/0047-259X(82)90077-X
[54] Do, M.N.: Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Process. Lett. 10(4), 115-118 (2003) · doi:10.1109/LSP.2003.809034
[55] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.), pp. 1-38 (1977) · Zbl 0364.62022
[56] Dacorogna, B., Maréchal, P.: The role of perspective functions in convexity, polyconvexity, rank-one convexity and separate convexity. J. Convex Anal. 15(2), 271 (2008) · Zbl 1152.26012
[57] Cuturi, M., Teboul, O., Vert, J.: Differentiable sorting using optimal transport: the Sinkhorn CDF and quantile operator (2019). arXiv:1905.11885
[58] Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: NIPS, pp. 2292-2300 (2013)
[59] Chen, Y., Georgiou, T.T., Tannenbaum, A.: Optimal transport for Gaussian mixture models. IEEE Access 7, 6269-6278 (2019) · doi:10.1109/ACCESS.2018.2889838
[60] Chang, K.-C., Sun, W.: Scalable fusion with mixture distributions in sensor networks. In: 11th International Conference on Control Automation Robotics & Vision (ICARCV), pp. 1251-1256 (2010)
[61] Borwein, J.M., Vanderwerff, J.D.: Convex Functions: Constructions, Characterizations and Counterexamples, vol. 109. Cambridge University Press, Cambridge (2010) · Zbl 1191.26001 · doi:10.1017/CBO9781139087322
[62] Bonneel, N., Rabin, J., Peyré, G., Pfister, H.: Sliced and Radon Wasserstein barycenters of measures. J. Math. Imaging Vis. 51(1), 22-45 (2015) · Zbl 1332.94014 · doi:10.1007/s10851-014-0506-3
[63] Bauschke, H.H., Borwein, J.M.: Joint and separate convexity of the Bregman distance. In: Studies in Computational Mathematics, vol. 8, pp. 23-36. Elsevier (2001) · Zbl 1160.65319
[64] Amari, S.-I.: Information Geometry and Its Applications. Applied Mathematical Sciences. Springer, Japan (2016) · Zbl 1350.94001 · doi:10.1007/978-4-431-55978-8
[65] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283 (2016)