Abstract
The performance of traditional graph Laplacian methods for semi-supervised learning degrades substantially as the ratio of labeled to unlabeled data decreases, due to a degeneracy in the graph Laplacian. Several approaches have recently been proposed to address this; however, we show that some of them remain ill-posed in the large-data limit. In this paper, we show how to correctly set the weights in Laplacian regularization so that the estimator remains well-posed and stable in the large-sample limit. We prove that our semi-supervised learning algorithm converges, in the infinite sample size limit, to the smooth solution of a continuum variational problem that attains the labeled values continuously. Our method is fast and easy to implement.
References
Ajtai, M., Komlós, J., Tusnády, G.: On optimal matchings. Combinatorica 4(4), 259–264 (1984)
Alamgir, M., Luxburg, U.V.: Phase transition in the family of p-resistances. In: Advances in Neural Information Processing Systems, pp. 379–387 (2011)
Ando, R.K., Zhang, T.: Learning on graph with Laplacian regularization. Adv. Neural Inf. Process. Syst. 19, 25 (2007)
Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Springer, New York (2008)
Belkin, M., Niyogi, P.: Using manifold structure for partially labeled classification. In: Advances in Neural Information Processing Systems (NIPS), pp. 953–960 (2003)
Bertozzi, A., Luo, X., Stuart, A., Zygalakis, K.: Uncertainty quantification in graph-based classification of high dimensional data. SIAM/ASA J. Uncertain. Quantif. 6(2), 568–595 (2018)
Braides, A.: \(\Gamma \)-Convergence for Beginners. Oxford Lecture Series in Mathematics and Its Applications, vol. 22. Oxford University Press, Oxford (2002)
Brandt, A., Livne, O.E.: Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, vol. 67. SIAM, Philadelphia (2011)
Bridle, N., Zhu, X.: p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. In: Eleventh Workshop on Mining and Learning with Graphs (MLG2013) (2013)
Calder, J.: The game theoretic p-Laplacian and semi-supervised learning with few labels. Nonlinearity 32(1), 301–330 (2018)
Calder, J.: Consistency of Lipschitz learning with infinite unlabeled data and finite labeled data. To appear in SIAM Journal on Mathematics of Data Science (2019)
Calder, J., Smart, C.K.: The limit shape of convex hull peeling. arXiv preprint arXiv:1805.08278 (2018)
Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning. MIT Press, London (2006)
Costa, J.A., Hero, A.O.: Determining intrinsic dimension and entropy of high-dimensional shape spaces. In: Statistics and Analysis of Shapes, pp. 231–252. Springer, New York (2006)
Dal Maso, G.: An introduction to \(\Gamma \)-Convergence. Progress in Nonlinear Differential Equations and Their Applications, vol. 8. Birkhäuser Boston Inc, Boston (1993)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. IEEE (2009)
Dunlop, M.M., Slepčev, D., Stuart, A.M., Thorpe, M.: Large data and zero noise limits of graph-based semi-supervised learning algorithms. Appl. Comput. Harmon. Anal. (2019). https://doi.org/10.1016/j.acha.2019.03.005
El Alaoui, A., Cheng, X., Ramdas, A., Wainwright, M.J., Jordan, M.I.: Asymptotic behavior of \(\ell ^p\)-based Laplacian regularization in semi-supervised learning. In: 29th Annual Conference on Learning Theory, pp. 879–906 (2016)
García Trillos, N., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)
García Trillos, N., Slepčev, D.: Continuum limit of total variation on point clouds. Arch. Ration. Mech. Anal. 220(1), 193–241 (2016)
García Trillos, N., Slepčev, D.: A variational approach to the consistency of spectral clustering. Appl. Comput. Harmon. Anal. 45(2), 239–281 (2018)
García Trillos, N., Slepčev, D., von Brecht, J., Laurent, T., Bresson, X.: Consistency of Cheeger and ratio graph cuts. J. Mach. Learn. Res. 17(1), 6268–6313 (2016)
García Trillos, N., Gerlach, M., Hein, M., Slepčev, D.: Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace-Beltrami operator. Found. Comput. Math. (2019). https://doi.org/10.1007/s10208-019-09436-w
Gilbarg, D., Trudinger, N.S.: Elliptic Partial Differential Equations of Second Order. Springer, New York (2015)
Greenbaum, A.: Iterative Methods for Solving Linear Systems, vol. 17. SIAM, Philadelphia (1997)
He, J., Li, M., Zhang, H.-J., Tong, H., Zhang, C.: Manifold-ranking based image retrieval. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 9–16. ACM (2004)
He, J., Li, M., Zhang, H.-J., Tong, H., Zhang, C.: Generalized manifold-ranking-based image retrieval. IEEE Trans. Image Process. 15(10), 3170–3177 (2006)
Hein, M., Audibert, J.-Y.: Intrinsic dimensionality estimation of submanifolds in \({\mathbb {R}}^d\). In: Proceedings of the 22nd International Conference on Machine Learning, pp. 289–296. ACM (2005)
Kyng, R., Rao, A., Sachdeva, S., Spielman, D.A.: Algorithms for Lipschitz learning on graphs. In: Proceedings of The 28th Conference on Learning Theory, pp. 1190–1223 (2015)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Leighton, T., Shor, P.: Tight bounds for minimax grid matching with applications to the average case analysis of algorithms. Combinatorica 9(2), 161–187 (1989)
Leoni, G.: A First Course in Sobolev Spaces, vol. 181. American Mathematical Society, Providence (2017)
Luxburg, U.V., Bousquet, O.: Distance-based classification with Lipschitz functions. J. Mach. Learn. Res. 5(Jun), 669–695 (2004)
Nadler, B., Srebro, N., Zhou, X.: Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data. In: Neural Information Processing Systems (NIPS) (2009)
Rudin, W.: Real and Complex Analysis. Tata McGraw-Hill Education, New York (2006)
Ruge, J.W., Stüben, K.: Algebraic multigrid. In: Multigrid Methods, pp. 73–130. SIAM (1987)
Santambrogio, F.: Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications, vol. 87. Birkhäuser, Cham (2015)
Shi, Z., Osher, S., Zhu, W.: Weighted nonlocal laplacian on interpolation from sparse data. J. Sci. Comput. 73(2–3), 1164–1177 (2017)
Shor, P.W., Yukich, J.E.: Minimax grid matching and empirical measures. Ann. Probab. 19(3), 1338–1348 (1991)
Singer, A.: From graph to manifold Laplacian: the convergence rate. Appl. Comput. Harmon. Anal. 21(1), 128–134 (2006)
Slepčev, D., Thorpe, M.: Analysis of p-Laplacian regularization in semi-supervised learning. SIAM J. Math. Anal. 51(3), 2085–2120 (2019)
Spielman, D.A., Teng, S.-H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 81–90. ACM (2004)
Talagrand, M.: Upper and Lower Bounds for Stochastic Processes. Modern Surveys in Mathematics, vol. 60. Springer, Berlin (2014)
Thorpe, M., Park, S., Kolouri, S., Rohde, G.K., Slepčev, D.: A transportation \(L^p\) distance for signal analysis. J. Math. Imaging Vis. 59(2), 187–210 (2017)
Villani, C.: Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften, vol. 338. Springer, Berlin (2009)
Wang, Y., Cheema, M.A., Lin, X., Zhang, Q.: Multi-manifold ranking: Using multiple features for better image retrieval. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 449–460. Springer, New York (2013)
Xu, B., Bu, J., Chen, C., Cai, D., He, X., Liu, W., Luo, J.: Efficient manifold ranking for image retrieval. In: Proceedings of the 34th international ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 525–534. ACM (2011)
Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.-H.: Saliency detection via graph-based manifold ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173 (2013)
Zhou, X., Belkin, M.: Semi-supervised learning by higher order regularization. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 892–900 (2011)
Zhou, D., Schölkopf, B.: Regularization on discrete spaces. In: Proceedings of the 27th DAGM Conference on Pattern Recognition, PR’05, pp. 361–368. Springer, Berlin (2005)
Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 16(16), 321–328 (2004)
Zhou, D., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B.: Ranking on data manifolds. Adv. Neural Inf. Process. Syst. 16, 169–176 (2004)
Zhou, D., Huang, J., Schölkopf, B.: Learning from labeled and unlabeled data on a directed graph. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 1036–1043. ACM (2005)
Zhou, X., Belkin, M., Srebro, N.: An iterated graph Laplacian approach for ranking on manifolds. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 877–885. ACM (2011)
Zhu, X., Ghahramani, Z., Lafferty, J., et al.: Semi-supervised learning using Gaussian fields and harmonic functions. Int. Conf. Mach. Learn. 3, 912–919 (2003)
Acknowledgements
Calder was supported by NSF Grant DMS:1713691. Slepčev acknowledges the NSF support (Grants DMS-1516677 and DMS-1814991). He is grateful to University of Minnesota, where this project started, for hospitality. He is also grateful to the Center for Nonlinear Analysis of CMU for its support. Both authors thank the referees for valuable suggestions.
Appendix A: Background Material
Here we recall some of the notions our work depends on and establish an auxiliary technical result.
1.1 \(\Gamma \)–Convergence
\(\Gamma \)-convergence was introduced by De Giorgi in the 1970s to study limits of variational problems. We refer to [7, 15] for a comprehensive introduction to \(\Gamma \)-convergence. We now recall the notion of \(\Gamma \)-convergence in a random setting.
Definition A.1
(\(\varGamma \)-convergence) Let (Z, d) be a metric space, \(L^0(Z;{\mathbb {R}}\cup \{\pm \infty \})\) be the set of measurable functions from Z to \({\mathbb {R}}\cup \{\pm \infty \}\), and \(({{\mathcal {X}}},{{\mathbb {P}}})\) be a probability space. Consider a random variable \({{\mathcal {X}}}\ni \omega \mapsto {{\mathcal {E}}}_n^{(\omega )} \in L^0(Z;{\mathbb {R}}\cup \{\pm \infty \})\). We say that \({{\mathcal {E}}}_n^{(\omega )}\) \(\Gamma \)-converges almost surely on the domain Z to \({{\mathcal {E}}}_\infty :Z\rightarrow {\mathbb {R}}\cup \{\pm \infty \}\) with respect to d, and write \({{\mathcal {E}}}_\infty = {{\,\mathrm{\Gamma \text {-}\lim }\,}}_{n \rightarrow \infty }{{\mathcal {E}}}_n^{(\omega )}\), if there exists a set \({{\mathcal {X}}}^\prime \subset {{\mathcal {X}}}\) with \({{\mathbb {P}}}({{\mathcal {X}}}^\prime ) = 1\), such that for all \(\omega \in {{\mathcal {X}}}^\prime \) and all \(f\in Z\):
(i) (liminf inequality) for every sequence \(\{u_n\}_{n=1,2,\ldots }\) in Z converging to f,
$$\begin{aligned} {{\mathcal {E}}}_\infty (f) \leqslant \liminf _{n\rightarrow \infty }{{\mathcal {E}}}_n^{(\omega )}(u_n), \text { and } \end{aligned}$$
(ii) (recovery sequence) there exists a sequence \(\{u_n\}_{n=1,2,\ldots }\) in Z converging to f such that
$$\begin{aligned} {{\mathcal {E}}}_\infty (f) \geqslant \limsup _{n\rightarrow \infty }{{\mathcal {E}}}_n^{(\omega )}(u_n). \end{aligned}$$
For simplicity we suppress the dependence on \(\omega \) when writing our functionals. The almost sure nature of the convergence in our claims is ensured by considering the set of realizations of \(\{x_i\}_{i=1,2,\ldots }\) for which the conclusions of Theorem A.3 hold (which they do almost surely).
An important result concerning \(\Gamma \)-convergence is that any subsequential limit of a sequence of minimizers of \({{\mathcal {E}}}_n\) is a minimizer of the limiting functional \({{\mathcal {E}}}_\infty \). Thus, to show that the minimizers of \({{\mathcal {E}}}_n\) converge, at least along a subsequence, to a minimizer of \({{\mathcal {E}}}_\infty \), it suffices to establish the precompactness of the set of minimizers. We make this precise in the theorem below; its proof can be found in [7, Theorem 1.21] or [15, Theorem 7.23].
Theorem A.2
(Convergence of Minimizers) Let (Z, d) be a metric space and \({{\mathcal {E}}}_n: Z\rightarrow [0,\infty ]\) be a sequence of functionals. Let \(u_n\) be a minimizing sequence for \({{\mathcal {E}}}_n\). If the set \(\{u_n\}_{n=1,2,\ldots }\) is precompact and \({{\mathcal {E}}}_\infty = {{\,\mathrm{\Gamma \text {-}\lim }\,}}_n{{\mathcal {E}}}_n\), where \({{\mathcal {E}}}_\infty :Z\rightarrow [0,\infty ]\) is not identically \(\infty \), then
$$\begin{aligned} \min _{Z} {{\mathcal {E}}}_\infty = \lim _{n\rightarrow \infty } {{\mathcal {E}}}_n(u_n). \end{aligned}$$
Furthermore, any cluster point of \(\{u_n\}_{n=1,2,\ldots }\) is a minimizer of \({{\mathcal {E}}}_\infty \).
The theorem is also true if we replace minimizers with approximate minimizers.
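To make Theorem A.2 concrete, here is a toy numerical sketch (ours, not from the paper) in which the minimizers of a simple sequence of functionals converge to the minimizer of the limit:

```python
# Toy sanity check (ours, not from the paper) of Theorem A.2.  The
# functionals E_n(u) = (u - 1/n)^2 + 1/n converge uniformly on R to
# E(u) = u^2 (hence also Gamma-converge), their minimizers u_n = 1/n
# converge to 0, the unique minimizer of the limit, and the minimal
# values E_n(u_n) = 1/n converge to min E = 0.

def E_n(u, n):
    return (u - 1.0 / n) ** 2 + 1.0 / n

def E_limit(u):
    return u ** 2

def minimizer_E_n(n):
    # E_n is a shifted parabola, minimized at u = 1/n.
    return 1.0 / n

minimizers = [minimizer_E_n(n) for n in (1, 10, 100, 1000)]  # -> 0
```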
We note that \(\Gamma \)-convergence is defined for functionals on a common metric space. Section A.3 describes the metric space we use to analyze the asymptotics of our semi-supervised learning models; in particular, it allows us to pass from discrete to continuum.
1.2 Optimal Transportation and Approximation of Measures
Here we recall the notion of optimal transportation between measures and the metric it induces. A comprehensive treatment of the topic can be found in the books of Villani [45] and Santambrogio [37].
Given a bounded, open set \(\Omega \subset {\mathbb {R}}^d\) and probability measures \(\mu \) and \(\nu \) in \({\mathcal {P}}({\overline{\Omega }})\), we define the set \(\Pi (\mu ,\nu )\) of transportation plans, or couplings, between \(\mu \) and \(\nu \) to be the set of probability measures \(\pi \in {\mathcal {P}}({\overline{\Omega }} \times \overline{\Omega })\) on the product space whose first marginal is \(\mu \) and second marginal is \(\nu \). We then define the p-optimal transportation distance (a.k.a. p-Wasserstein distance) by
$$\begin{aligned} d_p(\mu ,\nu ) = \min _{\pi \in \Pi (\mu ,\nu )} \left( \int _{{\overline{\Omega }}\times {\overline{\Omega }}} |x-y|^p \, d\pi (x,y) \right) ^{1/p} \end{aligned}$$
for \(1 \leqslant p < \infty \), and by \(d_\infty (\mu ,\nu ) = \min _{\pi \in \Pi (\mu ,\nu )} \mathop {\mathrm {ess\,sup}}_{\pi } |x-y|\) for \(p = \infty \).
If \(\mu \) is absolutely continuous with respect to the Lebesgue measure on \(\Omega \), then the distance can be rewritten using transportation maps \(T: \Omega \rightarrow \Omega \) instead of transportation plans,
$$\begin{aligned} d_p(\mu ,\nu ) = \min _{T :\, T_{\#}\mu = \nu } \left( \int _{\Omega } |x - T(x)|^p \, d\mu (x) \right) ^{1/p}, \end{aligned}$$
where \(T_{\#}\mu =\nu \) means that the push forward of the measure \(\mu \) by T is the measure \(\nu \), namely that T is Borel measurable and such that for all open \(U\subset \overline{\Omega }\), \(\mu (T^{-1}(U)) = \nu (U)\).
When \(p< \infty \) the metric \(d_p\) metrizes the \(\text {weak}^*\) convergence of measures.
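As an aside, for empirical measures with equally many atoms on the real line the distance \(d_p\) is easy to compute, since the optimal plan pairs the sorted samples. The sketch below (the helper name is ours, not from the paper) illustrates the definition:

```python
# Sketch (ours): p-Wasserstein distance between two empirical measures
# (1/n) sum_i delta_{x_i} and (1/n) sum_i delta_{y_i} on the real line.
# In one dimension the optimal plan matches the sorted samples, so
# d_p(mu, nu)^p = (1/n) sum_i |x_(i) - y_(i)|^p.

def wasserstein_1d(xs, ys, p=2):
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    n = len(xs)
    return (sum(abs(x - y) ** p for x, y in zip(xs, ys)) / n) ** (1.0 / p)

# Translating a measure by 0.5 moves it exactly distance 0.5:
d = wasserstein_1d([0.0, 1.0, 2.0], [0.5, 1.5, 2.5], p=2)  # d == 0.5
```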
Optimal transportation plays an important role in comparing the discrete and continuum objects we study. In particular, we use sharp estimates on the \(\infty \)-optimal transportation distance between a measure and the empirical measure of its sample. In the form below, for \(d \geqslant 2\), they were established in [19], which extended the related results in [1, 31, 39, 43].
Theorem A.3
For \(d \geqslant 2\), let \(\Omega \subset {\mathbb {R}}^d\) be open, connected and bounded with Lipschitz boundary. Let \(\mu \) be a probability measure on \(\Omega \) with density (with respect to Lebesgue measure) \(\rho \), bounded above and below by positive constants. Let \(x_1,x_2,\ldots \) be a sequence of independent random variables with distribution \(\mu \) and let \(\mu _n\) be the empirical measure. Then there exist constants \(C \geqslant c > 0\) such that, almost surely, there exists a sequence of transportation maps \(\{T_n\}_{n=1}^\infty \) from \(\mu \) to \(\mu _n\) with the property
$$\begin{aligned} c \leqslant \liminf _{n\rightarrow \infty } \frac{n^{1/d}\,\Vert T_n - \mathrm {Id}\Vert _{L^\infty (\Omega )}}{\ell _n} \leqslant \limsup _{n\rightarrow \infty } \frac{n^{1/d}\,\Vert T_n - \mathrm {Id}\Vert _{L^\infty (\Omega )}}{\ell _n} \leqslant C, \end{aligned}$$
where
$$\begin{aligned} \ell _n = {\left\{ \begin{array}{ll} (\log n)^{3/4} &{} \text {if } d = 2, \\ (\log n)^{1/d} &{} \text {if } d \geqslant 3. \end{array}\right. } \end{aligned}$$
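Transportation maps of this kind are easy to build explicitly in one dimension, where the monotone map is optimal for the \(\infty \)-transport cost. The sketch below (ours; it uses \(d=1\) for simplicity, whereas the theorem above assumes \(d\geqslant 2\)) constructs the monotone map from the uniform measure on [0, 1] to an empirical sample and measures the sup-norm displacement:

```python
import random

# Sketch (ours): monotone transport map from the uniform measure mu on
# [0,1] onto the empirical measure mu_n of a sample, in d = 1 for
# simplicity (Theorem A.3 is stated for d >= 2).  The map T_n sends the
# interval (i/n, (i+1)/n] to the (i+1)-th order statistic; the sup of
# |x - x_(i+1)| over that interval is attained at an endpoint.

def sup_displacement(sample):
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(abs(x - i / n), abs(x - (i + 1) / n))
        for i, x in enumerate(xs)
    )

rng = random.Random(0)  # fixed seed for reproducibility
disp = sup_displacement([rng.random() for _ in range(1000)])
# disp is ||T_n - Id||_inf for this sample; for n = 1000 it is small
# but always bounded below by 1/(2n), since each interval has width 1/n.
```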
1.3 The \(TL^p\) Space
The discrete functionals we consider (e.g. \({{\mathcal {E}}}_{n, \varepsilon _n, \zeta _n}\)) are defined for functions \(u_n : {{\mathcal {X}}}_n \rightarrow {\mathbb {R}}\), while the limit functional \({{\mathcal {E}}}\) acts on functions \(f:\Omega \rightarrow {\mathbb {R}}\), where \(\Omega \) is an open set. We can view \(u_n\) as elements of \(L^p(\mu _n)\) where \(\mu _n\) is the empirical measure of the sample \(\mu _n = \frac{1}{n} \sum _{i=1}^n \delta _{x_i}\). Likewise \(f \in L^p(\mu )\) where \(\mu \) is the measure with density \(\rho \) from which the data points are sampled. In order to compare f and \(u_n\) in a way that is consistent with the \(L^p\) topology we use the \(TL^p\) space that was introduced in [20], where it was used to study the continuum limit of the graph total variation. Subsequent development of the \(TL^p\) space has been carried out in [21, 44].
To compare the functions \(u_n\) and f above we need to take into account their domains, or more precisely to account for \(\mu \) and \(\mu _n\). For that purpose the space of configurations is defined to be
$$\begin{aligned} TL^p({\overline{\Omega }}) = \left\{ (\mu , f) \,:\, \mu \in {\mathcal {P}}({\overline{\Omega }}), \; f \in L^p(\mu ) \right\} . \end{aligned}$$
The metric on the space is
$$\begin{aligned} d_{TL^p}\big ((\mu ,f),(\nu ,g)\big ) = \min _{\pi \in \Pi (\mu ,\nu )} \left( \int _{{\overline{\Omega }}\times {\overline{\Omega }}} |x-y|^p + |f(x) - g(y)|^p \, d\pi (x,y) \right) ^{1/p}, \end{aligned}$$
where \(\Pi (\mu ,\nu )\) is the set of transportation plans defined in Sect. A.2. We note that the minimizing \(\pi \) exists and that the \(TL^p\) space is a metric space [20].
As shown in [20], when \(\mu \) is absolutely continuous with respect to the Lebesgue measure on \(\Omega \), the distance can be rewritten using transportation maps T instead of transportation plans,
$$\begin{aligned} d_{TL^p}\big ((\mu ,f),(\nu ,g)\big ) = \min _{T :\, T_{\#}\mu = \nu } \left( \int _{\Omega } |x - T(x)|^p + |f(x) - g(T(x))|^p \, d\mu (x) \right) ^{1/p}, \end{aligned}$$
where the push forward \(T_{\#} \mu \) is defined in Sect. A.2. This formula provides an interpretation of the distance in our setting. Namely, to compare a function \(u_n: {{\mathcal {X}}}_n\rightarrow {\mathbb {R}}\) with f we use a mapping \(T_n:\Omega \rightarrow {{\mathcal {X}}}_n\) and compare the functions \(\widetilde{f}_n = u_n\circ T_n\) and f in \(L^p(\mu )\), while also accounting for the transport, namely the \(|x - T_n(x)|^p\) term.
We remark that the \(TL^p({\overline{\Omega }})\) space is not complete; its completion was discussed in [20]. In the setting of this paper, since the corresponding measures are clear from context, we often say that \(u_n\) converges in \(TL^p\) to f as shorthand for \((\mu _n, u_n)\) converging in \(TL^p\) to \((\mu ,f)\).
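For two equal-size discrete configurations the \(TL^p\) distance can be computed exactly, since the minimum over plans between two empirical measures with n atoms each is attained at a permutation (an extreme point of the doubly stochastic matrices). The brute-force sketch below (names ours, feasible only for small n) illustrates the metric:

```python
from itertools import permutations

# Brute-force sketch (ours) of the TL^p distance between two discrete
# configurations (mu_n, u) and (nu_n, v), where mu_n and nu_n are
# empirical measures of n points each.  For equal-size empirical
# measures the optimal plan may be taken to be a permutation, so for
# small n we can minimize over all n! matchings directly.

def tlp_distance(xs, u, ys, v, p=2):
    n = len(xs)
    assert len(u) == n and len(ys) == n and len(v) == n
    best = min(
        sum(abs(xs[i] - ys[s[i]]) ** p + abs(u[i] - v[s[i]]) ** p
            for i in range(n))
        for s in permutations(range(n))
    )
    return (best / n) ** (1.0 / p)

# Same base points, function values shifted by 1: the distance is 1.
d = tlp_distance([0.0, 1.0, 2.0], [0.0, 0.0, 0.0],
                 [0.0, 1.0, 2.0], [1.0, 1.0, 1.0], p=2)  # d == 1.0
```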
1.4 Local Estimates for Weighted Laplacian
Lemma A.4
There exists \(C>0\) such that for each \( u \in H^1(B(0,1))\) there exists \( v \in H^1(B(0,1))\) such that
where the value on the boundary is considered in the sense of the \(L^2(\partial B(0,1))\) trace.
Proof
Let
Let \(\phi \in C^\infty ([0, 1] , [0,1])\) be such that \(\phi (r) = 1\) for all \(r \in [0, \frac{1}{2}]\) and \(\phi (1)=0\). Let \(M= \max _{r \in [0,1]} |\phi '(r)|\). Let
By the Poincaré inequality stated in Theorem 13.27 of [32], there exists \(C_1 >0\), independent of u, such that
Using the Poincaré inequality we obtain, for \(C =2+2MC_1\),
\(\square \)
Calder, J., Slepčev, D. Properly-Weighted Graph Laplacian for Semi-supervised Learning. Appl Math Optim 82, 1111–1159 (2020). https://doi.org/10.1007/s00245-019-09637-3