×

Landmark diffusion maps (L-dMaps): accelerated manifold learning out-of-sample extension. (English) Zbl 1489.68228

Summary: Diffusion maps are a nonlinear manifold learning technique based on harmonic analysis of a diffusion process over the data. Out-of-sample extensions with computational complexity \(\mathcal{O}(N)\), where \(N\) is the number of points comprising the manifold, frustrate applications to online learning applications requiring rapid embedding of high-dimensional data streams. We propose landmark diffusion maps (L-dMaps) to reduce the complexity to \(\mathcal{O}(M)\), where \(M \ll N\) is the number of landmark points selected using pruned spanning trees or k-medoids. Offering \((N / M)\) speedups in out-of-sample extension, L-dMaps enable the application of diffusion maps to high-volume and/or high-velocity streaming data. We illustrate our approach on three datasets: the Swiss roll, molecular simulations of a C\(_{24}\)H\(_{50}\) polymer chain, and biomolecular simulations of alanine dipeptide. We demonstrate up to 50-fold speedups in out-of-sample extension for the molecular systems with less than 4% errors in manifold reconstruction fidelity relative to calculations over the full dataset.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
05C50 Graphs and linear algebra (matrices, eigenvalues, etc.)
42B35 Function spaces arising in harmonic analysis
60J60 Diffusion processes
62R30 Statistics on manifolds
68Q25 Analysis of algorithms and problem complexity
92-10 Mathematical modeling or simulation for problems pertaining to biology

References:

[1] Cho, M.; Lee, J.; Lee, K. M., Reweighted random walks for graph matching, (11th European Conference on Computer Vision (2010)), 492-505
[2] Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J., Application of dimensionality reduction in recommender systems – a case study, (Proceedings of the ACM WebKDD 2000 Web Mining for E-Commerce Workshop. Proceedings of the ACM WebKDD 2000 Web Mining for E-Commerce Workshop, Boston, MA (2000))
[3] Patcha, A.; Park, J.-M., An overview of anomaly detection techniques: existing solutions and latest technological trends, Comput. Netw., 51, 3448-3470 (2007)
[4] Das, P.; Moll, M.; Stamati, H.; Kavraki, L. E.; Clementi, C., Low-dimensional free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction, Proc. Natl. Acad. Sci. USA, 103, 9885-9890 (2006)
[5] Transtrum, M. K.; Machta, B. B.; Brown, K. S.; Daniels, B. C.; Myers, C. R.; Sethna, J. P., Perspective: sloppiness and emergent theories in physics, biology, and beyond, J. Chem. Phys., 143, Article 010901 pp. (2015)
[6] Machta, B. B.; Chachra, R.; Transtrum, M. K.; Sethna, J. P., Parameter space compression underlies emergent theories and predictive models, Science, 342, 604-607 (2013)
[7] Ferguson, A. L.; Panagiotopoulos, A. Z.; Debenedetti, P. G.; Kevrekidis, I. G., Systematic determination of order parameters for chain dynamics using diffusion maps, Proc. Natl. Acad. Sci. USA, 107, 13597-13602 (2010)
[8] Zwanzig, R., Nonequilibrium Statistical Mechanics (2001), Oxford University Press: Oxford University Press New York · Zbl 1267.82001
[9] Coifman, R. R.; Kevrekidis, I. G.; Lafon, S.; Maggioni, M.; Nadler, B., Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems, Multiscale Model. Simul., 7, 842-864 (2008) · Zbl 1175.60058
[10] Peña, D.; Poncela, P., Dimension reduction in multivariate time series, (Balakrishnan, N.; Sarabia, J. M.; Castillo, E., Advances in Distribution Theory, Order Statistics, and Inference (2006), Birkhäuser Boston: Birkhäuser Boston Boston, MA), 433-458 · Zbl 05196687
[11] Ferguson, A. L.; Zhang, S.; Dikiy, I.; Panagiotopoulos, A. Z.; Debenedetti, P. G.; Link, A. J., An experimental and computational investigation of spontaneous lasso formation in microcin J25, Biophys. J., 99, 3056-3065 (2010)
[12] Linden, G.; Smith, B.; York, J., Amazon.com recommendations: item-to-item collaborative filtering, IEEE Int. Comput., 7, 76-80 (2003)
[13] Jolliffe, I. T., Principal component analysis, (Principal Component Analysis (2002), Springer: Springer New York) · Zbl 1011.62064
[14] Borg, I.; Groenen, P. J., Modern Multidimensional Scaling: Theory and Applications (2005), Springer: Springer New York · Zbl 1085.62079
[15] Bingham, E.; Mannila, H., Random projection in dimensionality reduction: applications to image and text data, (Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY (2001)), 245-250
[16] Tenenbaum, J. B.; De Silva, V.; Langford, J. C., A global geometric framework for nonlinear dimensionality reduction, Science, 290, 2319-2323 (2000)
[17] Roweis, S. T.; Saul, L. K., Nonlinear dimensionality reduction by locally linear embedding, Science, 290, 2323-2326 (2000)
[18] Coifman, R. R.; Lafon, S., Diffusion maps, Appl. Comput. Harmon. Anal., 21, 5-30 (2006) · Zbl 1095.68094
[19] Coifman, R. R.; Lafon, S.; Lee, A. B.; Maggioni, M.; Nadler, B.; Warner, F.; Zucker, S. W., Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proc. Natl. Acad. Sci. USA, 102, 7426-7431 (2005) · Zbl 1405.42043
[20] Nadler, B.; Lafon, S.; Coifman, R. R.; Kevrekidis, I. G., Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, Adv. Neural Inf. Process. Syst., 18, 955-962 (2006)
[21] Ferguson, A. L.; Panagiotopoulos, A. Z.; Kevrekidis, I. G.; Debenedetti, P. G., Nonlinear dimensionality reduction in molecular simulation: the diffusion map approach, Chem. Phys. Lett., 509, 1-11 (2011)
[22] Mansbach, R. A.; Ferguson, A. L., Machine learning of single molecule free energy surfaces and the impact of chemistry and environment upon structure and dynamics, J. Chem. Phys., 142, Article 105101 pp. (2015)
[23] Long, A. W.; Zhang, J.; Granick, S.; Ferguson, A. L., Machine learning assembly landscapes from particle tracking data, Soft Matter, 11, 8141-8153 (2015)
[24] Coifman, R.; Shkolnisky, Y.; Sigworth, F.; Singer, A., Graph Laplacian tomography from unknown random projections, IEEE Trans. Image Process., 17, 1891-1899 (2008) · Zbl 1372.94055
[25] Gepshtein, S.; Keller, Y., Image completion by diffusion maps and spectral relaxation, IEEE Trans. Image Process., 22, 2983-2994 (2013) · Zbl 1373.94137
[26] Hu, J.; Ferguson, A. L., Global graph matching using diffusion maps, Intell. Data Anal., 20, 637-654 (2016)
[27] Pan, V. Y.; Chen, Z. Q., The complexity of the matrix eigenproblem, (Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing. Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, New York, NY, USA (1999)), 507-516 · Zbl 1346.68103
[28] Belkin, M.; Niyogi, P., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput., 15, 1373-1396 (2003) · Zbl 1085.68119
[29] Kao, M.-Y., Encyclopedia of Algorithms (2008), Springer Science & Business Media · Zbl 1149.68078
[30] Golub, G.; Van Loan, C., Matrix Computations, Johns Hopkins Stud. Math. Sci. (2013), Johns Hopkins University Press · Zbl 1268.65037
[31] Bechtold, T.; Rudnyi, E. B.; Korvink, J. G., Fast Simulation of Electro-Thermal MEMS: Efficient Dynamic Compact Models (2006), Springer-Verlag: Springer-Verlag Berlin, Heidelberg, Germany
[32] Larsen, R. M., Lanczos Bidiagonalization with Partial Reorthogonalization (1998), DAIMI PB-357 Technical Report
[33] Bengio, Y.; Paiement, J.-F.; Vincent, P.; Delalleau, O.; Le Roux, N.; Ouimet, M., Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering, Adv. Neural Inf. Process. Syst., 16, 177-184 (2004)
[34] Aizenbud, Y.; Bermanis, A.; Averbuch, A., PCA-based out-of-sample extension for dimensionality reduction (2015)
[35] Rabin, N.; Coifman, R. R., Heterogeneous datasets representation and learning using diffusion maps and Laplacian pyramids, (Proceedings of the 2012 SIAM International Conference on Data Mining (2012)), 189-199
[36] Fowlkes, C.; Belongie, S.; Chung, F.; Malik, J., Spectral grouping using the Nystrom method, IEEE Trans. Pattern Anal. Mach. Intell., 26, 214-225 (2004)
[37] Lafon, S.; Keller, Y.; Coifman, R. R., Data fusion and multicue data matching by diffusion maps, IEEE Trans. Pattern Anal. Mach. Intell., 28, 1784-1797 (2006)
[38] Baker, C. T.H., The Numerical Treatment of Integral Equations, vol. 13 (1977), Clarendon Press: Clarendon Press Oxford · Zbl 0373.65060
[39] Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; Stolfo, S., A geometric framework for unsupervised anomaly detection, (Applications of Data Mining in Computer Security, vol. 6 (2002), Springer), 77-101
[40] Mahoney, A.; Bross, J.; Johnson, D., Deformable robot motion planning in a reduced-dimension configuration space, (2010 IEEE International Conference on Robotics and Automation (2010)), 5133-5138
[41] Chen, Y. F.; Liu, S.-Y.; Liu, M.; Miller, J.; How, J. P., Motion planning with diffusion maps, (2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (2016))
[42] Voter, A. F., Parallel replica method for dynamics of infrequent events, Phys. Rev. B, 57, Article R13985 pp. (1998)
[43] Allen, R. J.; Valeriani, C.; ten Wolde, P. R., Forward flux sampling for rare event simulations, J. Phys. Condens. Matter, 21, Article 463102 pp. (2009)
[44] Escobedo, F. A.; Borrero, E. E.; Araque, J. C., Transition path sampling and forward flux sampling. Applications to biological systems, J. Phys. Condens. Matter, 21, Article 333101 pp. (2009)
[45] de Silva, V.; Tenenbaum, J. B., Global versus local methods in nonlinear dimensionality reduction, Adv. Neural Inf. Process. Syst., 15, 721-728 (2003)
[46] Silva, J.; Marques, J.; Lemos, J., Selecting landmark points for sparse manifold learning, Adv. Neural Inf. Process. Syst., 18, 1241-1248 (2006)
[47] Singer, A., A remark on global positioning from local distances, Proc. Natl. Acad. Sci. USA, 105, 9507-9511 (2008) · Zbl 1205.86043
[48] Lei, Y.-K.; Xu, Y.; Zhang, S.-W.; Wang, S.-L.; Ding, Z.-G., Fast ISOMAP based on minimum set coverage, (Huang, D.-S.; Zhang, X.; Reyes García, C. A.; Zhang, L., Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence (2010), Springer Berlin Heidelberg: Springer Berlin Heidelberg Berlin, Heidelberg), 173-179 · Zbl 1194.68042
[49] Shi, H.; Yin, B.; Bao, Y.; Lei, Y.-K., A novel landmark point selection method for L-ISOMAP, (2016 12th IEEE International Conference on Control and Automation (ICCA) (2016)), 621-625
[50] Wang, J.; Ferguson, A. L., Nonlinear reconstruction of single-molecule free-energy surfaces from univariate time series, Phys. Rev. E, 93, Article 032412 pp. (2016)
[51] Zheng, W.; Rohrdanz, M. A.; Clementi, C., Rapid exploration of configuration space with diffusion-map-directed molecular dynamics, J. Phys. Chem. B, 117, 12769-12776 (2013)
[52] Preto, J.; Clementi, C., Fast recovery of free energy landscapes via diffusion-map-directed molecular dynamics, Phys. Chem. Chem. Phys., 16, 19181-19191 (2014)
[53] Chiavazzo, E.; Covino, R.; Coifman, R. R.; Gear, C. W.; Georgiou, A. S.; Hummer, G.; Kevrekidis, I. G., Intrinsic map dynamics exploration for uncharted effective free-energy landscapes, Proc. Natl. Acad. Sci., 114, E5494-E5503 (2017)
[54] Nadler, B.; Lafon, S.; Coifman, R. R.; Kevrekidis, I. G., Diffusion maps, spectral clustering and reaction coordinates of dynamical systems, Appl. Comput. Harmon. Anal., 21, 113-127 (2006) · Zbl 1103.60069
[55] Sonday, B. E.; Haataja, M.; Kevrekidis, I. G., Coarse-graining the dynamics of a driven interface in the presence of mobile impurities: effective description via diffusion maps, Phys. Rev. E, 80, Article 031102 pp. (2009)
[56] Ferguson, A. L.; Panagiotopoulos, A. Z.; Debenedetti, P. G.; Kevrekidis, I. G., Integrating diffusion maps with umbrella sampling: application to alanine dipeptide, J. Chem. Phys., 134, Article 135103 pp. (2011)
[57] Cormen, T. H., Introduction to Algorithms (2009), MIT Press: MIT Press Cambridge, MA · Zbl 1187.68679
[58] Prim, R. C., Shortest connection networks and some generalizations, Bell Syst. Tech. J., 36, 1389-1401 (1957)
[59] Von Luxburg, U., A tutorial on spectral clustering, Stat. Comput., 17, 395-416 (2007)
[60] Frey, B. J.; Dueck, D., Clustering by passing messages between data points, Science, 315, 972-976 (2007) · Zbl 1226.94027
[61] Day, W. H.; Edelsbrunner, H., Efficient algorithms for agglomerative hierarchical clustering methods, J. Classification, 1, 7-24 (1984) · Zbl 0563.62034
[62] Park, H.-S.; Jun, C.-H., A simple and fast algorithm for k-medoids clustering, Expert Syst. Appl., 36, 3336-3341 (2009)
[63] Arthur, D.; Vassilvitskii, S., k-means++: the advantages of careful seeding, (Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (2007)), 1027-1035 · Zbl 1302.68273
[64] Deif, A. S., Rigorous perturbation bounds for eigenvalues and eigenvectors of a matrix, J. Comput. Appl. Math., 57, 403-412 (1995) · Zbl 0823.15017
[65] Hummer, G.; Kevrekidis, I. G., Coarse molecular dynamics of a peptide fragment: free energy, kinetics, and long-time dynamics computations, J. Chem. Phys., 118, 10762-10773 (2003)
[66] Chodera, J. D.; Swope, W. C.; Pitera, J. W.; Dill, K. A., Long-time protein folding dynamics from short-time molecular dynamics simulations, Multiscale Model. Simul., 5, 1214-1226 (2006) · Zbl 1133.92011
[67] Ma, A.; Dinner, A. R., Automatic method for identifying reaction coordinates in complex systems, J. Phys. Chem. B, 109, 6769-6779 (2005)
[68] Stamati, H.; Clementi, C.; Kavraki, L. E., Application of nonlinear dimensionality reduction to characterize the conformational landscape of small peptides, Proteins, Struct. Funct. Bioinform., 78, 223-235 (2010)
[69] Michielssens, S.; van Erp, T. S.; Kutzner, C.; Ceulemans, A.; de Groot, B. L., Molecular dynamics in principal component space, J. Phys. Chem. B, 116, 8350-8354 (2012)
[70] Chodera, J. D.; Singhal, N.; Pande, V. S.; Dill, K. A.; Swope, W. C., Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics, J. Chem. Phys., 126, Article 155101 pp. (2007)
[71] Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J.C., GROMACS: fast, flexible, and free, J. Comput. Chem., 26, 1701-1718 (2005)
[72] Martin, M. G.; Siepmann, J. I., Transferable potentials for phase equilibria. 1. United-atom description of n-alkanes, J. Phys. Chem. B, 102, 2569-2577 (1998)
[73] Berendsen, H. J.; Postma, J. P.; van Gunsteren, W. F.; Hermans, J., Interaction models for water in relation to protein hydration, (Intermolecular Forces (1981), Springer), 331-342
[74] Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L., Comparison of simple potential functions for simulating liquid water, J. Chem. Phys., 79, 926-935 (1983)
[75] Kaminski, G. A.; Friesner, R. A.; Tirado-Rives, J.; Jorgensen, W. L., Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides, J. Phys. Chem. B, 105, 6474-6487 (2001)
[76] Jorgensen, W. L.; Tirado-Rives, J., The OPLS [optimized potentials for liquid simulations] potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin, J. Amer. Chem. Soc., 110, 1657-1666 (1988)
[77] Muja, M.; Lowe, D. G., Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell., 36, 2227-2240 (2014)
[78] McQueen, J.; Meila, M.; VanderPlas, J.; Zhang, Z., Megaman: scalable manifold learning in Python, J. Mach. Learn. Res., 17, 1-5 (2016) · Zbl 1393.68155
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.