High-Dimensional Dataset Simplification by Laplace-Beltrami Operator

  • Conference paper in Advances in Computer Graphics (CGI 2021)

Part of the book series: Lecture Notes in Computer Science, volume 13002

Abstract

With the development of the Internet and other digital technologies, the speed of data generation has become considerably faster than the speed of data processing. Because big data typically contain a large amount of redundant information, it is possible to significantly simplify a big dataset while preserving its key information. In this paper, we develop a high-dimensional (HD) dataset simplification method based on the eigenvalues and eigenfunctions of the Laplace-Beltrami operator (LBO). Specifically, given a dataset that can be considered as an unorganized data point set in an HD space, a discrete LBO defined on the HD dataset is constructed, and its eigenvalues and eigenvectors are calculated. Then, the local extremum points and saddle points of the eigenvectors are taken as the feature points of the HD dataset, which constitute the simplified dataset. Moreover, we develop feature point detection methods for functions defined on an unorganized data point set in HD space, and devise metrics for measuring the fidelity of the simplified dataset to the original set. Finally, examples and applications are presented to validate the efficiency and effectiveness of the proposed methods, showing that the developed HD dataset simplification method makes it feasible to process as large a dataset as possible with limited data processing capability.


References

  1. Albattah, W.: The role of sampling in big data analysis. In: Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–5 (2016)

  2. Aoyama, H.: A study of the stratified random sampling. Ann. Inst. Stat. Math. 6(1), 1–36 (1954)

  3. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22(4), 469–483 (1996)

  4. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)

  5. Belkin, M., Sun, J., Wang, Y.: Constructing Laplace operator from point clouds in \(\mathbb{R}^d\). In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1031–1040. SIAM (2009)

  6. Bengio, Y., Paiement, J.F., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, pp. 177–184 (2004)

  7. Benko, K., Kothe, M., Semmler, K.D., Simon, U.: Eigenvalues of the Laplacian and curvature. In: Colloquium Mathematicum, vol. 42, pp. 19–31. Institute of Mathematics, Polish Academy of Sciences (1979)

  8. Carr, H., Duke, D.J.: Joint contour nets. IEEE Trans. Vis. Comput. Graph. 20(8), 1100–1113 (2014)

  9. Carr, H., Snoeyink, J., Van De Panne, M.: Simplifying flexible isosurfaces using local geometric measures. In: IEEE Visualization 2004, pp. 497–504 (2004)

  10. Castellani, U., Cristani, M., Fantoni, S., Murino, V.: Sparse points matching by combining 3D mesh saliency with statistical descriptors. Comput. Graph. Forum 27(2), 643–652 (2008)

  11. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  12. Dong, S., Bremer, P.T., Garland, M., Pascucci, V., Hart, J.C.: Spectral surface quadrangulation. ACM Trans. Graph. 25(3), 1057–1066 (2006)

  13. Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Proceedings of the 12th International Conference on Pattern Recognition, vol. 1, pp. 566–568 (1994)

  14. Edelsbrunner, H., Harer, J., Patel, A.K.: Reeb spaces of piecewise linear mappings. In: Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, pp. 242–250 (2008)

  15. Gyulassy, A., Bremer, P.T., Hamann, B., Pascucci, V.: A practical approach to Morse-Smale complex computation: scalability and generality. IEEE Trans. Vis. Comput. Graph. 14(6), 1619–1626 (2008)

  16. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)

  17. Körtgen, M., Park, G.J., Novotni, M., Klein, R.: 3D shape matching with 3D shape contexts. In: The 7th Central European Seminar on Computer Graphics, vol. 3, pp. 5–17. Budmerice (2003)

  18. Liu, Z., Zhang, A.: Sampling for big data profiling: a survey. IEEE Access, p. 1 (2020)

  19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)

  20. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2480–2487 (2012)

  21. Mahmud, M., Huang, J., Salloum, S., Emara, T., Sadatdiynov, K.: A survey of data partitioning and sampling methods to support big data analysis. Big Data Min. Anal. 3, 85–101 (2020)

  22. Nassiuma, D.K.: Survey Sampling: Theory and Methods. Nairobi University Press, Nairobi (2001)

  23. Ovsjanikov, M., Mérigot, Q., Mémoli, F., Guibas, L.: One point isometric matching with the heat kernel. Comput. Graph. Forum 29(5), 1555–1564 (2010)

  24. Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: RED-Net: a recurrent encoder-decoder network for video-based face alignment. Int. J. Comput. Vision 126(10), 1103–1119 (2018)

  25. Pérez-Cruz, F.: Kullback-Leibler divergence estimation of continuous distributions. In: 2008 IEEE International Symposium on Information Theory, pp. 1666–1670 (2008)

  26. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Netw. 11(4), 761–767 (1998)

  27. Reeb, G.: Sur les points singuliers d'une forme de Pfaff complètement intégrable ou d'une fonction numérique. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences 222, 847–849 (1946)

  28. Reuter, M., Wolter, F.E., Peinecke, N.: Laplace-Beltrami spectra as ‘shape-DNA’ of surfaces and solids. Comput.-Aided Des. 38(4), 342–366 (2006)

  29. Singh, G., Mémoli, F., Carlsson, G.E.: Topological methods for the analysis of high dimensional data sets and 3D object recognition. In: SPBG, pp. 91–100 (2007)

  30. Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. Comput. Graph. Forum 28(5), 1383–1392 (2009)

  31. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483 (2013)

  32. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)

  33. Vallet, B., Lévy, B.: Spectral geometry processing with manifold harmonics. Comput. Graph. Forum 27(2), 251–260 (2008)

  34. Wang, N., Gao, X., Tao, D., Yang, H., Li, X.: Facial feature point detection: a comprehensive survey. Neurocomputing 275, 50–65 (2018)

  35. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61872316, 61932018, and the National Key R&D Plan of China under Grant No. 2020YFB1708900.

Author information

Corresponding author: Hongwei Lin.

Appendices

Appendix 1. The Detailed Calculation of Metrics

KL-divergence-Based Metric. KL-divergence, also called relative entropy, is typically employed to measure the difference between two discrete probability distributions \(P = \{p_{i}\}\) and \(Q = \{q_{i}\}\), i.e., \( D_{KL}(P\Vert Q) = \sum _{i} p_{i} \log \left( \frac{p_{i}}{q_{i}}\right) \). To define the KL-divergence-based metric, the original dataset \(\mathcal {H}\) and the simplified dataset \(\mathcal {S}\) are represented as matrices \(M_{\mathcal {H}} \in \mathbb {R}^{n \times d}\) and \(M_{\mathcal {S}}\in \mathbb {R}^{m \times d}\), where n and m \((n > m)\) are the numbers of data points in \(\mathcal {H}\) and \(\mathcal {S}\), respectively, and each row stores a data point in \(\mathbb {R}^d\). First, we calculate the probability distribution \(P^{(k)} = \{p^{(k)}_i, i = 1,2,\cdots ,l\}\) for the \(k^{th}\) dimension of the data points in \(\mathcal {H}\), corresponding to the \(k^{th}\) column of matrix \(M_{\mathcal {H}}\). To do this, the elements of the \(k^{th}\) column of \(M_{\mathcal {H}}\) are normalized into the interval [0, 1], which is divided into l subintervals (in our implementation, we set \(l=100\)). Counting the number of elements of the \(k^{th}\) column whose values fall in each of the l subintervals yields a value distribution histogram, which is then used as the probability distribution \(P^{(k)} = \{p^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of \(M_{\mathcal {H}}\). Similarly, the probability distribution \(Q^{(k)} = \{q^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of \(M_{\mathcal {S}}\) is calculated. Then, the KL-divergence between the \(k^{th}\) columns of \(M_{\mathcal {H}}\) and \(M_{\mathcal {S}}\) is \(D_{KL}(P^{(k)}\Vert Q^{(k)}) = \sum _{i} p^{(k)}_{i} \log \left( \frac{p^{(k)}_{i}}{q^{(k)}_{i}}\right) \), and the KL-divergence-based metric is defined as

$$\begin{aligned} d_{KL}(\mathcal {H}, \mathcal {S}) = \sqrt{\sum _{k=1}^d (D_{KL}(P^{(k)}\Vert Q^{(k)}))^2}. \end{aligned}$$
(6)

The KL-divergence-based metric (6) transforms the HD dataset into its probability distribution, and measures the difference between the probability distributions of the original and simplified datasets. As the simplified dataset \(\mathcal {S}\) approaches the original dataset \(\mathcal {H}\), the KL-divergence-based metric \(d_{KL}\) decreases; when \(\mathcal {S} = \mathcal {H}\), \(d_{KL} = 0\).
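For concreteness, the following is a minimal NumPy sketch of this computation. It assumes both datasets are given as row-wise matrices and, since the text does not specify it, normalizes each column of the simplified data with the value range of the corresponding original column so that the two histograms share the same bins; the small eps guarding empty bins is likewise an implementation choice.

```python
import numpy as np

def kl_metric(M_H, M_S, l=100, eps=1e-12):
    """KL-divergence-based metric d_KL(H, S) of Eq. (6).

    M_H: (n, d) original dataset; M_S: (m, d) simplified dataset.
    Each column is normalized to [0, 1] and binned into l subintervals.
    """
    d = M_H.shape[1]
    total = 0.0
    for k in range(d):
        # Normalize both columns with the range of the original column,
        # so the two histograms are built over identical subintervals
        # (an assumption; the paper only states normalization to [0, 1]).
        lo, hi = M_H[:, k].min(), M_H[:, k].max()
        span = hi - lo if hi > lo else 1.0
        p, _ = np.histogram((M_H[:, k] - lo) / span, bins=l, range=(0.0, 1.0))
        q, _ = np.histogram((M_S[:, k] - lo) / span, bins=l, range=(0.0, 1.0))
        p = p / p.sum()
        q = q / q.sum()
        # D_KL(P^(k) || Q^(k)); eps avoids log(0) for empty bins.
        d_kl = np.sum(p * np.log((p + eps) / (q + eps)))
        total += d_kl ** 2
    return np.sqrt(total)
```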

Hausdorff Distance. The Hausdorff distance measures how far two sets are from each other. Suppose \(\mathcal {X}\) and \(\mathcal {Y}\) are two datasets in d-dimensional space, and denote by \(D(\mathcal {X},\mathcal {Y}) = \sup _{x\in \mathcal {X}}\inf _{y\in \mathcal {Y}} d(x,y)\) the maximum distance from a point in set \(\mathcal {X}\) to the set \(\mathcal {Y}\), where d(x, y) is the Euclidean distance between two points x and y. The Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) is defined as \( d_{H}(\mathcal {H},\mathcal {S}) = \max \{D(\mathcal {H},\mathcal {S}),D(\mathcal {S},\mathcal {H})\}\). Because the simplified dataset \(\mathcal {S}\) is a subset of the original dataset \(\mathcal {H}\), \(D(\mathcal {S},\mathcal {H}) = 0\). Therefore, the Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) can be calculated by

$$\begin{aligned} d_{H}(\mathcal {H},\mathcal {S}) = D(\mathcal {H},\mathcal {S}) = \sup _{x\in \mathcal {H}}\inf _{y\in \mathcal {S}} d(x,y). \end{aligned}$$
(7)

That is, the Hausdorff distance is the maximum distance from an arbitrary point in the original dataset \(\mathcal {H}\) to the simplified dataset \(\mathcal {S}\).
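A small sketch of Eq. (7) follows, assuming the simplified set is a subset of the original set so that only the one-sided distance \(D(\mathcal {H},\mathcal {S})\) is needed; the use of scipy's cKDTree for nearest-neighbor queries is an implementation choice, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def hausdorff_metric(H, S):
    """One-sided Hausdorff distance of Eq. (7):
    d_H(H, S) = sup_{x in H} inf_{y in S} d(x, y),
    valid because the simplified set S is a subset of H."""
    tree = cKDTree(S)
    # Distance from every original point to its nearest simplified point.
    nearest_dist, _ = tree.query(H, k=1)
    return nearest_dist.max()
```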

Volume of Convex Hull. As the third metric, we use the difference of the volumes of the convex hulls to measure the fidelity of \(\mathcal {S}\) to \(\mathcal {H}\), i.e.,

$$\begin{aligned} d_{V}(\mathcal {H},\mathcal {S}) = \left| V(\mathcal {H}_{conv})-V(\mathcal {S}_{conv}) \right| , \end{aligned}$$
(8)

where \(\mathcal {H}_{conv}\) and \(\mathcal {S}_{conv}\) are the convex hulls of \(\mathcal {H}\) and \(\mathcal {S}\), and \(V(\mathcal {H}_{conv})\) and \(V(\mathcal {S}_{conv})\) are their volumes. Because the volume of a convex hull in an HD space is difficult to calculate, we first perform the DR operation [4] to project the datasets into 3D space. Afterwards, we employ the quickhull algorithm [3] to construct the convex hulls of the two datasets, and calculate their volumes correspondingly.
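Below is a hedged sketch of the convex-hull-volume metric (8). The 3D spectral embedding is only a stand-in for the Laplacian-eigenmaps DR of [4], and embedding the whole dataset once (so both hulls live in the same coordinates, with the simplified set identified by its row indices) is our assumption; scipy's ConvexHull implements the quickhull algorithm of [3].

```python
import numpy as np
from scipy.spatial import ConvexHull             # quickhull [3]
from sklearn.manifold import SpectralEmbedding   # stand-in for the DR of [4]

def convex_hull_metric(H, S_idx):
    """Convex-hull-volume metric d_V(H, S) of Eq. (8).

    H: (n, d) original dataset; S_idx: row indices of the simplified points
    in H (the simplified set is a subset of the original set).
    """
    # Project the whole dataset into 3D once, so both hulls share coordinates.
    emb = SpectralEmbedding(n_components=3).fit_transform(H)
    vol_H = ConvexHull(emb).volume
    vol_S = ConvexHull(emb[S_idx]).volume
    return abs(vol_H - vol_S)
```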

Appendix 2. The Simplification Procedures on MNIST and Human Face Dataset

Fig. 6. Simplification of the MNIST dataset. (a–d) The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, and 30 LBO eigenvectors, consisting of 82, 120, 196, and 280 data points, respectively. (e, f) The DR result (using t-SNE) of the original dataset, illustrated by colored digits (e) and by the original images (f). (Color figure online)

MNIST: For clarity of the visual analysis, we selected 900 gray-scale images of the digits ‘0’ to ‘4’ from the MNIST dataset, and resized each image to \(8 \times 8\). The selected images were projected onto the 2D plane using t-SNE. Whereas in Fig. 6(e) colored digits are placed at the projected 2D points, in Fig. 6(f) the images themselves are located at the projected points. The simplified datasets are displayed in Figs. 6(a)–6(d). For illustration, their positions in the 2D plane are taken from the dimensionality-reduced dataset (Fig. 6(e)). In the simplified dataset in Fig. 6(a), which has 82 data points (approximately \(9\%\) of the original dataset), all five digit categories are represented. In the simplified dataset in Fig. 6(c), which has 196 data points (approximately \(22\%\) of the original dataset), virtually all of the key features of the dataset are represented, including the subset of digit ‘2’ inside the class of digit ‘1’. In Fig. 6(d), the simplified dataset contains 280 data points (approximately \(31\%\) of the original dataset). We can observe that the boundary of each class of data points has formed (Fig. 6(d)), and that the geometric shape of each class is similar to that of the corresponding class in the original dataset (Fig. 6(e)).

Fig. 7. The simplified datasets of the human face dataset, after DR by PCA. Each simplified dataset comprises the images marked with red frames. (a–e) The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, 30, and 50 LBO eigenvectors, consisting of 16, 40, 95, 150, and 270 data points, respectively. (f) The feature points of the first 5 eigenvectors. (Color figure online)

Human Face: The human face dataset contains 698 gray-scale images (\(64 \times 64\)) of a plaster head sculpture. The images were captured by varying three parameters: the up-down pose, the left-right pose, and the lighting direction. Therefore, the images in the dataset can be regarded as being distributed on a 3D manifold. As illustrated in Fig. 7, after the human face dataset is reduced to a 2D plane using PCA, it can be observed that the longitudinal axis represents the sculpture's rotation in the left-right direction, and the horizontal axis indicates the change of lighting.

In Fig. 7, the images marked with red frames constitute the simplified dataset. In the simplified dataset in Fig. 7(a), the critical points at the boundary are captured. In Figs. 7(b)–7(d), increasingly more critical points at the boundary and in the interior are added to the simplified datasets. In Fig. 7(e), virtually all of the images at the boundary are included in the simplified set. For clarity, we display the feature points of the first five LBO eigenvectors in Fig. 7(f), which indeed capture several ‘extreme’ images in the human face dataset. Specifically, the two feature points of the first eigenvector, \(\phi _1\), are the images with ‘leftmost pose + most lighting’ and ‘rightmost pose + most lighting’. In the feature point set of the second eigenvector, \(\phi _2\), an image with ‘leftmost pose + least lighting’ is added. Similarly, the feature point sets of the eigenvectors \(\phi _3, \phi _4\), and \(\phi _5\) contain more ‘extreme’ images in terms of pose and lighting (Fig. 7(f)).

Fig. 8. The simplified datasets of the human face dataset generated by the SRS method, after DR by PCA. Each simplified dataset comprises the images marked with red frames. (a–e) The simplified datasets consisting of 16, 40, 95, 150, and 270 data points, respectively. (Color figure online)

Appendix 3. Additional Comparisons

To compare the two methods in more detail, the simplified datasets generated by the SRS method with the best performance over 200 sampling trials are visualized in Fig. 8, after DR by PCA. Compared with these SRS results, the simplified datasets generated by our method (Fig. 7) contain more data points on the boundaries. Because boundary points generally carry more information about the shape of a dataset than interior points, the simplified datasets generated by our method preserve more of the important information than those generated by the SRS method.

Compared with other existing statistical sampling methods, the proposed method can be applied to a broader range of dataset types. Some sampling methods, such as stratified sampling [2], ‘Tomek links’ [32], and ‘SMOTE’ [11], require data classification labels as supplementary information; when such labels are missing, these methods cannot be applied. Compared with topological dataset simplification methods, such as Reeb spaces [14], joint contour nets [8], and Mapper [29], our method also has clear advantages. With our method, the size of the simplified dataset is easier to control than with the topological methods. More importantly, in some cases topological methods may generate new data points whose ‘legality’ cannot be guaranteed. For example, a simplified human face dataset obtained by a topological method cannot guarantee that each data point is a ‘correct’ plaster photo. Therefore, topological simplification methods are suitable for extracting the topological structure of a given dataset, but unsuitable for generating simplified datasets with a specified number of points.

Appendix 4. Additional Applications

Speedup of DR: Running DR algorithms on an HD dataset is typically computationally expensive. Given an original dataset \(\mathcal {H}\), if we first perform DR on the simplified dataset \(\mathcal {S}\), and then employ the ‘out-of-sample’ algorithm [6] to calculate the DR coordinates of the remaining data points, the computational cost can be reduced significantly. Because the simplified dataset \(\mathcal {S}\) captures the feature points of the original dataset, the result generated by DR on the simplified set plus the ‘out-of-sample’ extension is similar to that obtained by DR on the original set. In Table 2, the running times of the methods ‘DR on simplified set + out-of-sample’ and ‘DR on original set’ are listed; the running time is reduced by roughly one order of magnitude.

Table 2. Timing for speedup of DR (in seconds).
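As a rough illustration of the ‘DR on simplified set + out-of-sample’ pipeline, the sketch below embeds only the simplified subset and then places every remaining point by distance-weighted nearest-neighbor interpolation. This interpolation is a simple stand-in for the Nyström-style out-of-sample extension of [6], and the spectral embedding stands in for whichever DR method is being accelerated; neither is the paper's exact procedure.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.neighbors import NearestNeighbors

def dr_with_out_of_sample(H, S_idx, n_components=2, k=10):
    """Embed only the simplified subset, then place every remaining point
    at a distance-weighted average of its k nearest simplified points."""
    S = H[S_idx]
    emb_S = SpectralEmbedding(n_components=n_components).fit_transform(S)

    # Remaining (non-simplified) points and their nearest simplified neighbors.
    rest_idx = np.setdiff1d(np.arange(len(H)), S_idx)
    nbrs = NearestNeighbors(n_neighbors=k).fit(S)
    dist, nn = nbrs.kneighbors(H[rest_idx])

    # Inverse-distance weights, normalized per point.
    w = 1.0 / (dist + 1e-12)
    w /= w.sum(axis=1, keepdims=True)
    emb_rest = np.einsum('ij,ijk->ik', w, emb_S[nn])

    emb = np.empty((len(H), n_components))
    emb[S_idx] = emb_S
    emb[rest_idx] = emb_rest
    return emb
```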

Training Data Simplification: As the amount of training data used in supervised learning grows, the cost of manual annotation becomes increasingly expensive. Moreover, repeated or ‘bad’ samples may be included in the training dataset, adding confusion and complexity to the training data and affecting the convergence of training. Therefore, in supervised and active learning, simplifying the training dataset by retaining the feature points and deleting the repeated or ‘bad’ samples can lighten the burden of manual annotation, reduce the number of training epochs, and save storage space.

Fig. 9. (a) Simplification rate \(r_i\) vs. test accuracy; (b) number of epochs vs. test accuracy, where the arrows indicate the points at which the training processes stop.

First, we studied the relationship between the simplification rate and the test accuracy on MNIST and CIFAR-10. Figure 9(a) plots the simplification rate \(r_i\) vs. the test accuracy. For MNIST, when \(r_i = 0.177\), the test accuracy reaches \(97.4\%\), which is close to the \(98.98\%\) obtained with the original training data. For CIFAR-10, when \(r_i = 0.584\), the test accuracy is \(78.2\%\), whereas with the original training images the test accuracy is \(80.2\%\). Therefore, with the proposed simplification method, the simplified dataset is sufficient for training in these two image classification tasks.

Furthermore, with the simplified training set, the training process can terminate earlier than with the original training set, thus reducing the number of epochs. Figure 9(b) plots the number of epochs vs. the test accuracy. The ‘early stopping’ criterion proposed in [26] was used to determine the stopping time (indicated by arrows). For MNIST, the training process with the original training set stopped at 12000 epochs, with a test accuracy of \(99.07\%\); with the simplified training set, it stopped as early as 6000 epochs, with a test accuracy of \(95.25\%\). For CIFAR-10, the training process stopped at 8000 epochs with the simplified training set, whereas with the original training set it stopped at 11500 epochs. Therefore, with the simplified training dataset, the number of training epochs can be reduced significantly.
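For reference, a minimal sketch of a Prechelt-style stopping check (the generalization-loss criterion of [26]) is given below; the threshold value and the exact variant used in the experiments are assumptions.

```python
def generalization_loss(val_errors):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), following Prechelt [26]."""
    e_opt = min(val_errors)            # lowest validation error seen so far
    return 100.0 * (val_errors[-1] / e_opt - 1.0)

def should_stop(val_errors, threshold=5.0):
    """Stop when the current validation error exceeds the best one by more
    than `threshold` percent (an illustrative threshold)."""
    return generalization_loss(val_errors) > threshold
```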


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, C., Lin, H. (2021). High-Dimensional Dataset Simplification by Laplace-Beltrami Operator. In: Magnenat-Thalmann, N., et al. (eds.) Advances in Computer Graphics. CGI 2021. Lecture Notes in Computer Science, vol. 13002. Springer, Cham. https://doi.org/10.1007/978-3-030-89029-2_43

  • DOI: https://doi.org/10.1007/978-3-030-89029-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89028-5

  • Online ISBN: 978-3-030-89029-2
