High-Dimensional Dataset Simplification by Laplace-Beltrami Operator

  • Conference paper in Advances in Computer Graphics (CGI 2021)

Part of the book series: Lecture Notes in Computer Science, volume 13002

Abstract

With the development of the Internet and other digital technologies, the speed of data generation has become considerably faster than the speed of data processing. Because big data typically contain a large amount of redundant information, it is possible to significantly simplify a big dataset while preserving its key information. In this paper, we develop a high-dimensional (HD) dataset simplification method based on the eigenvalues and eigenfunctions of the Laplace-Beltrami operator (LBO). Specifically, given a dataset that can be considered as an unorganized data point set in an HD space, a discrete LBO defined on the HD dataset is constructed, and its eigenvalues and eigenvectors are calculated. Then, the local extremum points and saddle points of the eigenvectors are taken as the feature points of the HD dataset, which constitute the simplified dataset. Moreover, we develop feature point detection methods for functions defined on an unorganized data point set in HD space, and devise metrics for measuring the fidelity of the simplified dataset to the original set. Finally, examples and applications are presented to validate the efficiency and effectiveness of the proposed methods, showing that the developed HD dataset simplification method makes it feasible to process as large a dataset as possible with limited data processing capability.


References

  1. Albattah, W.: The role of sampling in big data analysis. In: Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–5 (2016)

  2. Aoyama, H.: A study of the stratified random sampling. Ann. Inst. Stat. Math. 6(1), 1–36 (1954)

  3. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22(4), 469–483 (1996)

  4. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)

  5. Belkin, M., Sun, J., Wang, Y.: Constructing Laplace operator from point clouds in \(\mathbb{R}^d\). In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1031–1040. SIAM (2009)

  6. Bengio, Y., Paiement, J.F., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, pp. 177–184 (2004)

  7. Benko, K., Kothe, M., Semmler, K.D., Simon, U.: Eigenvalues of the Laplacian and curvature. In: Colloquium Mathematicum, vol. 42, pp. 19–31. Institute of Mathematics, Polish Academy of Sciences (1979)

  8. Carr, H., Duke, D.J.: Joint contour nets. IEEE Trans. Vis. Comput. Graph. 20(8), 1100–1113 (2014)

  9. Carr, H., Snoeyink, J., Van De Panne, M.: Simplifying flexible isosurfaces using local geometric measures. In: IEEE Visualization 2004, pp. 497–504 (2004)

  10. Castellani, U., Cristani, M., Fantoni, S., Murino, V.: Sparse points matching by combining 3D mesh saliency with statistical descriptors. Comput. Graph. Forum 27(2), 643–652 (2008)

  11. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  12. Dong, S., Bremer, P.T., Garland, M., Pascucci, V., Hart, J.C.: Spectral surface quadrangulation. ACM Trans. Graph. 25(3), 1057–1066 (2006)

  13. Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Proceedings of the 12th International Conference on Pattern Recognition, vol. 1, pp. 566–568 (1994)

  14. Edelsbrunner, H., Harer, J., Patel, A.K.: Reeb spaces of piecewise linear mappings. In: Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, pp. 242–250 (2008)

  15. Gyulassy, A., Bremer, P.T., Hamann, B., Pascucci, V.: A practical approach to Morse-Smale complex computation: scalability and generality. IEEE Trans. Vis. Comput. Graph. 14(6), 1619–1626 (2008)

  16. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)

  17. Körtgen, M., Park, G.J., Novotni, M., Klein, R.: 3D shape matching with 3D shape contexts. In: The 7th Central European Seminar on Computer Graphics, vol. 3, pp. 5–17. Budmerice (2003)

  18. Liu, Z., Zhang, A.: Sampling for big data profiling: a survey. IEEE Access, p. 1 (2020)

  19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)

  20. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2480–2487 (2012)

  21. Mahmud, M., Huang, J., Salloum, S., Emara, T., Sadatdiynov, K.: A survey of data partitioning and sampling methods to support big data analysis. Big Data Min. Anal. 3, 85–101 (2020)

  22. Nassiuma, D.K.: Survey Sampling: Theory and Methods. Nairobi University Press, Nairobi (2001)

  23. Ovsjanikov, M., Mérigot, Q., Mémoli, F., Guibas, L.: One point isometric matching with the heat kernel. Comput. Graph. Forum 29(5), 1555–1564 (2010)

  24. Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: RED-Net: a recurrent encoder-decoder network for video-based face alignment. Int. J. Comput. Vision 126(10), 1103–1119 (2018)

  25. Pérez-Cruz, F.: Kullback-Leibler divergence estimation of continuous distributions. In: 2008 IEEE International Symposium on Information Theory, pp. 1666–1670 (2008)

  26. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Netw. 11(4), 761–767 (1998)

  27. Reeb, G.: Sur les points singuliers d'une forme de Pfaff complètement intégrable ou d'une fonction numérique. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences 222, 847–849 (1946)

  28. Reuter, M., Wolter, F.E., Peinecke, N.: Laplace-Beltrami spectra as ‘shape-DNA’ of surfaces and solids. Comput.-Aided Des. 38(4), 342–366 (2006)

  29. Singh, G., Mémoli, F., Carlsson, G.E.: Topological methods for the analysis of high dimensional data sets and 3D object recognition. In: SPBG, pp. 91–100 (2007)

  30. Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. Comput. Graph. Forum 28(5), 1383–1392 (2009)

  31. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483 (2013)

  32. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)

  33. Vallet, B., Lévy, B.: Spectral geometry processing with manifold harmonics. Comput. Graph. Forum 27(2), 251–260 (2008)

  34. Wang, N., Gao, X., Tao, D., Yang, H., Li, X.: Facial feature point detection: a comprehensive survey. Neurocomputing 275, 50–65 (2018)

  35. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61872316, 61932018, and the National Key R&D Plan of China under Grant No. 2020YFB1708900.

Author information

Corresponding author: Hongwei Lin.

Appendices

Appendix 1. The Detailed Calculation of Metrics

KL-divergence-Based Metric. KL-divergence, also called relative entropy, is typically employed to measure the difference between two discrete probability distributions \(P = \{p_{i}\}\) and \(Q = \{q_{i}\}\), i.e., \( D_{KL}(P\Vert Q) = \sum _{i} p_{i} \log \left( \frac{p_{i}}{q_{i}}\right) \). To define the KL-divergence-based metric, the original dataset \(\mathcal {H}\) and the simplified dataset \(\mathcal {S}\) are represented as matrices \(M_{\mathcal {H}} \in \mathbb {R}^{n \times d}\) and \(M_{\mathcal {S}}\in \mathbb {R}^{m \times d}\), where n and m \((n > m)\) are the numbers of data points in \(\mathcal {H}\) and \(\mathcal {S}\), respectively, and each row stores a data point in \(\mathbb {R}^d\). First, we calculate the probability distribution \(P^{(k)} = \{p^{(k)}_i, i = 1,2,\cdots ,l\}\) for the \(k^{th}\) dimension of the data points in \(\mathcal {H}\), corresponding to the \(k^{th}\) column of matrix \(M_{\mathcal {H}}\). To do this, the elements of the \(k^{th}\) column of \(M_{\mathcal {H}}\) are normalized into the interval [0, 1], which is divided into l subintervals (in our implementation, we set \(l=100\)). Counting the number of elements of the \(k^{th}\) column whose values fall in each of the l subintervals yields a value distribution histogram, which is then used as the probability distribution \(P^{(k)} = \{p^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of \(M_{\mathcal {H}}\). Similarly, the probability distribution \(Q^{(k)} = \{q^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of \(M_{\mathcal {S}}\) is calculated. Then, the KL-divergence between the \(k^{th}\) columns of \(M_{\mathcal {H}}\) and \(M_{\mathcal {S}}\) is \(D_{KL}(P^{(k)}\Vert Q^{(k)}) = \sum _{i} p^{(k)}_{i} \log \left( \frac{p^{(k)}_{i}}{q^{(k)}_{i}}\right) \), and the KL-divergence-based metric is defined as

$$\begin{aligned} d_{KL}(\mathcal {H}, \mathcal {S}) = \sqrt{\sum _{k=1}^d (D_{KL}(P^{(k)}\Vert Q^{(k)}))^2}. \end{aligned}$$
(6)

The KL-divergence-based metric (6) transforms the HD dataset into its probability distribution, and measures the difference between the probability distributions of the original and simplified datasets. As the simplified dataset \(\mathcal {S}\) approaches the original dataset \(\mathcal {H}\), the KL-divergence-based metric \(d_{KL}\) decreases; when \(\mathcal {S} = \mathcal {H}\), \(d_{KL} = 0\).
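For concreteness, the following is a minimal NumPy sketch of this computation. It assumes both datasets are given as row-wise matrices and, since the text does not specify it, normalizes each column of the simplified data with the value range of the corresponding original column so that the two histograms share the same bins; the small eps guarding empty bins is likewise an implementation choice.

```python
import numpy as np

def kl_metric(M_H, M_S, l=100, eps=1e-12):
    """KL-divergence-based metric d_KL(H, S) of Eq. (6).

    M_H: (n, d) original dataset; M_S: (m, d) simplified dataset.
    Each column is normalized to [0, 1] and binned into l subintervals.
    """
    d = M_H.shape[1]
    total = 0.0
    for k in range(d):
        # Normalize both columns with the range of the original column,
        # so the two histograms are built over identical subintervals
        # (an assumption; the paper only states normalization to [0, 1]).
        lo, hi = M_H[:, k].min(), M_H[:, k].max()
        span = hi - lo if hi > lo else 1.0
        p, _ = np.histogram((M_H[:, k] - lo) / span, bins=l, range=(0.0, 1.0))
        q, _ = np.histogram((M_S[:, k] - lo) / span, bins=l, range=(0.0, 1.0))
        p = p / p.sum()
        q = q / q.sum()
        # D_KL(P^(k) || Q^(k)); eps avoids log(0) for empty bins.
        d_kl = np.sum(p * np.log((p + eps) / (q + eps)))
        total += d_kl ** 2
    return np.sqrt(total)
```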

Hausdorff Distance. The Hausdorff distance measures how far two sets are from each other. Suppose \(\mathcal {X}\) and \(\mathcal {Y}\) are two datasets in d-dimensional space, and denote by \(D(\mathcal {X},\mathcal {Y}) = \sup _{x\in \mathcal {X}}\inf _{y\in \mathcal {Y}} d(x,y)\) the maximum distance from a point in set \(\mathcal {X}\) to the set \(\mathcal {Y}\), where d(x, y) is the Euclidean distance between two points x and y. The Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) is defined as \( d_{H}(\mathcal {H},\mathcal {S}) = \max \{D(\mathcal {H},\mathcal {S}),D(\mathcal {S},\mathcal {H})\}\). Because the simplified dataset \(\mathcal {S}\) is a subset of the original dataset \(\mathcal {H}\), \(D(\mathcal {S},\mathcal {H}) = 0\). Therefore, the Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) can be calculated by

$$\begin{aligned} d_{H}(\mathcal {H},\mathcal {S}) = D(\mathcal {H},\mathcal {S}) = \sup _{x\in \mathcal {H}}\inf _{y\in \mathcal {S}} d(x,y). \end{aligned}$$
(7)

That is, the Hausdorff distance is the maximum distance from an arbitrary point in the original dataset \(\mathcal {H}\) to the simplified dataset \(\mathcal {S}\).
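A small sketch of Eq. (7) follows, assuming the simplified set is a subset of the original set so that only the one-sided distance \(D(\mathcal {H},\mathcal {S})\) is needed; the use of scipy's cKDTree for nearest-neighbor queries is an implementation choice, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def hausdorff_metric(H, S):
    """One-sided Hausdorff distance of Eq. (7):
    d_H(H, S) = sup_{x in H} inf_{y in S} d(x, y),
    valid because the simplified set S is a subset of H."""
    tree = cKDTree(S)
    # Distance from every original point to its nearest simplified point.
    nearest_dist, _ = tree.query(H, k=1)
    return nearest_dist.max()
```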

Volume of Convex Hull. As the third metric, we use the difference of the volumes of the convex hulls to measure the fidelity of \(\mathcal {S}\) to \(\mathcal {H}\), i.e.,

$$\begin{aligned} d_{V}(\mathcal {H},\mathcal {S}) = \left| V(\mathcal {H}_{conv})-V(\mathcal {S}_{conv}) \right| , \end{aligned}$$
(8)

where \(\mathcal {H}_{conv}\) and \(\mathcal {S}_{conv}\) are the convex hulls of \(\mathcal {H}\) and \(\mathcal {S}\), and \(V(\mathcal {H}_{conv})\) and \(V(\mathcal {S}_{conv})\) are their volumes. Because the volume of a convex hull in an HD space is difficult to calculate, we first perform the DR operation [4] to project the datasets into 3D space. Afterwards, we employ the quickhull algorithm [3] to construct the convex hulls of the two datasets, and calculate their volumes correspondingly.
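Below is a hedged sketch of the convex-hull-volume metric (8). The 3D spectral embedding is only a stand-in for the Laplacian-eigenmaps DR of [4], and embedding the whole dataset once (so both hulls live in the same coordinates, with the simplified set identified by its row indices) is our assumption; scipy's ConvexHull implements the quickhull algorithm of [3].

```python
import numpy as np
from scipy.spatial import ConvexHull             # quickhull [3]
from sklearn.manifold import SpectralEmbedding   # stand-in for the DR of [4]

def convex_hull_metric(H, S_idx):
    """Convex-hull-volume metric d_V(H, S) of Eq. (8).

    H: (n, d) original dataset; S_idx: row indices of the simplified points
    in H (the simplified set is a subset of the original set).
    """
    # Project the whole dataset into 3D once, so both hulls share coordinates.
    emb = SpectralEmbedding(n_components=3).fit_transform(H)
    vol_H = ConvexHull(emb).volume
    vol_S = ConvexHull(emb[S_idx]).volume
    return abs(vol_H - vol_S)
```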

Appendix 2. The Simplification Procedures on MNIST and Human Face Dataset

Fig. 6. Simplification of the MNIST dataset. (a–d) The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, and 30 LBO eigenvectors, consisting of 82, 120, 196, and 280 data points, respectively. (e, f) The DR result (using t-SNE) of the original dataset, illustrated by colored digits (e) and by the original images (f). (Color figure online)

MNIST: For clarity of the visual analysis, we selected 900 gray-scale images of the digits ‘0’ to ‘4’ from the MNIST dataset, and resized each image to \(8 \times 8\). The selected images were projected onto the 2D plane using t-SNE. Whereas in Fig. 6(e) colored digits are placed at the projected 2D points, in Fig. 6(f) the images themselves are located at the projected points. The simplified datasets are displayed in Figs. 6(a)–6(d). For illustration, their positions in the 2D plane are taken from the dimensionality-reduced dataset (Fig. 6(e)). In the simplified dataset in Fig. 6(a), which has 82 data points (approximately \(9\%\) of the original dataset), all five digit categories are represented. In the simplified dataset in Fig. 6(c), which has 196 data points (approximately \(22\%\) of the original dataset), virtually all of the key features of the dataset are represented, including the subset of digit ‘2’ inside the class of digit ‘1’. In Fig. 6(d), the simplified dataset contains 280 data points (approximately \(31\%\) of the original dataset). We can observe that the boundary of each class of data points has formed (Fig. 6(d)), and that the geometric shape of each class is similar to that of the corresponding class in the original dataset (Fig. 6(e)).

Fig. 7. The simplified datasets of the human face dataset, after DR by PCA. Each simplified dataset comprises the images marked with red frames. (a–e) The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, 30, and 50 LBO eigenvectors, consisting of 16, 40, 95, 150, and 270 data points, respectively. (f) The feature points of the first 5 eigenvectors. (Color figure online)

Human Face: The human face dataset contains 698 gray-scale images (\(64 \times 64\)) of a plaster head sculpture. The images were captured by varying three parameters: the up-down pose, the left-right pose, and the lighting direction. Therefore, the images in the dataset can be regarded as being distributed on a 3D manifold. As illustrated in Fig. 7, after the human face dataset is reduced to a 2D plane using PCA, it can be observed that the longitudinal axis represents the sculpture's rotation in the left-right direction, and the horizontal axis indicates the change of lighting.

In Fig. 7, the images marked with red frames constitute the simplified dataset. In the simplified dataset in Fig. 7(a), the critical points at the boundary are captured. In Figs. 7(b)–7(d), increasingly more critical points at the boundary and in the interior are added to the simplified datasets. In Fig. 7(e), virtually all of the images at the boundary are included in the simplified set. For clarity, we display the feature points of the first five LBO eigenvectors in Fig. 7(f), which indeed capture several ‘extreme’ images in the human face dataset. Specifically, the two feature points of the first eigenvector, \(\phi _1\), are the images with ‘leftmost pose + most lighting’ and ‘rightmost pose + most lighting’. In the feature point set of the second eigenvector, \(\phi _2\), an image with ‘leftmost pose + least lighting’ is added. Similarly, the feature point sets of the eigenvectors \(\phi _3, \phi _4\), and \(\phi _5\) contain more ‘extreme’ images in terms of pose and lighting (Fig. 7(f)).

Fig. 8. The simplified datasets of the human face dataset generated by the SRS method, after DR by PCA. Each simplified dataset comprises the images marked with red frames. (a–e) The simplified datasets consisting of 16, 40, 95, 150, and 270 data points, respectively. (Color figure online)

Appendix 3. Additional Comparisons

To compare the two methods in more detail, the simplified datasets generated by the SRS method with the best performance over 200 sampling trials are visualized in Fig. 8, after DR by PCA. Compared with these SRS results, the simplified datasets generated by our method (Fig. 7) contain more data points on the boundaries. Because boundary points generally carry more information about the shape of a dataset than interior points, the simplified datasets generated by our method preserve more of the important information than those generated by the SRS method.

Compared with other existing statistical sampling methods, the proposed method can be applied to a broader range of dataset types. Some sampling methods, such as stratified sampling [2], ‘Tomek links’ [32], and ‘SMOTE’ [11], require data classification labels as supplementary information; when such labels are missing, these methods cannot be applied. Compared with topological dataset simplification methods, such as Reeb spaces [14], joint contour nets [8], and Mapper [29], our method also has clear advantages. With our method, the size of the simplified dataset is easier to control than with the topological methods. More importantly, in some cases topological methods may generate new data points whose ‘legality’ cannot be guaranteed. For example, a simplified human face dataset obtained by a topological method cannot guarantee that each data point is a ‘correct’ plaster photo. Therefore, topological simplification methods are suitable for extracting the topological structure of a given dataset, but unsuitable for generating simplified datasets with a specified number of points.

Appendix 4. Additional Applications

Speedup of DR: Running DR algorithms on an HD dataset is typically computationally expensive. Given an original dataset \(\mathcal {H}\), if we first perform DR on the simplified dataset \(\mathcal {S}\), and then employ the ‘out-of-sample’ algorithm [6] to calculate the DR coordinates of the remaining data points, the computational cost can be reduced significantly. Because the simplified dataset \(\mathcal {S}\) captures the feature points of the original dataset, the result generated by DR on the simplified set plus the ‘out-of-sample’ extension is similar to that obtained by DR on the original set. In Table 2, the running times of the methods ‘DR on simplified set + out-of-sample’ and ‘DR on original set’ are listed; the running time is reduced by roughly one order of magnitude.

Table 2. Timing for speedup of DR (in seconds).
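As a rough illustration of the ‘DR on simplified set + out-of-sample’ pipeline, the sketch below embeds only the simplified subset and then places every remaining point by distance-weighted nearest-neighbor interpolation. This interpolation is a simple stand-in for the Nyström-style out-of-sample extension of [6], and the spectral embedding stands in for whichever DR method is being accelerated; neither is the paper's exact procedure.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.neighbors import NearestNeighbors

def dr_with_out_of_sample(H, S_idx, n_components=2, k=10):
    """Embed only the simplified subset, then place every remaining point
    at a distance-weighted average of its k nearest simplified points."""
    S = H[S_idx]
    emb_S = SpectralEmbedding(n_components=n_components).fit_transform(S)

    # Remaining (non-simplified) points and their nearest simplified neighbors.
    rest_idx = np.setdiff1d(np.arange(len(H)), S_idx)
    nbrs = NearestNeighbors(n_neighbors=k).fit(S)
    dist, nn = nbrs.kneighbors(H[rest_idx])

    # Inverse-distance weights, normalized per point.
    w = 1.0 / (dist + 1e-12)
    w /= w.sum(axis=1, keepdims=True)
    emb_rest = np.einsum('ij,ijk->ik', w, emb_S[nn])

    emb = np.empty((len(H), n_components))
    emb[S_idx] = emb_S
    emb[rest_idx] = emb_rest
    return emb
```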

Training Data Simplification: As the amount of training data used in supervised learning grows, the cost of manual annotation becomes increasingly expensive. Moreover, repeated or ‘bad’ samples may be included in the training dataset, adding confusion and complexity to the training data and affecting the convergence of training. Therefore, in supervised and active learning, simplifying the training dataset by retaining the feature points and deleting the repeated or ‘bad’ samples can lighten the burden of manual annotation, reduce the number of training epochs, and save storage space.

Fig. 9. (a) Simplification rate \(r_i\) vs. test accuracy; (b) number of epochs vs. test accuracy, where the arrows indicate the points at which the training processes stop.

First, we studied the relationship between the simplification rate and the test accuracy on MNIST and CIFAR-10. Figure 9(a) plots the simplification rate \(r_i\) vs. the test accuracy. For MNIST, when \(r_i = 0.177\), the test accuracy reaches \(97.4\%\), which is close to the \(98.98\%\) obtained with the original training data. For CIFAR-10, when \(r_i = 0.584\), the test accuracy is \(78.2\%\), whereas with the original training images the test accuracy is \(80.2\%\). Therefore, with the proposed simplification method, the simplified dataset is sufficient for training in these two image classification tasks.

Furthermore, with the simplified training set, the training process can terminate earlier than with the original training set, thus reducing the number of epochs. Figure 9(b) plots the number of epochs vs. the test accuracy. The ‘early stopping’ criterion proposed in [26] was used to determine the stopping time (indicated by arrows). For MNIST, the training process with the original training set stopped at 12000 epochs, with a test accuracy of \(99.07\%\); with the simplified training set, it stopped as early as 6000 epochs, with a test accuracy of \(95.25\%\). For CIFAR-10, the training process stopped at 8000 epochs with the simplified training set, whereas with the original training set it stopped at 11500 epochs. Therefore, with the simplified training dataset, the number of training epochs can be reduced significantly.
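For reference, a minimal sketch of a Prechelt-style stopping check (the generalization-loss criterion of [26]) is given below; the threshold value and the exact variant used in the experiments are assumptions.

```python
def generalization_loss(val_errors):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), following Prechelt [26]."""
    e_opt = min(val_errors)            # lowest validation error seen so far
    return 100.0 * (val_errors[-1] / e_opt - 1.0)

def should_stop(val_errors, threshold=5.0):
    """Stop when the current validation error exceeds the best one by more
    than `threshold` percent (an illustrative threshold)."""
    return generalization_loss(val_errors) > threshold
```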


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, C., Lin, H. (2021). High-Dimensional Dataset Simplification by Laplace-Beltrami Operator. In: Magnenat-Thalmann, N., et al. (eds.) Advances in Computer Graphics. CGI 2021. Lecture Notes in Computer Science, vol. 13002. Springer, Cham. https://doi.org/10.1007/978-3-030-89029-2_43

  • DOI: https://doi.org/10.1007/978-3-030-89029-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89028-5

  • Online ISBN: 978-3-030-89029-2
