
Adaptive down-sampling and dimension reduction in time elastic kernel machines for efficient recognition of isolated gestures. (English) Zbl 1477.68289

Guillet, Fabrice (ed.) et al., Advances in knowledge discovery and management. Volume 6. Selected papers based on the presentations at the “Extraction et gestion des connaissances” conferences, EGC 2014 and EGC 2015, Rennes, France in January 2014 and Luxembourg in January 2015. Cham: Springer. Stud. Comput. Intell. 665, 39-59 (2017).
Summary: In gestural action recognition, the feature vector representing a movement is generally quite large, especially when full-body movements are considered. Furthermore, this feature vector evolves during the performance of the movement, so that a complete movement is fully represented by a matrix \(M\) of size \(D \times T\), whose element \(M_{i,j}\) is the value of feature \(i\) at timestamp \(j\). Many studies have addressed dimensionality reduction considering only the size of the feature vector lying in \(\mathbb{R}^D\), in order to reduce both the variability of gestural sequences expressed in the reduced space and the computational complexity of their processing. By contrast, very few of these methods have explicitly addressed dimensionality reduction along the time axis. Yet this is a major issue when using elastic distances, which are characterized by a quadratic complexity along the time axis. We present in this paper an evaluation of straightforward approaches for reducing the dimensionality of the matrix \(M\) for each movement, considering both the reduction of the feature vector and the reduction along the time axis. The dimensionality reduction of the feature vector is achieved by selecting remarkable joints in the skeleton performing the movement, basically the extremities of the articulatory chains composing the skeleton. The temporal dimensionality reduction is achieved using either a regular or an adaptive down-sampling that seeks to minimize the reconstruction error of the movements. Elastic and Euclidean kernels are then compared through support vector machine learning. Two data sets that are widely referenced in the domain of human gesture recognition, and quite distinct in terms of quality of motion capture, are used for the experimental assessment of the proposed approaches.
On these data sets we experimentally show that it is feasible, and possibly desirable, to reduce significantly and simultaneously both the size of the feature vector and the number of skeleton frames used to represent body movements while maintaining a very good recognition rate. The method gives satisfactory results, at a level currently reached by state-of-the-art methods on these data sets. We also show experimentally that the resulting reduction in computational complexity makes this approach eligible for real-time applications.
For the entire collection see [Zbl 1423.68025].
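
The two reductions described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`select_rows`, `regular_downsample`, `dtw`) are hypothetical, only the regular (not the adaptive) down-sampling is shown, and plain dynamic time warping is used as an assumed stand-in for the elastic distances mentioned in the summary. The sketch only makes the cost argument concrete: the elastic comparison fills a table with one cell per pair of frames, so dividing the number of frames by 10 divides the number of cells by 100.

```python
import numpy as np


def select_rows(M, keep):
    """Keep only the feature rows listed in `keep` (hypothetical helper).

    In the setting of the summary these rows would be the coordinates of
    selected joints; which rows to keep is an assumption left to the caller.
    """
    return M[np.asarray(list(keep)), :]


def regular_downsample(M, n_frames):
    """Keep `n_frames` columns at evenly spaced time indices.

    The first and last frames are always retained.
    """
    T = M.shape[1]
    idx = np.round(np.linspace(0, T - 1, n_frames)).astype(int)
    return M[:, idx]


def dtw(A, B):
    """Plain dynamic time warping between two D x T matrices.

    Assumed stand-in for an elastic distance; it performs
    A.shape[1] * B.shape[1] cell updates, hence the quadratic cost in time.
    """
    Ta, Tb = A.shape[1], B.shape[1]
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            # Local cost: Euclidean distance between frame i of A and frame j of B.
            d = np.linalg.norm(A[:, i - 1] - B[:, j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb]
```

For example, with a hypothetical movement of \(D = 60\) features and \(T = 200\) frames, `regular_downsample(select_rows(M, range(0, 60, 4)), 20)` yields a \(15 \times 20\) matrix, and comparing two such reduced movements with `dtw` touches \(20 \times 20 = 400\) cells instead of \(200 \times 200 = 40000\).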

MSC:

68T10 Pattern recognition, speech recognition
68T05 Learning and adaptive systems in artificial intelligence
