
Recognition of human movements from video data. (English. Russian original) Zbl 07591293

J. Comput. Syst. Sci. Int. 61, No. 2, 233-239 (2022); translation from Izv. Ross. Akad. Nauk, Teor. Sist. Upr. 2022, No. 2, 100-106 (2022).

MSC:

68T10 Pattern recognition, speech recognition
68U10 Computing methodologies for image processing
94A08 Image processing (compression, reconstruction, etc.) in information and communication theory

Software:

ViViT; ViT; BERT
Full Text: DOI

References:

[1] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014, pp. 1725-1732.
[2] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proceedings of the European Conference on Computer Vision (Springer, Cham, 2016), pp. 20-36.
[3] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, 2018, pp. 803-818.
[4] R. G. Neichev, A. M. Katrutsa, and V. V. Strizhov, “Robust selection of multicollinear features in forecasting,” Zavod. Lab. Diagn. Mater. 82, 68-74 (2016).
[5] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, “SMART frame selection for action recognition,” arXiv: 2012.10671 (2020).
[6] S. Agethen and W. H. Hsu, “Deep multi-kernel convolutional LSTM networks and an attention-based mechanism for videos,” IEEE Trans. Multimedia 22, 819-829 (2019) · doi:10.1109/TMM.2019.2932564
[7] C. Li, P. Wang, S. Wang, Y. Hou, and W. Li, “Skeleton-based action recognition using LSTM and CNN,” in Proceedings of the IEEE International Conference on Multimedia and Expo Workshops ICMEW (IEEE, Hong Kong, 2017), pp. 585-590.
[8] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in video sequences using deep bi-directional LSTM with CNN features,” IEEE Access 6, 1155-1166 (2017) · doi:10.1109/ACCESS.2017.2778011
[9] S. Li, J. Yi, Y. A. Farha, and J. Gall, “Pose refinement graph convolutional network for skeleton-based action recognition,” IEEE Robot. Autom. Lett. 6, 1028-1035 (2021) · doi:10.1109/LRA.2021.3056361
[10] W. Peng, J. Shi, Z. Xia, and G. Zhao, “Mix dimension in Poincaré geometry for 3D skeleton-based action recognition,” in Proceedings of the 28th ACM International Conference on Multimedia, Seattle, 2020, pp. 1432-1440.
[11] R. G. Neichev, “Multimodel forecasting of multiscale time series in internet of things,” in Proceedings of the 11th International Conference on Intelligent Data Processing: Theory and Applications, Barcelona, Spain, 2016.
[12] R. Gao, T. H. Oh, K. Grauman, and L. Torresani, “Listen to look: Action recognition by previewing audio,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 2020, pp. 10457-10467.
[13] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, “Epic-fusion: Audio-visual temporal binding for egocentric action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, 2019, pp. 5492-5501.
[14] J. Chen and C. M. Ho, “MM-ViT: Multi-modal video transformer for compressed video action recognition,” arXiv: 2108.09322 (2021).
[15] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, 2018.
[16] H. Kwon, M. Kim, S. Kwak, and M. Cho, “Motionsqueeze: Neural motion feature learning for video understanding,” in Proceedings of the European Conference on Computer Vision (Springer, Cham, 2020), pp. 345-362.
[17] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” arXiv: 1406.2199 (2014).
[18] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv: 1503.02531 (2015).
[19] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, “Real-time action recognition with enhanced motion vector CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016, pp. 2718-2726.
[20] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, “Real-time action recognition with deeply transferred motion vector CNNs,” IEEE Trans. Image Process. 27, 2326-2339 (2018) · doi:10.1109/TIP.2018.2791180
[21] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017, pp. 6299-6308.
[22] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 35, 221-231 (2012) · doi:10.1109/TPAMI.2012.59
[23] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015, pp. 4489-4497.
[24] J. Chen, J. Hsiao, and C. M. Ho, “Residual frames with efficient pseudo-3D CNN for human action recognition,” arXiv: 2008.01057 (2020).
[25] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017, pp. 5533-5541.
[26] D. Tran, H. Wang, L. Torresani, and M. Feiszli, “Video classification with channel-separated convolutional networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, 2019, pp. 5552-5561.
[27] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018, pp. 6450-6459.
[28] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” arXiv: 1606.01847 (2016).
[29] A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” arXiv: 2103.00020 (2021).
[30] H. Luo, L. Ji, M. Zhong, et al., “CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval,” arXiv: 2104.08860 (2021).
[31] S. Lee, Y. Yu, G. Kim, et al., “Parameter efficient multimodal transformers for video representation learning,” arXiv: 2012.04124 (2020).
[32] Y. H. H. Tsai, S. Bai, P. P. Liang, et al., “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (NIH Public Access, 2019), p. 6558.
[33] A. Zadeh, C. Mao, K. Shi, et al., “Factorized multimodal transformer for multimodal sequential learning,” arXiv: 1911.09826 (2019).
[34] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv: 2010.11929 (2020).
[35] A. Arnab, M. Dehghani, G. Heigold, et al., “ViViT: A video vision transformer,” arXiv: 2103.15691 (2021).
[36] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” arXiv: 2102.05095 (2021).
[37] K. Soomro, A. R. Zamir, and M. Shah, “A dataset of 101 human action classes from videos in the wild,” Center Res. Comput. Vision 2 (11) (2012).
[38] X. Li, C. Liu, Y. Zhang, et al., “VidTr: Video transformer without convolutions,” arXiv: 2104.11746 (2021).
[39] S. Sun, Z. Kuang, L. Sheng, et al., “Optical flow guided feature: A fast and robust motion representation for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018, pp. 1390-1399.
[40] C. Y. Ma, M. H. Chen, and Z. Kira, “TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition,” Signal Process.: Image Commun. 71, 76-87 (2019).
[41] A. Mazari and H. Sahbi, “MLGCN: Multi-Laplacian graph convolutional networks for human action recognition,” in Proceedings of the British Machine Vision Conference BMVC, Cardiff, 2019, p. 281.
[42] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR, Colorado Springs (IEEE, 2011), pp. 3337-3344.
[43] R. Zellers and Y. Choi, “Zero-shot activity recognition with verb attribute induction,” arXiv: 1707.09468 (2017).
[44] M. Jain, J. C. van Gemert, T. Mensink, et al., “Objects2Action: Classifying and localizing actions without any video example,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015, pp. 4588-4596.
[45] J. Gao, T. Zhang, and C. Xu, “I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, 2019, Vol. 33, pp. 8303-8311.
[46] C. Gan, M. Lin, Y. Yang, et al., “Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, 2016.
[47] B. Brattoli, J. Tighe, F. Zhdanov, et al., “Rethinking zero-shot video classification: End-to-end training for realistic applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 2020, pp. 4613-4623.
[48] J. Qin, L. Liu, L. Shao, et al., “Zero-shot action recognition with error-correcting output codes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017, pp. 2833-2842.
[49] Q. Wang and K. Chen, “Alternative semantic representations for zero-shot human action recognition,” in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (Springer, Cham, 2017), pp. 87-102.
[50] S. Chen and D. Huang, “Elaborative rehearsal for zero-shot action recognition,” arXiv: 2108.02833 (2021).
[51] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv: 1810.04805 (2018).
[52] S. N. Gowda, L. Sevilla-Lara, K. Kim, F. Keller, and M. Rohrbach, “A new split for evaluating true zero-shot action recognition,” arXiv: 2107.13029 (2021).