Abstract
Recognizing cartoon characters accurately helps animators design and create cartoon scenarios from existing cartoon materials. Current deep learning approaches are sensitive to image rotation and rely heavily on rich textures, which rarely exist in cartoon figures. To address this problem, our work focuses on the distinct nature of shapes, which mostly encode the geometric structure of contours and thus yield more discriminative and robust features than textures. We propose a rotation robust shape transformer for cartoon character recognition. Because learned convolutional filters can hardly detect discriminative gradient information in cartoon figures, we leverage multi-scale shape context (SC) to capture the geometry of sampled contour points rather than differences in gray level. Further, we propose a rotation-invariant positional encoding to describe the geometric relations among local shape features. The contributions of the different scales of SC templates are learned by an attention-based transformer encoder. The resulting network learns shape information effectively from cartoon contours alone. This simple design attains nearly 100% recognition accuracy, beating both handcrafted and deep learning methods on the proposed challenging Cartoon dataset as well as traditional shape datasets. In particular, we achieve 86.19% recognition accuracy on the rotation test set, 58.30 percentage points higher than the state-of-the-art methods. Moreover, we develop an online cartoon character recognition application for animation scenarios.
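As a rough, illustrative sketch (not the authors' implementation), the classical log-polar shape context of Belongie et al., which the abstract builds on, can be computed as follows. The bin counts, radial edge values, and mean-distance scale normalization here are illustrative choices; note that binning angles against the image axes is not rotation-invariant by itself, which is the gap the paper's rotation-invariant positional encoding is designed to close.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape context histogram for each sampled contour point.

    points: (N, 2) array of contour coordinates.
    Returns an (N, n_r * n_theta) array of normalized descriptors.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise offsets, distances, and angles between contour samples.
    diff = points[None, :, :] - points[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Normalize distances by the mean pairwise distance for scale invariance.
    r = dist / dist[dist > 0].mean()

    # Log-spaced radial bin edges, uniform angular bins (illustrative values).
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    descriptors = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i == j or r[i, j] >= r_edges[-1]:
                continue  # skip self-pairs and points beyond the outermost ring
            rb = max(np.searchsorted(r_edges, r[i, j]) - 1, 0)
            tb = int(angle[i, j] / (2 * np.pi) * n_theta) % n_theta
            descriptors[i, rb * n_theta + tb] += 1

    # Normalize each histogram to sum to one.
    sums = descriptors.sum(axis=1, keepdims=True)
    return np.divide(descriptors, sums,
                     out=np.zeros_like(descriptors), where=sums > 0)
```

A multi-scale variant, as described in the abstract, would evaluate such histograms at several radial extents per point and let the transformer's attention weight the scales.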
Data availability
The datasets generated or analyzed during the current study are available on Google Drive (https://drive.google.com/drive/folders/1vhw907BYVosw7wMKmhD7CAe4x0NbenIG?usp=sharing).
Notes
The histogram is one of the most commonly used shape descriptors.
We provide an introduction video of the application in the supplementary material.
We provide more instances in the supplementary material.
Acknowledgements
This work was supported in part by the Natural Science Foundation of China under Grant 62272083 and Grant 61876030, in part by the Liaoning Provincial Natural Science Foundation under Grant 2022-MS-128, in part by the Fundamental Research Funds for the Central Universities under Grant DUT23YG109, and in part by the U.S. National Science Foundation under Grant IIS-1814745.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "A Rotation Robust Shape Transformer for Cartoon Character Recognition."
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 8730 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jia, Q., Chen, X., Wang, Y. et al. A rotation robust shape transformer for cartoon character recognition. Vis Comput 40, 5575–5588 (2024). https://doi.org/10.1007/s00371-023-03123-2