
A rotation robust shape transformer for cartoon character recognition

  • Original article, published in The Visual Computer

Abstract

Recognizing cartoon characters accurately is important for animators who design and create cartoon scenarios from existing cartoon materials. Current deep learning approaches are sensitive to image rotation and rely heavily on rich textures, which rarely exist in cartoon figures. To address this problem, our work focuses on the distinct nature of shapes, which mostly encode the geometric structure of contours and therefore yield more discriminative and robust features than textures. We propose a rotation-robust shape transformer for cartoon character recognition. Since convolutional filters can hardly detect discriminative gradient information in cartoon figures, we leverage multi-scale shape context (SC) to capture the geometry of contour sampling points rather than gray-level differences. Further, we propose a rotation-invariant positional encoding to describe the geometric relations among local shape features. The contributions of the different scales of SC templates are learned by an attention-based transformer encoder. The resulting network learns shape information effectively from cartoon contours alone. This simple design attains nearly 100% recognition accuracy, beating both handcrafted and deep learning methods on the proposed challenging Cartoon dataset as well as on traditional datasets. In particular, we achieve 86.19% recognition accuracy on the rotation test set, exceeding the state-of-the-art methods by 58.30 percentage points. Moreover, we develop an online cartoon character recognition application for animation scenarios.
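The multi-scale shape context at the core of this pipeline builds on the classical log-polar shape context descriptor, which histograms the relative positions of contour sample points. The following is a minimal single-scale sketch of that classical descriptor; the bin counts, radii, and normalization here are illustrative choices, not the paper's exact multi-scale configuration:

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Classical log-polar shape-context histogram for each contour point.

    points: (N, 2) array of contour sample coordinates.
    Returns an (N, n_r * n_theta) array of per-point histograms.
    """
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]     # pairwise offsets
    dist = np.hypot(diff[..., 0], diff[..., 1])
    # Normalize by the mean pairwise distance for scale invariance.
    mean_d = dist[~np.eye(n, dtype=bool)].mean()
    dist = dist / mean_d
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Log-spaced radial bins and uniform angular bins.
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    r_bin = np.digitize(dist, r_edges) - 1             # -1 below, n_r above
    t_bin = np.floor(angle / (2 * np.pi / n_theta)).astype(int) % n_theta

    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j and 0 <= r_bin[i, j] < n_r:
                hists[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return hists
```

Note that the angular bins of this classical descriptor are measured in absolute image coordinates, so the histograms change under rotation; this is precisely the gap the paper's rotation-invariant positional encoding is designed to close.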


Figures 1–11 appear in the full article.


Data availability

The datasets generated or analyzed during the current study are available on Google drive (https://drive.google.com/drive/folders/1vhw907BYVosw7wMKmhD7CAe4x0NbenIG?usp=sharing).

Notes

  1. The histogram is one of the most commonly used.

  2. We provide an introduction video of the application in the supplementary material.

  3. We provide more instances in the supplementary material.


Acknowledgements

This work was supported in part by the Natural Science Foundation of China under Grant 62272083 and Grant 61876030, in part by the Liaoning Provincial Natural Science Foundation under Grant 2022-MS-128, in part by the Fundamental Research Funds for the Central Universities under Grant DUT23YG109, and in part by the U.S. National Science Foundation under Grant IIS-1814745.

Author information

Corresponding author

Correspondence to Xin Fan.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "A Rotation Robust Shape Transformer for Cartoon Character Recognition."

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 8730 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jia, Q., Chen, X., Wang, Y. et al. A rotation robust shape transformer for cartoon character recognition. Vis Comput 40, 5575–5588 (2024). https://doi.org/10.1007/s00371-023-03123-2

