A transformer model for boundary detection in continuous sign language

Multimedia Tools and Applications

Abstract

Sign Language Recognition (SLR) has attracted significant research attention in recent years, particularly Continuous Sign Language Recognition (CSLR), which is considerably more complex than Isolated Sign Language Recognition (ISLR). A prominent challenge in CSLR is accurately detecting the boundaries of isolated signs within a continuous video stream. In addition, the reliance of existing models on handcrafted features limits the accuracy they can achieve. To overcome these challenges, we propose a novel Transformer-based model that improves accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. During training on isolated sign videos, hand keypoint features extracted from the input video are enriched by the Transformer model and forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect the boundaries of isolated signs within continuous sign videos. Evaluation on two distinct datasets, each containing continuous signs and their corresponding isolated signs, demonstrates promising results.
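The pipeline described in the abstract can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: a single-head self-attention layer (in plain numpy) enriches per-frame hand-keypoint features, a pooled linear layer classifies each window as an isolated sign, and a sliding-window pass over the continuous stream flags low-confidence regions as candidate sign boundaries. The window size, stride, confidence threshold, and parameter shapes are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of per-frame hand-keypoint features.
    # Single-head scaled dot-product attention enriches each frame
    # with context from the rest of the window.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V  # (T, d) enriched features

def classify_window(X, params):
    # Enrich, pool over time, then apply a linear classification layer.
    Wq, Wk, Wv, Wc = params
    H = self_attention(X, Wq, Wk, Wv)
    logits = H.mean(axis=0) @ Wc  # (num_classes,)
    return softmax(logits)

def detect_boundaries(video_feats, params, win=8, stride=2, thresh=0.6):
    # Post-processing sketch: slide the isolated-sign classifier over the
    # continuous stream; windows where the top class probability dips
    # below `thresh` are treated as candidate boundaries between signs.
    boundaries = []
    for start in range(0, len(video_feats) - win + 1, stride):
        probs = classify_window(video_feats[start:start + win], params)
        if probs.max() < thresh:
            boundaries.append(start + win // 2)  # boundary at window centre
    return boundaries
```

In practice the attention layer would be a full (trained) Transformer encoder and the keypoints would come from a hand-pose estimator; the confidence-dip heuristic stands in for the paper's post-processing step.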


Data availability

Not applicable.

Code availability

Not applicable.


Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Ethics approval

Not applicable.

Consent for publication

All authors confirm their consent for publication.

Competing interests

The authors certify that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. A transformer model for boundary detection in continuous sign language. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19079-x
