A transformer model for boundary detection in continuous sign language

Multimedia Tools and Applications

Abstract

Sign Language Recognition (SLR) has attracted significant research attention in recent years, particularly Continuous Sign Language Recognition (CSLR), which is considerably more complex than Isolated Sign Language Recognition (ISLR). A prominent challenge in CSLR is accurately detecting the boundaries of isolated signs within a continuous video stream. In addition, the reliance of existing models on handcrafted features limits the accuracy they can achieve. To overcome these challenges, we propose a novel Transformer-based model that improves accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. During training on isolated sign videos, hand keypoint features extracted from the input video are enriched by the Transformer model and forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect the boundaries of isolated signs within continuous sign videos. Evaluation on two distinct datasets, each containing continuous signs and their corresponding isolated signs, demonstrates promising results.
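The pipeline described in the abstract can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: a single-head self-attention layer (in plain numpy) enriches per-frame hand-keypoint features, a pooled linear layer classifies each window as an isolated sign, and a sliding-window pass over the continuous stream flags low-confidence regions as candidate sign boundaries. The window size, stride, confidence threshold, and parameter shapes are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of per-frame hand-keypoint features.
    # Single-head scaled dot-product attention enriches each frame
    # with context from the rest of the window.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V  # (T, d) enriched features

def classify_window(X, params):
    # Enrich, pool over time, then apply a linear classification layer.
    Wq, Wk, Wv, Wc = params
    H = self_attention(X, Wq, Wk, Wv)
    logits = H.mean(axis=0) @ Wc  # (num_classes,)
    return softmax(logits)

def detect_boundaries(video_feats, params, win=8, stride=2, thresh=0.6):
    # Post-processing sketch: slide the isolated-sign classifier over the
    # continuous stream; windows where the top class probability dips
    # below `thresh` are treated as candidate boundaries between signs.
    boundaries = []
    for start in range(0, len(video_feats) - win + 1, stride):
        probs = classify_window(video_feats[start:start + win], params)
        if probs.max() < thresh:
            boundaries.append(start + win // 2)  # boundary at window centre
    return boundaries
```

In practice the attention layer would be a full (trained) Transformer encoder and the keypoints would come from a hand-pose estimator; the confidence-dip heuristic stands in for the paper's post-processing step.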


Data availability

Not applicable.

Code availability

Not applicable.


Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Ethics approval

Not applicable.

Consent for publication

All authors confirm their consent for publication.

Competing interests

The authors certify that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. A transformer model for boundary detection in continuous sign language. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19079-x
