Scientific and Technical Journal of Information Technologies, Mechanics and Optics

Automatic sign language translation: a review of neural network methods for recognition and synthesis of spoken and signed language

https://doi.org/10.17586/2226-1494-2024-24-5-669-686

Abstract

This review covers modern methods and technologies for automatic machine translation for deaf and hard-of-hearing people, including recognition and synthesis of both spoken and sign languages. These methods aim to support effective communication between deaf or hard-of-hearing and hearing individuals, and the proposed solutions have potential applications in contemporary human-machine interaction interfaces. Key aspects of new technologies are examined: methods for sign language recognition and synthesis, audiovisual speech recognition and synthesis, existing corpora for training neural network models, and current systems for automatic machine translation. Current neural network approaches are presented, including deep learning methods such as convolutional and recurrent neural networks and transformers. Existing corpora for training recognition and synthesis systems are analyzed, and the challenges and limitations of existing machine translation systems are evaluated. The main shortcomings and specific problems of current automatic machine translation technologies are identified, and promising solutions are proposed. Special attention is given to the applicability of automatic machine translation systems in real-world scenarios. The review highlights the need for further research in data collection and annotation, in the development of new methods and neural network models, and in the creation of innovative technologies for processing audio and video data, all aimed at improving the quality and efficiency of existing automatic machine translation systems.

About the Authors

D. V. Ivanko
St. Petersburg Federal Research Center of the Russian Academy of Sciences
Russian Federation

Denis V. Ivanko - PhD, Senior Researcher

Saint Petersburg (SPC RAS), 199178



D. A. Ryumin
St. Petersburg Federal Research Center of the Russian Academy of Sciences
Russian Federation

Dmitry A. Ryumin - PhD, Senior Researcher

Saint Petersburg (SPC RAS), 199178




Review

For citations:


Ivanko D.V., Ryumin D.A. Automatic sign language translation: a review of neural network methods for recognition and synthesis of spoken and signed language. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2024;24(5):669-686. (In Russ.) https://doi.org/10.17586/2226-1494-2024-24-5-669-686


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2226-1494 (Print)
ISSN 2500-0373 (Online)