A method for generating digital avatar animation with speech and non-verbal synchronization based on bimodal data
https://doi.org/10.17586/2226-1494-2025-25-4-651-662
Abstract
This paper addresses the task of generating animations of a digital avatar that synchronously reproduces speech, facial expressions, and gestures from bimodal input, namely a static image and an emotionally colored text. The study explores the integration of acoustic, visual, and affective features into a unified model that enables realistic and expressive avatar behavior aligned with both the semantic content and the emotional tone of the utterance. The proposed method comprises several stages: extraction of visual landmarks of the face, hands, and body pose; gender recognition for selecting an appropriate voice profile; emotional analysis of the input text; and generation of synthetic speech. All extracted features are integrated within a generative architecture based on a diffusion model enhanced with temporal attention mechanisms and cross-modal alignment strategies, which ensures high-precision synchronization between speech and the avatar's nonverbal behavior. The training process utilized two specialized datasets: one focused on gesture modeling and the other on facial expression synthesis. Annotation was performed using automated spatial landmark extraction tools. Experimental evaluation was conducted on a multiprocessor computing platform with GPU acceleration, and model performance was assessed using a set of objective metrics. The proposed method demonstrated a high degree of visual and semantic coherence: FID of 50.13, FVD of 601.70, SSIM of 0.752, PSNR of 21.997, E-FID of 2.226, Sync-D of 7.003, and Sync-C of 6.398. The model effectively synchronizes speech with facial expressions and gestures, accounts for the emotional context of the text, and incorporates features of Russian Sign Language. The proposed approach has potential applications in emotionally aware human-computer interaction systems, digital assistants, educational platforms, and psychological interfaces. The method is of interest to researchers in artificial intelligence, multimodal interfaces, computer graphics, and digital psychology.
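The preprocessing stage described above (extraction of face, hand, and body-pose landmarks from the static input image) relies on the BlazeFace, MediaPipe Hands, and BlazePose detectors cited in references 20-22. The following Python sketch shows how such landmark extraction could be performed with MediaPipe Holistic; it is an illustrative assumption rather than the authors' implementation, and the function name and output structure are hypothetical.

```python
# Illustrative sketch only: face, hand, and body-pose landmark extraction for a
# single avatar source image using MediaPipe Holistic (built on BlazeFace,
# MediaPipe Hands, and BlazePose). Not the authors' implementation.
import cv2
import mediapipe as mp


def extract_landmarks(image_path: str) -> dict:
    """Return face, hand, and pose landmarks for one static source image."""
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # static_image_mode=True treats the input as an independent image
    # rather than a video frame.
    with mp.solutions.holistic.Holistic(static_image_mode=True,
                                        refine_face_landmarks=True) as holistic:
        results = holistic.process(rgb)

    def to_list(landmarks):
        # Each landmark is normalized to [0, 1] in image coordinates.
        return [(lm.x, lm.y, lm.z) for lm in landmarks.landmark] if landmarks else []

    return {
        "face": to_list(results.face_landmarks),
        "left_hand": to_list(results.left_hand_landmarks),
        "right_hand": to_list(results.right_hand_landmarks),
        "pose": to_list(results.pose_landmarks),
    }
```

In such a pipeline, the resulting landmark sets would serve as spatial conditioning signals for the diffusion-based generator, alongside the synthesized speech and the emotional features of the text.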
About the Authors
A. A. Axyonov
Russian Federation
Alexander A. Axyonov, PhD, Senior Researcher
199178, Saint Petersburg
Scopus Author ID: 57203963345
E. V. Ryumina
Russian Federation
Elena V. Ryumina, Junior Researcher
199178, Saint Petersburg
Scopus Author ID: 57220572427
D. A. Ryumin
Russian Federation
Dmitry A. Ryumin, PhD, Senior Researcher
199178, Saint Petersburg
Scopus Author ID: 57191960214
References
1. Sincan O.M., Keles H.Y. AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods. IEEE Access, 2020, vol. 8, pp. 181340–181355. doi: 10.1109/ACCESS.2020.3028072
2. Kapitanov A., Kvanchiani K., Nagaev A., Kraynov R., Makhliarchuk A. HaGRID – HAnd Gesture Recognition Image Dataset. Proc. of the Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 4560–4569. doi: 10.1109/WACV57701.2024.00451
3. Busso C., Bulut M., Lee C.C., Kazemzadeh A., Mower E., Kim S., Chang J., Lee S., Narayanan S.S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, vol. 42, no. 4, pp. 335–359. doi: 10.1007/s10579-008-9076-6
4. Shen K., Guo C., Kaufmann M., Zarate J., Valentin J., Song J., Hilliges O. X-Avatar: expressive human avatars. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16911–16921. doi: 10.1109/CVPR52729.2023.01622
5. Zhang H., Chen B., Yang H., Qu L., Wang X., Chen L., Long C., Zhu F., Du D., Zheng M. AvatarVerse: high-quality and stable 3D avatar creation from text and pose. Proc. of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, no. 7, pp. 7124–7132. doi: 10.1609/aaai.v38i7.28540
6. Kim K., Song B. Robust 3D human avatar reconstruction from monocular videos using depth optimization and camera pose estimation. IEEE Access, 2025, vol. 13, pp. 57886–57897. doi: 10.1109/ACCESS.2025.3556445
7. Yuan Y., Li X., Huang Y., De Mello S., Nagano K., Kautz J., Iqbal U. GAvatar: animatable 3D Gaussian avatars with implicit mesh learning. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 896–905. doi: 10.1109/CVPR52733.2024.00091
8. Teotia K., Mallikarjun B.R., Pan X., Kim H., Garrido P., Elgharib M., Theobalt C. HQ3DAvatar: high-quality implicit 3D head avatar. ACM Transactions on Graphics, 2024, vol. 43, no. 3, pp. 1–24. doi: 10.1145/3649889
9. Yang L., Zhang Z., Song Y., Hong S., Xu R., Zhao Y., Zhang W., Cui B., Yang M. Diffusion models: a comprehensive survey of methods and applications. ACM Computing Surveys, 2023, vol. 56, no. 4, pp. 1–39. doi: 10.1145/3626235
10. Karras J., Holynski A., Wang T., Kemelmacher-Shlizerman I. DreamPose: fashion image-to-video synthesis via stable diffusion. Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 22623–22633. doi: 10.1109/ICCV51070.2023.02073
11. Huang Z., Tang F., Zhang Y., Cun X., Cao J., Li J., Lee T. Make-Your-Anchor: a diffusion-based 2D avatar generation framework. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6997–7006. doi: 10.1109/CVPR52733.2024.00668
12. Blattmann A., Dockhorn T., Kulal S., Mendelevitch D., Kilian M., Lorenz D., Levi Y., English Z., Voleti V., Letts A., Jampani V., Rombach R. Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv, 2023, arXiv:2311.15127. doi: 10.48550/arXiv.2311.15127
13. Zhang L., Rao A., Agrawala M. Adding conditional control to text-to-image diffusion models. Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3813–3824. doi: 10.1109/ICCV51070.2023.00355
14. Zhuang S., Li K., Chen X., Wang Y., Liu Z., Qiao Y., Wang Y. Vlogger: make your dream a vlog. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8806–8817. doi: 10.1109/CVPR52733.2024.00841
15. Xu M., Li H., Su Q., Shang H., Zhang L., Liu C., Wang J., Yao Y., Zhu S. Hallo: hierarchical audio-driven visual synthesis for portrait image animation. arXiv, 2024, arXiv:2406.08801. doi: 10.48550/arXiv.2406.08801
16. Yang S., Li H., Wu J., Jing M., Li L., Ji R., Liang J., Fan H., Wang J. MegActor-Sigma: unlocking flexible mixed-modal control in portrait animation with diffusion transformer. Proc. of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, no. 9, pp. 9256–9264. doi: 10.1609/aaai.v39i9.33002
17. Lin G., Jiang J., Liang C., Zhong T., Yang J., Zheng Y. CyberHost: taming audio-driven avatar diffusion model with region codebook attention. arXiv, 2024, arXiv:2409.01876. doi: 10.48550/arXiv.2409.01876
18. Chen Z., Cao J., Chen Z., Li Y., Ma C. EchoMimic: lifelike audio-driven portrait animations through editable landmark conditions. Proc. of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, no. 3, pp. 2403–2410. doi: 10.1609/aaai.v39i3.32241
19. Serengil S., Özpınar A. A benchmark of facial recognition pipelines and co-usability performances of modules. Bilişim Teknolojileri Dergisi, 2024, vol. 17, no. 2, pp. 95–107. doi: 10.17671/gazibtd.1399077
20. Bazarevsky V., Kartynnik Y., Vakunov A., Raveendran K., Grundmann M. BlazeFace: sub-millisecond neural face detection on mobile GPUs. arXiv, 2019, arXiv:1907.05047. doi: 10.48550/arXiv.1907.05047
21. Zhang F., Bazarevsky V., Vakunov A., Tkachenka A., Sung G., Chang C.L., Grundmann M. MediaPipe hands: on-device real-time hand tracking. arXiv, 2020, arXiv:2006.10214. doi: 10.48550/arXiv.2006.10214
22. Bazarevsky V., Grishchenko I., Raveendran K., Zhu T., Zhang F., Grundmann M. BlazePose: on-device real-time body pose tracking. arXiv, 2020, arXiv:2006.10204. doi: 10.48550/arXiv.2006.10204
23. Xu J., Zou X., Huang K., Chen Y., Liu B., Cheng M., Shi X., Huang J. EasyAnimate: a high-performance long video generation method based on transformer architecture. arXiv, 2024, arXiv:2405.18991. doi: 10.48550/arXiv.2405.18991
24. Hu L. Animate anyone: consistent and controllable image-to-video synthesis for character animation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8153–8163. doi: 10.1109/CVPR52733.2024.00779
25. Ryumina E., Ryumin D., Axyonov A., Ivanko D., Karpov A. Multicorpus emotion recognition method based on cross-modal gated attention fusion. Pattern Recognition Letters, 2025, vol. 190, pp. 192–200. doi: 10.1016/j.patrec.2025.02.024
26. Peng Y., Sudo Y., Shakeel M., Watanabe S. OWSM-CTC: an open encoder-only speech foundation model for speech recognition, translation, and language identification. Proc. of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024, vol. 1, pp. 10192–10209. doi: 10.18653/v1/2024.acl-long.549
27. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A., Kaiser L., Polosukhin I. Attention is all you need. Proc. of the Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, pp. 1–11.
28. Kapitanov A., Kvanchiani K., Nagaev A., Petrova E. Slovo: Russian sign language dataset. Lecture Notes in Computer Science, 2023, vol. 14253, pp. 63–73. doi: 10.1007/978-3-031-44137-0_6
29. Xie L., Wang X., Zhang H., Dong C., Shan Y. VFHQ: a high-quality dataset and benchmark for video face super-resolution. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 657–665. doi: 10.1109/CVPRW56347.2022.00081
30. Kagirov I., Ivanko D., Ryumin D., Axyonov A., Karpov A. TheRuSLan: database of Russian sign language. Proc. of the 12th Conference on Language Resources and Evaluation (LREC), 2020, pp. 6079–6085.
31. Kagirov I., Ryumin D.A., Axyonov A.A., Karpov A.A. Multimedia database of Russian sign language items in 3D. Voprosy jazykoznanija, 2020, no. 1, pp. 104–123. (in Russian). doi: 10.31857/S0373658X0008302-1
32. Axyonov A., Ryumin D., Ivanko D., Kashevnik A., Karpov A. Audio-visual speech recognition in-the-wild: multi-angle vehicle cabin corpus and attention-based method. Proc. of the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024, pp. 8195–8199. doi: 10.1109/ICASSP48485.2024.10448048
33. Liu Z. Super convergence cosine annealing with warm-up learning rate. Proc. of the 2nd International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA), 2022, pp. 1–7.
34. Wang P., Shen L., Tao Z., He S., Tao D. Generalization analysis of stochastic weight averaging with general sampling. Proc. of the 41st International Conference on Machine Learning (ICML), 2024, pp. 51442–51464.
35. Yang H., Zhang Z., Tang H., Qian J., Yang J. ConsistentAvatar: learning to diffuse fully consistent talking head avatar with temporal guidance. Proc. of the 32nd ACM International Conference on Multimedia, 2024, pp. 3964–3973. doi: 10.1145/3664647.3680619
36. Unterthiner T., Van Steenkiste S., Kurach K., Marinier R., Michalski M., Gelly S. Towards accurate generative models of video: a new metric and challenges. arXiv, 2018, arXiv:1812.01717. doi: 10.48550/arXiv.1812.01717
37. Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004, vol. 13, no. 4, pp. 600–612. doi: 10.1109/TIP.2003.819861
38. Hore A., Ziou D. Image quality metrics: PSNR vs. SSIM. Proc. of the 20th International Conference on Pattern Recognition, 2010, pp. 2366–2369. doi: 10.1109/ICPR.2010.579
39. Deng Y., Yang J., Xu S., Chen D., Jia Y., Tong X. Accurate 3D face reconstruction with weakly-supervised learning: from single image to image set. Proc. of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 285–295. doi: 10.1109/CVPRW.2019.00038
40. Prajwal K.R., Mukhopadhyay R., Namboodiri V.P., Jawahar C.V. A lip sync expert is all you need for speech to lip generation in the wild. Proc. of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492. doi: 10.1145/3394171.3413532
For citations:
Axyonov A.A., Ryumina E.V., Ryumin D.A. A method for generating digital avatar animation with speech and non-verbal synchronization based on bimodal data. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(4):651-662. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-4-651-662