A multimodal approach for depression detection using semi-automatic data annotation and deterministic machine learning methods
https://doi.org/10.17586/2226-1494-2025-25-6-1107-1116
Abstract
The trending task of automatic detection of a person's psycho-emotional state is studied in this work. Scientific interest in research on automatic multimodal depression detection arises from the wide prevalence of anxiety-depressive disorders and the difficulty of detecting them in primary health care. The specificity of the task stems from its complexity, scarcity of data, class imbalance, and annotation inaccuracies. Comparative studies show that classification results on semi-automatically annotated data are higher than those on automatically annotated data. The proposed approach to depression detection combines semi-automatic data annotation with deterministic machine learning methods and makes use of several feature sets. To build our models, we used the multimodal Extended Distress Analysis Interview Corpus (E-DAIC), which consists of audio recordings, texts automatically extracted from these audio recordings, and visual feature sets extracted from the video recordings, together with annotation that includes a Patient Health Questionnaire (PHQ-8) score for each recording. Semi-automatic annotation makes it possible to obtain exact time stamps and speech transcripts, which reduces noise in the training data. In the proposed approach, we use several feature sets extracted from each modality: the expert acoustic feature set eGeMAPS, the neural acoustic feature set DenseNet, the expert visual feature set OpenFace, and the text feature set Word2Vec. Complex processing of these features minimizes the effect of class imbalance in the data on the classification results. Experiments using mostly expert features (DenseNet, OpenFace, Word2Vec) and deterministic machine learning classification methods (CatBoost), which yield interpretable classification results, produced results on the E-DAIC corpus that are comparable with existing ones in the field: 68.0 % Weighted F1-measure (WF1) and 64.3 % Unweighted Average Recall (UAR). The use of semi-automatic annotation and modality fusion improved both annotation quality and depression detection compared to unimodal approaches, and more balanced classification results were achieved. The use of deterministic machine learning classification methods based on decision trees will allow an interpretability analysis of the classification results in future work; other interpretation methods, such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), can also be used for this purpose.
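As a brief illustration of the classification stage described in the abstract, the Python sketch below trains a gradient-boosted decision-tree classifier (CatBoost) on early-fused multimodal feature vectors and reports WF1 and UAR. This is a minimal sketch under stated assumptions, not the authors' exact pipeline: the data here is synthetic, the feature dimensionalities, fusion scheme, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch: early fusion of per-recording multimodal features + CatBoost.
# All shapes and values below are illustrative assumptions, not the paper's setup.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200  # toy stand-in for the number of E-DAIC recordings

# Hypothetical per-modality feature matrices (dimensions are placeholders):
# eGeMAPS acoustic functionals, DenseNet audio embeddings,
# OpenFace visual descriptors, Word2Vec text embeddings.
egemaps  = rng.normal(size=(n, 88))
densenet = rng.normal(size=(n, 1024))
openface = rng.normal(size=(n, 709))
word2vec = rng.normal(size=(n, 300))
y = rng.integers(0, 2, size=n)  # binary depression label from a PHQ-8 threshold

# Early (feature-level) fusion: concatenate modality vectors per recording.
X = np.hstack([egemaps, densenet, openface, word2vec])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Gradient boosting over decision trees; class weighting softens label imbalance.
model = CatBoostClassifier(
    iterations=300, depth=6, auto_class_weights="Balanced", verbose=False
)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
wf1 = f1_score(y_te, pred, average="weighted")   # Weighted F1 (WF1)
uar = recall_score(y_te, pred, average="macro")  # Unweighted Average Recall (UAR)
print(f"WF1 = {wf1:.3f}, UAR = {uar:.3f}")
```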
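For the interpretability analysis mentioned at the end of the abstract, SHAP values can be computed directly for the trained tree ensemble; shap.TreeExplainer supports CatBoost models. The snippet below continues from the sketch above (reusing the hypothetical `model` and `X_te`), and the ranking it prints over fused feature indices is only an illustration.

```python
# Minimal sketch of post-hoc interpretation with SHAP for the tree model above.
import numpy as np
import shap

# TreeExplainer handles gradient-boosted tree ensembles such as CatBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)  # (n_samples, n_features) for binary tasks

# Rank fused features by their mean absolute SHAP contribution to the decision.
importance = np.abs(shap_values).mean(axis=0)
top10 = np.argsort(importance)[::-1][:10]
print("Top-10 most influential fused feature indices:", top10)
```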
About the Authors
A. N. Velichko
Russian Federation
Alena N. Velichko, PhD, Senior Researcher
199178; Saint Petersburg
Scopus Author ID: 57203962694
A. A. Karpov
Russian Federation
Alexey A. Karpov, D.Sc., Professor, Head of Laboratory
199178; Saint Petersburg
Scopus Author ID: 57219469958
References
1. Ushakov I.B., Bubeev Yu.A., Syrkin L.D., Karpov A.A., Polyakov A.V., Ivanov A.V., Usov V.M. Remote tele-counseling in primary healthcare for screening of anxiety-depressive disorders with a feedback loop from the patient. System Analysis and Management in Biomedical Systems, 2023, vol. 22, no. 4, pp. 140–153. (in Russian). doi: 10.36622/VSTU.2023.22.4.022
2. Depressive disorder (depression). WHO, 2023. Available at: https://www.who.int/news-room/fact-sheets/detail/depression (accessed: 22.08.2025)
3. Wu W., Zhang C., Woodland P.C. Confidence estimation for automatic detection of depression and Alzheimer’s disease based on clinical interviews. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2024, pp. 3160–3164. doi: 10.21437/Interspeech.2024-546
4. Braun F., Bayerl S.P., Perez-Toro P.A., Hoenig F., Lehfeld H., Hillemacher T., Noeth E., Bocklet T., Riedhammer K. Classifying dementia in the presence of depression: a cross-corpus study. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 2308–2312. doi: 10.21437/Interspeech.2023-1997
5. Brueckner R., Kwon N., Subramanian V., Blaylock N., O’Connell H. Anxiety and Depression Detection using Vocal Biomarkers. Canary Speech report, 2025. Available at: https://canaryspeech.com/blog/anxiety-and-depression-detection-using-vocal-biomarkers/ (accessed: 22.08.2025)
6. Ji J., Dong W., Li J., Peng J., Feng C., Liu R., Shi C., Ma Y. Depressive and mania mood state detection through voice as a biomarker using machine learning. Frontiers in Neurology, 2024, vol. 15, pp. 1394210. doi: 10.3389/fneur.2024.1394210
7. Ringeval F., Schuller B., Valstar M., Cummins N., Cowie R., Tavabi L., et al. AVEC 2019 Workshop and Challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition. Proc. of the 9th International Audio/Visual Emotion Challenge and Workshop, 2019, pp. 3–12. doi: 10.1145/3347320.3357688
8. Gratch J., Artstein R., Lucas G., Stratou G., Scherer S., Nazarian A., et al. The Distress Analysis Interview Corpus of Human and Computer Interviews. Proc. of the 9th International Conference on Language Resources and Evaluation (LREC’14), 2014, pp. 3123–3128.
9. Li Y., Shi S., Yang F., Gao J., Li Y., Tao M., et al. Patterns of comorbidity with anxiety disorders in Chinese women with recurrent major depression. Psychological Medicine, 2012, vol. 42, no. 6, pp. 1239–1248. doi: 10.1017/s003329171100273x
10. Zou B., Han J., Wang Y., Liu R., Zhao S., Feng L., Lyu X., Ma H. Semi-structural interview-based Chinese multimodal depression corpus towards automatic preliminary screening of depressive disorders. IEEE Transactions on Affective Computing, 2022, vol. 14, no. 4, pp. 2823–2838. doi: 10.1109/TAFFC.2022.3181210
11. Campbell E.L., Dineley J., Conde P., Matcham F., White K.M., Oetzmann C., et al. The RADAR-CNS Consortium. Classifying depression symptom severity: Assessment of speech representations in personalized and generalized machine learning models. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 1738–1742. doi: 10.21437/Interspeech.2023-1721
12. Fara S., Hickey O., Georgescu A., Goria S., Molimpakis E., Cummins N. Bayesian Networks for the robust and unbiased prediction of depression and its symptoms utilizing speech and multimodal data. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 1728–1732. doi: 10.21437/Interspeech.2023-1709
13. Tao F., Esposito A., Vinciarelli A. The androids corpus: a new publicly available benchmark for speech based depression detection. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 4149–4153. doi: 10.21437/Interspeech.2023-894
14. Phukan O.C., Jain S., Singh S., Singh M., Budaru A.B., Sarma R. ComFeAT: Combination of neural and spectral features for improved depression detection. arXiv, 2024, arXiv:2406.06774. doi: 10.48550/arXiv.2406.06774
15. Burdisso S., Villatoro-Tello E., Madikeri S., Motlicek P. Node-weighted graph convolutional network for depression detection in transcribed clinical interviews. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 3617–3621. doi: 10.21437/interspeech.2023-1923
16. Zhang X., Li C., Chen W., Zheng J., Li F. Optimizing depression detection in clinical doctor-patient interviews using a multi-instance learning framework. Scientific Reports, 2025, vol. 15, no. 1, pp. 6637. doi: 10.1038/s41598-025-90117-w
17. Tank C., Pol S., Katoch V., Meht S., Anand A., Shah R.R. Depression detection and analysis using large language models on textual and audio-visual modalities. arXiv, 2024, arXiv:2407.06125. doi: 10.48550/arXiv.2407.06125
18. Zhang W., Mao K., Chen J. A multimodal approach for detection and assessment of depression using text, audio and video. Phenomics, 2024, vol. 4, no. 3, pp. 234–249. doi: 10.1007/s43657-023-00152-8
19. Zhang X., Liu H., Xu K., Zhang Q., Liu D., Ahmed B., Epps J. When LLMs meet acoustic landmarks: an efficient approach to integrate speech into large language models for depression detection. Proc. of the Conference on Empirical Methods in Natural Language Processing, 2024, pp. 146–158. doi: 10.18653/v1/2024.emnlp-main.8
20. Dumpala S.H., Dikaios K., Nunes A., Rudzicz F., Uher R., Oore S. Self-supervised embeddings for detecting individual symptoms of depression. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2024, pp. 1450–1454. doi: 10.21437/Interspeech.2024-2344
21. Sadeghi M., Richer R., Egger B., Schindler-Gmelch L., Rupp L.H., Rahimi F., Berking M., Eskofier B.M. Harnessing multimodal approaches for depression detection using large language models and facial expressions. npj Mental Health Research, 2024, vol. 3, no. 1, pp. 66. doi: 10.1038/s44184-024-00112-8
22. Wang J., Ravi V., Flint J., Alwan A. Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification. Speech Communication, 2024, vol. 163, pp. 103106. doi: 10.1016/j.specom.2024.103106
23. Jin N., Ye R., Li P. Diagnosis of depression based on facial multimodal data. Frontiers in Psychiatry, 2025, vol. 16, pp. 1508772. doi: 10.3389/fpsyt.2025.1508772
24. Zhou L., Liu Z., Shangguan Z., Yuan X., Li Y., Hu B. JAMFN: Joint attention multi-scale fusion network for depression detection. Proc. of the Annual Conference of the International Speech Communication Association Interspeech, 2023, pp. 3417–3421. doi: 10.21437/Interspeech.2023-183
25. Velichko A.N., Karpov A.A. An approach to depression detection in speech using a semi-automatic data annotation. Information and Control Systems, 2024, no. 4 (131), pp. 2–11. (in Russian). doi: 10.31799/1684-8853-2024-4-2-11
26. Litvinova T., Ryzhkova E. RusNeuroPsych: open corpus for study relations between author demographic, personality traits, lateral preferences and affect in text. International Journal of Open Information Technologies, 2018, vol. 6, no. 3, pp. 32–36.
27. Stankevich M., Ignatiev N., Smirnov I. Predicting depression with social media images. Proc. of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2020, vol. 1, pp. 235–240. doi: 10.5220/0009168602350240
28. Stankevich M.A., Smirnov I.V., Kuznetsova Y.M., Kiselnikova N.V., Enikolopov S.N. Predicting depression from essays in Russian. Proc. of the International Conference “Dialogue 2019”, 2019, pp. 647–657.
29. Stankevich M., Smirnov I., Kiselnikova N., Ushakova A. Depression detection from social media profiles. Communications in Computer and Information Science, 2020, vol. 1223, pp. 181–194. doi: 10.1007/978-3-030-51913-1_12
30. Stepanov D., Smirnov A., Ivanov E., Smirnov I., Stankevich M., Danina M. Detection of health-preserving behavior among VK.com users based on the analysis of graphic, text and numerical data. Lecture Notes in Networks and Systems, 2022, vol. 296, pp. 574–587. doi: 10.1007/978-3-030-82199-9_39
31. Kiselnikova N., Stankevich M., Danina M., Kuminskaya E., Lavrova E. Identification of informative behavior parameters in users of VKontakte social network as markers of depression. Psychology. Journal of Higher School of Economics, 2020, vol. 17, no. 1, pp. 73–88. (in Russian). doi: 10.17323/1813-8918-2020-1-73-88
32. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. Densely connected convolutional networks. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269. doi: 10.1109/CVPR.2017.243
33. Baltrusaitis T., Zadeh A., Lim Y.C., Morency L.-P. OpenFace 2.0: facial behavior analysis toolkit. Proc. of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 59–66. doi: 10.1109/FG.2018.00019
34. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, vol. 26, pp. 1–9.
35. Boersma P. Praat, a system for doing phonetics by computer. Glot International, 2001, vol. 5, no. 9/10, pp. 341–345.
36. Velichko A.N., Karpov A.A. Methods and a software system for integrative analysis of destructive paralinguistic phenomena in colloquial speech. Information and Control Systems, 2023, no. 4 (125), pp. 2–11. (in Russian). doi: 10.31799/1684-8853-2023-4-2-11
37. Gimeno-Gómez D., Bucur A.M., Cosma A., Martínez-Hinarejos C.D., Rosso P. Reading between the frames: multi-modal depression detection in videos from non-verbal cues. Lecture Notes in Computer Science, 2024, vol. 14608, pp. 191–209. doi: 10.1007/978-3-031-56027-9_12
38. Jaegle A., Gimeno F., Brock A., Zisserman A., Vinyals O., Carreira J. Perceiver: General perception with iterative attention. Proc. of the 38th International Conference on Machine Learning, 2021, vol. 139, pp. 4651–4664.
39. Li Y., Yang X., Zhao M., Wang Z., Yao Y., Qian W., Qi Sh. FPT-Former: A flexible parallel transformer of recognizing depression by using audiovisual expert-knowledge-based multimodal measures. International Journal of Intelligent Systems, 2024, art. 1564574, pp. 1–13. doi: 10.1155/2024/1564574
40. Ryumina E.V., Karpov A.A. Comparative analysis of methods for imbalance elimination of emotion classes in video data of facial expressions. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2020, vol. 20, no. 5, pp. 683–691. (in Russian). doi: 10.17586/2226-1494-2020-20-5-683-691
For citations:
Velichko A.N., Karpov A.A. A multimodal approach for depression detection using semi-automatic data annotation and deterministic machine learning methods. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(6):1107-1116. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-6-1107-1116