Incorporating negative examples into Hidden Markov Model-based classification of peptide sequences
https://doi.org/10.17586/2226-1494-2025-25-5-888-901
Abstract
Hidden Markov Models (HMMs) trained to identify binding regions in peptide sequences have demonstrated the ability to uncover shared amino acid patterns in peptides bound to major histocompatibility complex molecules. In this work, we present an enhanced approach for predicting peptide binding using an ensemble of HMMs. Building on a previously proposed method, we extend it to a classification setting by incorporating both binding (positive) and non-binding (negative) peptide sequences. Our strategy involves training two sets of models, one on each dataset, and selecting ensemble members based on conditional probability estimates. The method was evaluated on six major histocompatibility complex alleles using two model architectures: a simplified architecture with nine states representing the peptide binding core and two cycle states for the amino acids outside this region, and an extended architecture in which each cycle state is replaced by nine additional states. The models were compared with the state-of-the-art MixMHC2pred predictor, and the results show a statistically significant improvement in prediction accuracy. Notably, incorporating non-binding peptides during training improved performance in several cases, highlighting the importance of background sequence information in distinguishing binding-specific patterns.
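The simplified architecture described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the exact transition topology (start in the left flank or at core position 1, end from core position 9 or the right flank), the parameter `p_stay`, the toy emission values, and all function names are assumptions made for illustration, and neither training nor ensemble selection is shown. The sketch computes a peptide's log-likelihood with the forward algorithm and scores it as the log-likelihood ratio between a model standing in for the binding set and one standing in for the non-binding set.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE_LEN = 9                      # nine binding-core states
N_STATES = CORE_LEN + 2           # left flank + core + right flank
LEFT, RIGHT = 0, CORE_LEN + 1     # the two cycle (self-loop) states


def build_transitions(p_stay=0.5):
    """Assumed topology: flank states loop, core states are strictly linear."""
    start = [0.0] * N_STATES
    start[LEFT] = p_stay            # begin in the left flank ...
    start[1] = 1.0 - p_stay         # ... or directly at core position 1
    trans = [[0.0] * N_STATES for _ in range(N_STATES)]
    trans[LEFT][LEFT] = p_stay
    trans[LEFT][1] = 1.0 - p_stay
    for s in range(1, CORE_LEN):
        trans[s][s + 1] = 1.0       # core position s -> s + 1
    trans[CORE_LEN][RIGHT] = p_stay
    trans[RIGHT][RIGHT] = p_stay
    end = [0.0] * N_STATES          # probability of stopping in each state
    end[CORE_LEN] = 1.0 - p_stay
    end[RIGHT] = 1.0 - p_stay
    return start, trans, end


def uniform_emissions():
    """Background-like emissions: every state emits all residues equally."""
    p = 1.0 / len(AMINO_ACIDS)
    return [{aa: p for aa in AMINO_ACIDS} for _ in range(N_STATES)]


def toy_positive_emissions():
    """Hypothetical 'binder' model: core position 1 prefers leucine."""
    emissions = uniform_emissions()
    emissions[1] = {aa: 0.02 for aa in AMINO_ACIDS}
    emissions[1]["L"] = 0.62        # 0.62 + 19 * 0.02 = 1.0
    return emissions


def log_likelihood(peptide, emissions, p_stay=0.5):
    """Scaled forward algorithm; expects a non-empty peptide."""
    start, trans, end = build_transitions(p_stay)
    alpha = [start[s] * emissions[s][peptide[0]] for s in range(N_STATES)]
    log_scale = 0.0
    for aa in peptide[1:]:
        alpha = [
            emissions[t][aa] * sum(alpha[s] * trans[s][t] for s in range(N_STATES))
            for t in range(N_STATES)
        ]
        total = sum(alpha)
        if total == 0.0:
            return float("-inf")
        alpha = [a / total for a in alpha]   # rescale to avoid underflow
        log_scale += math.log(total)
    final = sum(alpha[s] * end[s] for s in range(N_STATES))
    return log_scale + math.log(final) if final > 0.0 else float("-inf")


def binding_score(peptide, pos_emissions, neg_emissions):
    """Log-likelihood ratio; meaningful for peptides of length >= 9."""
    return log_likelihood(peptide, pos_emissions) - log_likelihood(peptide, neg_emissions)
```

Under these assumptions a peptide shorter than nine residues cannot traverse the core and receives a log-likelihood of negative infinity, while longer peptides are scored over all placements of the core within the sequence. A positive `binding_score` means the peptide is better explained by the toy positive model than by the background model.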
About the Authors
V. A. Polezhaeva
Russian Federation
Valeriia A. Polezhaeva — Student
Saint Petersburg, 197101
D. A. Kleverov
United States
Denis A. Kleverov — Visiting Researcher
sc 58741254400
Saint Louis, 631110
A. A. Shalyto
Russian Federation
Anatoly A. Shalyto — D.Sc., Full Professor
sc 56131789500
Saint Petersburg, 197101
M. Artyomov
Russian Federation
Maxim Artyomov — PhD (Chemistry), Full Professor; Professor
sc 9242717500
Saint Petersburg, 197101
Saint Louis, 631110
For citations:
Polezhaeva V.A., Kleverov D.A., Shalyto A.A., Artyomov M. Incorporating negative examples into Hidden Markov Model-based classification of peptide sequences. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(5):888-901. https://doi.org/10.17586/2226-1494-2025-25-5-888-901