Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs
https://doi.org/10.17586/2226-1494-2025-25-4-718-726
Abstract
The paper considers the comparative analysis of string datasets represented as time series of samples. We propose a method to increase the accuracy of determining differences between two samples. Based on this method, a method for analyzing time series of three samples has been developed, which allows for more accurate changes investigation between samples. The use of three samples in the analysis is due to the specific nature of the practical task of processing metagenomic sample sequencing data, obtaining a larger number of which is very resource-intensive. To classify strings from one sample into detected and undetected strings in another sample, a method of comparing two samples using k-mers and the de Bruijn graph is proposed. It implements decision rules based on statistics of the frequency of k-mers occurrence, different values of the parameter k, and information about possible errors in the strings. To analyze time series of three samples (the original and final sample for one object and the modifying sample for another object), a method based on pairwise comparison of samples is developed. It is used to divide the strings of each sample into groups depending on the detection of strings in other samples. The developed method for analyzing time series has been tested on two types of generated metagenomic data, represented as a set of strings. It was shown that the method allows distinguishing organisms that have differences in genomes in at least one symbol for every 10,000 symbols. High (more than 80 %) recall and precision of the results of string classification were demonstrated when analyzing simulated complex data, with properties comparable to real data. The developed method allows comparing metagenomic samples represented as a set of strings using only the data itself and without requiring additional information. This allows for a more accurate analysis compared to existing methods that compare samples based on the results of string classification in taxonomic annotation databases. The developed methods can also be used in other areas of string data processing such as analyzing changes in author style when writing a series of texts.
About the Authors
A. B. IvanovRussian Federation
Artem B. Ivanov, Junior Researcher, PhD Student
119435; Moscow; 197101; Saint Petersburg
sc 5722243893
A. A. Shalyto
Russian Federation
Anatoly A. Shalyto, D.Sc., Professor, Chief Researcher
197101; Saint Petersburg
sc 56131789500
V. I. Ulyantsev
Russian Federation
Vladimir I. Ulyantsev, PhD, Associate Professor
197101; Saint Petersburg
sc 55062303000
References
1. Brown P.F., Della Pietra V.J., Desouza P.V., Lai J.C., Mercer R.L. Classbased n-gram models of natural language. Computational Linguistics, 1992, vol. 18, no. 4, pp. 467–480.
2. Cavnar W.B., Trenkle J.M. N-gram-based text categorization. Proc. of the 3<sup>rd</sup> Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94), 1994, pp. 161–175.
3. Rajalakshmi R., Aravindan C. Web page classification using n-gram based URL features. Proc. of the Fifth International Conference on Advanced Computing (ICoAC), 2013, pp. 15–21. doi: 10.1109/icoac.2013.6921920
4. Riseman E.M., Hanson A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Transactions on Computers, 1974, vol. C-23, no. 5, pp. 480–493. doi: 10.1109/T-C.1974.223971
5. Sidorov G., Gupta A., Tozer M., Catala D., Catena A., Fuentes S. Rule-based system for automatic grammar correction using syntactic n-grams for English language learning (L2). Proc. of the 17<sup>th</sup> Conference on Computational Natural Language Learning: Shared Task, 2013, pp. 96–101.
6. Barrón-Cedeño A., Rosso P. On automatic plagiarism detection based on n-grams comparison. Lecture Notes in Computer Science, 2009, vol. 548, pp. 696–700. doi: 10.1007/978-3-642-00958-7_69
7. Huang S., Zhang H., Bao E. A comprehensive review of the de Bruijn graph and its interdisciplinary applications in computing. Engineered Science, 2024, vol. 28, pp. 1061. doi: 10.30919/es1061
8. Idury R.M., Waterman M.S. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 1995, vol. 2, no. 2, pp. 291–306. doi: 10.1089/cmb.1995.2.291
9. Pevzner P.A., Tang H., Waterman M.S. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 2001, vol. 98, no. 17, pp. 9748–9753. doi: 10.1073/pnas.171285098
10. Compeau P.E.C, Pevzner P.A., Tesler G. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 2011, vol. 29, no. 11, pp. 987–991. doi: 10.1038/nbt.2023
11. Nurk S., Meleshko D., Korobeynikov A., Pevzner P.A. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 2017, vol. 27, no. 5, pp. 824–834. doi: 10.1101/gr.213959.116
12. Compeau Ph., Pevzner P. Bioinformatics Algorithms: an Active Learning Approach. Active Learning Publishers, 2018, 684 p.
13. Poisson S.-D. Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile: Précédées des Règles Générales du Calcul des Probabilités. Adegi Graphics LLC, 1999, 431 p. (in French)
14. Gourlé H., Karlsson-Lindsjö O., Hayer J., Bongcam-Rudloff E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics, 2019, vol. 35, no. 3, pp. 521–522. doi: 10.1093/bioinformatics/bty630
15. Zou Y.,Xue W., Luo G., Deng Z., Qin P., Guo R., et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nature Biotechnology, 2019, vol. 37, no. 2, pp. 179–185. doi: 10.1038/s41587-018-0008-8
16. Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., Phillippy A.M. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 2016, vol. 17, pp. 132. doi: 10.1186/s13059-016-0997-x
17. Buckland M., Gey F. The relationship between recall and precision. Journal of the American Society for Information Science, 1994, vol. 45, no. 1, pp. 12–19. doi: 10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L
Review
For citations:
Ivanov A.B., Shalyto A.A., Ulyantsev V.I. Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(4):718-726. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-4-718-726