Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs

A. B. Ivanov; A. A. Shalyto; V. I. Ulyantsev

doi:10.17586/2226-1494-2025-25-4-718-726

Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs

A. B. Ivanov, A. A. Shalyto, V. I. Ulyantsev

https://doi.org/10.17586/2226-1494-2025-25-4-718-726

Full Text:

PDF (Rus)

Generate QR code

Abstract

The paper considers the comparative analysis of string datasets represented as time series of samples. We propose a method to increase the accuracy of determining differences between two samples. Based on this method, a method for analyzing time series of three samples has been developed, which allows for more accurate changes investigation between samples. The use of three samples in the analysis is due to the specific nature of the practical task of processing metagenomic sample sequencing data, obtaining a larger number of which is very resource-intensive. To classify strings from one sample into detected and undetected strings in another sample, a method of comparing two samples using k-mers and the de Bruijn graph is proposed. It implements decision rules based on statistics of the frequency of k-mers occurrence, different values of the parameter k, and information about possible errors in the strings. To analyze time series of three samples (the original and final sample for one object and the modifying sample for another object), a method based on pairwise comparison of samples is developed. It is used to divide the strings of each sample into groups depending on the detection of strings in other samples. The developed method for analyzing time series has been tested on two types of generated metagenomic data, represented as a set of strings. It was shown that the method allows distinguishing organisms that have differences in genomes in at least one symbol for every 10,000 symbols. High (more than 80 %) recall and precision of the results of string classification were demonstrated when analyzing simulated complex data, with properties comparable to real data. The developed method allows comparing metagenomic samples represented as a set of strings using only the data itself and without requiring additional information. This allows for a more accurate analysis compared to existing methods that compare samples based on the results of string classification in taxonomic annotation databases. The developed methods can also be used in other areas of string data processing such as analyzing changes in author style when writing a series of texts.

Keywords

time series, de Bruijn graph, k-mers, comparative analysis, strings classification, metagenomic samples

About the Authors

A. B. Ivanov

Lopukhin Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency (FRCC PCM); ITMO University
Россия

Artem B. Ivanov, Junior Researcher, PhD Student

119435; Moscow; 197101; Saint Petersburg

sc 5722243893

A. A. Shalyto

ITMO University
Россия

Anatoly A. Shalyto, D.Sc., Professor, Chief Researcher

197101; Saint Petersburg

sc 56131789500

V. I. Ulyantsev

ITMO University
Россия

Vladimir I. Ulyantsev, PhD, Associate Professor

197101; Saint Petersburg

sc 55062303000

References

1. Brown P.F., Della Pietra V.J., Desouza P.V., Lai J.C., Mercer R.L. Classbased n-gram models of natural language. Computational Linguistics, 1992, vol. 18, no. 4, pp. 467–480.

2. Cavnar W.B., Trenkle J.M. N-gram-based text categorization. Proc. of the 3<sup>rd</sup> Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94), 1994, pp. 161–175.

3. Rajalakshmi R., Aravindan C. Web page classification using n-gram based URL features. Proc. of the Fifth International Conference on Advanced Computing (ICoAC), 2013, pp. 15–21. doi: 10.1109/icoac.2013.6921920

4. Riseman E.M., Hanson A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Transactions on Computers, 1974, vol. C-23, no. 5, pp. 480–493. doi: 10.1109/T-C.1974.223971

5. Sidorov G., Gupta A., Tozer M., Catala D., Catena A., Fuentes S. Rule-based system for automatic grammar correction using syntactic n-grams for English language learning (L2). Proc. of the 17<sup>th</sup> Conference on Computational Natural Language Learning: Shared Task, 2013, pp. 96–101.

6. Barrón-Cedeño A., Rosso P. On automatic plagiarism detection based on n-grams comparison. Lecture Notes in Computer Science, 2009, vol. 548, pp. 696–700. doi: 10.1007/978-3-642-00958-7_69

7. Huang S., Zhang H., Bao E. A comprehensive review of the de Bruijn graph and its interdisciplinary applications in computing. Engineered Science, 2024, vol. 28, pp. 1061. doi: 10.30919/es1061

8. Idury R.M., Waterman M.S. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 1995, vol. 2, no. 2, pp. 291–306. doi: 10.1089/cmb.1995.2.291

9. Pevzner P.A., Tang H., Waterman M.S. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 2001, vol. 98, no. 17, pp. 9748–9753. doi: 10.1073/pnas.171285098

10. Compeau P.E.C, Pevzner P.A., Tesler G. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 2011, vol. 29, no. 11, pp. 987–991. doi: 10.1038/nbt.2023

11. Nurk S., Meleshko D., Korobeynikov A., Pevzner P.A. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 2017, vol. 27, no. 5, pp. 824–834. doi: 10.1101/gr.213959.116

12. Compeau Ph., Pevzner P. Bioinformatics Algorithms: an Active Learning Approach. Active Learning Publishers, 2018, 684 p.

13. Poisson S.-D. Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile: Précédées des Règles Générales du Calcul des Probabilités. Adegi Graphics LLC, 1999, 431 p. (in French)

14. Gourlé H., Karlsson-Lindsjö O., Hayer J., Bongcam-Rudloff E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics, 2019, vol. 35, no. 3, pp. 521–522. doi: 10.1093/bioinformatics/bty630

15. Zou Y.,Xue W., Luo G., Deng Z., Qin P., Guo R., et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nature Biotechnology, 2019, vol. 37, no. 2, pp. 179–185. doi: 10.1038/s41587-018-0008-8

16. Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., Phillippy A.M. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 2016, vol. 17, pp. 132. doi: 10.1186/s13059-016-0997-x

17. Buckland M., Gey F. The relationship between recall and precision. Journal of the American Society for Information Science, 1994, vol. 45, no. 1, pp. 12–19. doi: 10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L

Review

For citations:

Ivanov A.B., Shalyto A.A., Ulyantsev V.I. Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(4):718-726. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-4-718-726

JATS XML

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2226-1494 (Print)
ISSN 2500-0373 (Online)

Username
Password
	Remember me
Not a user? Register with this site Forgot your password?

User

Scientific and Technical Journal of Information Technologies, Mechanics and Optics

Comparative analysis method for time series data objects represented as sets of strings based on de Bruijn graphs

Full Text:

Abstract

Keywords

About the Authors

References

Review

For citations:

Cookies policy