Feature extraction methods for metagenome de Bruijn graphs collections based on samples classification information
https://doi.org/10.17586/2226-1494-2025-25-3-545-553
Abstract
This paper addresses the comparative analysis of collections of metagenomic samples using de Bruijn graphs. We propose methods for automatic feature extraction that combine the results of comparative sample analysis, expert metadata, and statistical tests to improve the accuracy of classification models. Here, features are connected subgraphs of the de Bruijn graph. The first method, unique_kmers, extracts strings of length k (k-mers) that occur only in samples of a given class. The second method, stats_kmers, extracts k-mers whose frequency of occurrence differs statistically between sample classes. To obtain interpretable features, a third method extracts subgraphs from the de Bruijn graphs, starting from the nodes selected by one of the first two methods. Data analysis therefore has two stages: first, unique_kmers or stats_kmers is applied to preprocess the data; second, the third method is applied to obtain interpretable features. The methods were tested on four generated datasets that model properties of real metagenomic communities, such as the presence of closely related species (strains) or differences in the relative abundance of bacteria. A machine learning model was trained on the extracted features to classify samples from the test datasets. For comparison, taxonomic annotations of the samples produced by Kraken2 were used as features. Classification accuracy was higher for models trained on features obtained with the proposed methods than for models trained on taxonomic features. The developed methods are useful for comparative analysis of metagenomic sequencing data and can form the basis of decision support systems, for example in the diagnosis of human diseases from gut microbiota sequencing data.
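The two-stage idea in the abstract (select class-specific k-mers, then group them into connected subgraphs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmers`, `class_unique_kmers`, and `connected_components` are hypothetical, samples are assumed to be small lists of read strings held in memory, reverse complements are ignored, and the statistical selection used by stats_kmers is not shown.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def class_unique_kmers(samples, labels, target, k):
    """k-mers seen in at least one `target` sample and in no sample of another class.

    Hypothetical helper illustrating the idea behind unique_kmers;
    each sample is a list of read strings.
    """
    inside, outside = set(), set()
    for reads, label in zip(samples, labels):
        sample_kmers = set().union(*(kmers(read, k) for read in reads))
        (inside if label == target else outside).update(sample_kmers)
    return inside - outside


def connected_components(selected):
    """Group selected k-mers into connected subgraphs of the de Bruijn graph.

    Nodes are k-mers; two k-mers are adjacent when the (k-1)-suffix of one
    equals the (k-1)-prefix of the other. Edge direction is ignored.
    """
    by_prefix, by_suffix = defaultdict(set), defaultdict(set)
    for km in selected:
        by_prefix[km[:-1]].add(km)
        by_suffix[km[1:]].add(km)

    seen, components = set(), []
    for start in sorted(selected):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            # Successors start with node's suffix; predecessors end with node's prefix.
            neighbours = by_prefix.get(node[1:], set()) | by_suffix.get(node[:-1], set())
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Toy data: two samples of class "A" and one of class "B" (made up for illustration).
samples = [["ACGTAC"], ["ACGTTT"], ["GGGTAC"]]
labels = ["A", "A", "B"]

selected = class_unique_kmers(samples, labels, target="A", k=3)
features = connected_components(selected)
print(sorted(selected))
print([sorted(c) for c in features])
```

In this toy example each connected component would serve as one candidate feature. Working with subgraphs rather than individual k-mers is what the abstract describes as making the features interpretable, since a connected group of k-mers corresponds to a longer stretch of sequence.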
About the Authors
A. B. Ivanov
Russian Federation
Artem B. Ivanov — Junior Researcher
Moscow, 119435;
PhD Student
Saint Petersburg, 197101
sc 57222438932
A. A. Shalyto
Russian Federation
Anatoly A. Shalyto — D.Sc., Chief Researcher, Full Professor
Saint Petersburg, 197101
sc 56131789500
V. I. Ulyantsev
Russian Federation
Vladimir I. Ulyantsev — PhD, Associate Professor
Saint Petersburg, 197101
sc 55062303000
For citations:
Ivanov A.B., Shalyto A.A., Ulyantsev V.I. Feature extraction methods for metagenome de Bruijn graphs collections based on samples classification information. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(3):545-553. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-3-545-553