Scientific and Technical Journal of Information Technologies, Mechanics and Optics

Feature extraction methods for metagenome de Bruijn graphs collections based on samples classification information

https://doi.org/10.17586/2226-1494-2025-25-3-545-553

Abstract

The paper addresses the comparative analysis of collections of metagenomic samples using de Bruijn graphs. We propose methods for automatic feature extraction that combine the results of comparative sample analysis, expert metadata, and statistical tests in order to improve the accuracy of classification models. In this paper, features are connected subgraphs of the de Bruijn graph. The first method, unique_kmers, extracts strings of length k (k-mers) that occur only in samples of a particular class. The second method, stats_kmers, extracts k-mers whose frequency of occurrence differs statistically between sample classes. To obtain interpretable features, a third method extracts subgraphs from the de Bruijn graphs, starting from the nodes selected by one of the first two methods. Data analysis therefore consists of two stages: first, unique_kmers or stats_kmers is applied for data preprocessing; second, the third method is applied to obtain interpretable features. The methods were tested on four generated datasets that model properties of real metagenomic communities, such as the presence of closely related species (strains) or differences in the relative abundance of bacteria. A machine learning model was trained on the extracted features to classify samples from the test datasets. For comparison, the results of taxonomic annotation of the samples with the Kraken2 program were used as features. Classification models trained on features obtained with the proposed methods classified samples more accurately than models trained on taxonomic features. The developed methods are useful for comparative analysis of metagenomic sequencing data and can form the basis of decision support systems, for example in the diagnosis of human diseases from gut microbiota sequencing data.
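
The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmers`, `unique_kmers`, and `components` and the toy reads are hypothetical. It assumes that each sample is a list of read strings, that graph nodes are k-mers, and that two k-mers are adjacent when the (k-1)-suffix of one equals the (k-1)-prefix of the other. Reverse complements, k-mer counts, and the statistical test behind stats_kmers are deliberately left out.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(samples, labels, target, k):
    """k-mers present in at least one `target` sample and in no other class.

    Assumption: `samples` is a list of samples, each a list of read strings.
    """
    in_target, elsewhere = set(), set()
    for reads, label in zip(samples, labels):
        found = set().union(*(kmers(r, k) for r in reads))
        (in_target if label == target else elsewhere).update(found)
    return in_target - elsewhere


def components(nodes):
    """Split the selected k-mers into connected components.

    Assumption: an (undirected) link exists between two k-mers when the
    (k-1)-suffix of one equals the (k-1)-prefix of the other.
    """
    by_prefix = defaultdict(list)
    for n in nodes:
        by_prefix[n[:-1]].append(n)

    adj = defaultdict(set)
    for n in nodes:
        for m in by_prefix.get(n[1:], []):
            adj[n].add(m)
            adj[m].add(n)

    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            cur = stack.pop()
            comp.add(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(comp)
    return comps


# Toy data: two samples of class "a", one of class "b".
samples = [["ACGTAC"], ["ACGTTT"], ["GGGTTT"]]
labels = ["a", "a", "b"]

selected = unique_kmers(samples, labels, target="a", k=3)
features = components(selected)
```

In this toy run the k-mers shared with the class "b" sample are discarded, and the remaining class-specific k-mers are grouped into connected components, each of which would serve as one candidate feature.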

About the Authors

A. B. Ivanov
Lopukhin FRCC PCM; ITMO University
Russian Federation

Artem B. Ivanov — Junior Researcher

Moscow, 119435;

PhD Student

Saint Petersburg, 197101

sc 57222438932



A. A. Shalyto
ITMO University
Russian Federation

Anatoly A. Shalyto — D.Sc., Chief Researcher, Full Professor

Saint Petersburg, 197101

sc 56131789500



V. I. Ulyantsev
ITMO University
Russian Federation

Vladimir I. Ulyantsev — PhD, Associate Professor

Saint Petersburg, 197101

sc 55062303000




For citations:


Ivanov A.B., Shalyto A.A., Ulyantsev V.I. Feature extraction methods for metagenome de Bruijn graphs collections based on samples classification information. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(3):545-553. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-3-545-553


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2226-1494 (Print)
ISSN 2500-0373 (Online)