Vector embeddings compression using clustering with the ensemble of oblivious decision trees and separate centroids storage
https://doi.org/10.17586/2226-1494-2025-25-5-902-909
Abstract
The modern approach to searching textual and multimodal data in large collections involves transforming documents into vector embeddings. To store these embeddings efficiently, different approaches can be used, such as quantization, which results in a loss of precision and a reduction in search accuracy. Previously, a method was proposed that reduces the loss of precision during quantization: the embeddings are clustered with the k-Means algorithm, then a bias, or delta, defined as the difference between the cluster centroid and the vector embedding, is computed, and only this delta is quantized. In this article, a modification of that method is proposed that uses a different clustering algorithm, an ensemble of Oblivious Decision Trees. The essence of the method lies in training an ensemble of binary Oblivious Decision Trees. This ensemble is used to compute a hash for each of the original vectors, and vectors with the same hash are considered to belong to the same cluster. If the resulting cluster count is too large or too small for the dataset, a reclustering step is performed. Each cluster is then stored in two files: the first contains the per-vector biases, or deltas, and the second contains identifiers and the positions of the data in the first file. The data in the first file is quantized and then compressed with a general-purpose compression algorithm. The use of Oblivious Decision Trees reduces the storage size compared to the same storage organization with k-Means clustering. The proposed clustering method was tested on the Fashion-MNIST-784-Euclidean and NYT-256-angular datasets against k-Means clustering. It demonstrates better compression quality than clustering via k-Means, with up to 4.7 % smaller storage size for NF4 quantization with the Brotli compression algorithm. For other compression algorithms, the storage size reduction is less noticeable. However, the proposed clustering algorithm yields a larger error than k-Means, up to 16 % in the worst-case scenario. Compared to Parquet, the proposed method demonstrates a lower error on the Fashion-MNIST-784-Euclidean dataset when using the FP8 and NF4 quantizations. For the NYT-256-angular dataset, the proposed method achieves better compression than Parquet for all tested quantization types. These results suggest that the proposed clustering method can be utilized not only for nearest neighbor search applications, but also for compression tasks where the increase in quantization error is acceptable.
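The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's method: a single random oblivious tree stands in for the trained ensemble, simple int8 scaling stands in for the NF4/FP8 quantization, and zlib stands in for Brotli.

```python
import zlib
import numpy as np

def odt_hash(vectors, depth=8, rng=None):
    """Hash each vector with one random oblivious decision tree.

    In an oblivious tree every node at a given level uses the same
    (feature, threshold) split, so a tree of the given depth yields
    `depth` bits per vector. Vectors sharing a hash form one cluster.
    (Illustrative stand-in for the trained ensemble in the paper.)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, dim = vectors.shape
    feats = rng.integers(0, dim, size=depth)          # one feature per level
    thresholds = np.median(vectors[:, feats], axis=0)  # one threshold per level
    bits = (vectors[:, feats] > thresholds).astype(np.uint32)
    # pack the per-level bits into a single integer hash
    return (bits * (1 << np.arange(depth, dtype=np.uint32))).sum(axis=1)

def compress_cluster(cluster_vectors):
    """Store a cluster as quantized deltas from its centroid.

    The deltas are quantized to int8 (a stand-in for NF4/FP8) and the
    quantized bytes are compressed with a general-purpose codec
    (zlib here as a stand-in for Brotli).
    """
    centroid = cluster_vectors.mean(axis=0)
    deltas = cluster_vectors - centroid
    scale = np.abs(deltas).max() / 127 or 1.0   # guard against all-zero deltas
    q = np.round(deltas / scale).astype(np.int8)
    return centroid, scale, zlib.compress(q.tobytes())
```

A reader can reconstruct a cluster by decompressing the byte blob, reshaping the int8 array, and computing `centroid + q * scale`; the per-component error is then bounded by one quantization step.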
About the Author
N. A. Tomilov, Russian Federation
Nikita A. Tomilov — PhD Student
Scopus Author ID: 57225127284
Saint Petersburg, 197101
References
1. Grbovic M., Cheng H. Real-time personalization using embeddings for search ranking at Airbnb. Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 311–320. https://doi.org/10.1145/3219819.3219885
2. Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z. Fast nearest neighbor search in medical image databases. Proc. of the 22nd International Conference on Very Large Data Bases, 1996, pp. 215–226. https://doi.org/10.5555/645922.673493
3. Zhou W., Lu Y., Li H., Tian Q. Scalar quantization for large scale image search. Proc. of the 20th ACM International Conference on Multimedia, 2012, pp. 169–178. https://doi.org/10.1145/2393347.2393377
4. Zhang J., Yang J., Yuen H. Training with low-precision embedding tables. Systems for Machine Learning Workshop at NeurIPS, 2018.
5. Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. A method of storing vector data in compressed form using clustering. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2024, vol. 24, no. 1, pp. 112–117. (in Russian). https://doi.org/10.17586/2226-1494-2024-24-1-112-117
6. Tomilov N.A., Turov V.P., Babayants A.A., Platonov A.V. Vector search using method of clustering using ensemble of oblivious trees. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2025, vol. 25, no. 2. pp. 339–344. https://doi.org/10.17586/2226-1494-2025-25-2-339-344
7. Kanungo T., Mount D., Netanyahu N., Piatko Ch., Silverman R., Wu A. The analysis of a simple k-means clustering algorithm. Proc. of the 16th Annual Symposium on Computational Geometry, 2000, pp. 100–109. https://doi.org/10.1145/336154.336189
8. Lou Y., Obukhov M. BDT: gradient boosted decision tables for high accuracy and scoring efficiency. Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1893–1901. https://doi.org/10.1145/3097983.3098175
9. Kuzmin A., van Baalen M., Ren Y., Nagel M., Peters J., Blankevoort T. FP8 quantization: the power of the exponent. Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 14651–14662.
10. Dettmers T., Pagnoni A., Holtzman A., Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 1–5.
11. Alakuijala J., Farruggia A., Ferragina P., Kliuchnikov E., Obryk R., Szabadka Z., Vandevenne L. Brotli: A general-purpose data compressor. ACM Transactions on Information Systems (TOIS), 2018, vol. 37, no. 1, pp. 1–30. https://doi.org/10.1145/3231935
12. Xiao H., Rasul K., Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017, arXiv:1708.07747. https://doi.org/10.48550/arXiv.1708.07747
13. Aumüller M., Bernhardsson E., Faithfull A. ANN-Benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Lecture Notes in Computer Science, 2017, vol. 10609, pp. 34–49. https://doi.org/10.1007/978-3-319-68474-1_3
14. Zeng X., Hui Y., Shen J., Pavlo A., McKinney W., Zhang H. An empirical evaluation of columnar storage formats. Proc. of the VLDB Endowment, 2023, vol. 17, no. 2, pp. 148–161. https://doi.org/10.14778/3626292.3626298
15. Turov V.P., Tomilov N.A., Babayants A.A. Developing a tool to compare vector search algorithms. Proc. of the 11th Congress of Young Scientists, 2022, vol. 1, pp. 446–450. (in Russian)
16. Laaber C., Leitner P. An evaluation of open-source software microbenchmark suites for continuous performance assessment. Proc. of the 15th International Conference on Mining Software Repositories (MSR '18), 2018, pp. 119–130. https://doi.org/10.1145/3196398.3196407
For citations:
Tomilov N.A. Vector embeddings compression using clustering with the ensemble of oblivious decision trees and separate centroids storage. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(5):902-909. https://doi.org/10.17586/2226-1494-2025-25-5-902-909