Scientific and Technical Journal of Information Technologies, Mechanics and Optics

Clustering in big data analytics: a systematic review and comparative analysis (review article)

https://doi.org/10.17586/2226-1494-2023-23-5-967-979

Abstract

In the modern world, the widespread use of information and communication technology has led to the accumulation of vast and diverse quantities of data, commonly known as Big Data. This creates a need for novel concepts and analytical techniques that help individuals extract meaningful insights from rapidly growing volumes of digital data. Clustering is a fundamental data mining approach for retrieving valuable information. Although a wide range of clustering methods has been described and implemented in various fields, their sheer variety makes it difficult to keep up with the latest advances. This study provides a comprehensive evaluation of the clustering algorithms developed for Big Data, highlighting their various features. It also reports empirical evaluations on six large datasets, using several validity metrics and computing time to assess the performance of the clustering methods under consideration.

About the Author

H. Shili
University of Tabuk
Saudi Arabia

Hechmi Shili — PhD, Assistant Professor

Scopus Author ID: 26027243600

Tabuk, 47311




Review

For citations:


Shili H. Clustering in big data analytics: a systematic review and comparative analysis (review article). Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2023;23(5):967-979. https://doi.org/10.17586/2226-1494-2023-23-5-967-979


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2226-1494 (Print)
ISSN 2500-0373 (Online)