Information Clustering Algorithms Review and Analysis

Farah Abbac Sari; Ali Abdulkarem Habib

Keywords: Clustering, Big data, Classification, Criterion.

Abstract

Clustering algorithms are a powerful tool for analysing and classifying large amounts of data: they divide the data into clusters so that objects that are similar according to chosen metrics fall into the same cluster. A large number of methods and algorithms have been developed for this task. Because these algorithms differ widely in how they work and in the parameters they require, selecting specific algorithms that give accurate results while consuming less time on large data remains an urgent problem. This paper presents an attempt to classify the existing methods and algorithms and to analyse their applications to big data processing, so that an appropriate algorithm for the hashing process can be chosen based on a comparison of performance indicators.
