An Efficient Algorithm for Initializing Centroids in K-means Clustering
DOI:
https://doi.org/10.31642/JoKMC/2018/030203Keywords:
Data Mining, Clustering, K-means, Centroids, ComplexityAbstract
Clustering represents one of the most popular knowledge extraction algorithms in data mining techniques. Hierarchical and partitioning approaches are widely used in this field. Each has its own advantages, drawbacks and goals. K-means represents the most popular partitioning clusteringtechnique, however it suffers from two major drawbacks; time complexity and its sensitivity to the initial centroid values. The work in this paper presents an approach for estimating the starting initial centroids throughout three process including density based, normalization and smoothing ideas. The proposed algorithm has a strong mathematical foundation. The proposed approach was tested using a free standard data (20000 records). The results showed that the approach has better complexity and ensures the clustering convergence.Downloads
References
Jiawei Han and Micheline Kamber “ Data Mining: Concepts and Techniques” 3rd Edition., Morgan Kaufmann, 2010.
M. Steinbach, P.-N.Tan and V.Kumar, “Introduction to Data Mining”,Addison-Wesley, 2006. ISBN: 0-321-32136-7
M. H. Dunham, “Data Mining:Introductory and Advanced Topics”,Prentice Hall, 2002.
D. J. Hand, H. Mannila, and P. Smyth, “Principles of Data Mining”, MIT Press,2001.
A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys 31 (3) (1999)264–323. DOI: https://doi.org/10.1145/331499.331504
L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience,1990. DOI: https://doi.org/10.1002/9780470316801
D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-Hardness of Euclidean Sum-of-Squares Clustering, MachineLearning 75 (2) (2009) 245–248. DOI: https://doi.org/10.1007/s10994-009-5103-0
A. K. Jain, Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666. DOI: https://doi.org/10.1016/j.patrec.2009.09.011
M. Emre Celebi,Hassan A. Kingravi and Patricio A. Velain in theirpaper “A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm”, Expert Systems with Applications, 40(1): 200–210, 2013 DOI: https://doi.org/10.1016/j.eswa.2012.07.021
J. M. Pena, J. A. Lozano, P. Larranaga, An Empirical Comparison of Four Initialization Methods for theK-Means Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040. DOI: https://doi.org/10.1016/S0167-8655(99)00069-0
J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low, Initialization of Cluster Refinement Algorithms: A Reviewand Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, 2004, pp. 297–302.
[12] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in: Proc. of the 5thBerkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
R. C. Jancey, Multidimensional Group Analysis, Australian Journal of Botany 14 (1) (1966) 127–130. DOI: https://doi.org/10.1071/BT9660127
E. Forgy, Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification, Biometrics 21(1965) 768.
T. Gonzalez, Clustering to Minimize the Maximum Intercluster Distance, Theoretical Computer Science 38 (2–3) (1985) 293–306. DOI: https://doi.org/10.1016/0304-3975(85)90224-5
J. A. Hartigan, M. A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the RoyalStatistical Society C 28 (1) (1979) 100–108. DOI: https://doi.org/10.2307/2346830
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2016 Dr. Ahmed Hussain Aliwy, Dr. Kadhim B. S. Aljanabi
This work is licensed under a Creative Commons Attribution 4.0 International License.
which allows users to copy, create extracts, abstracts, and new works from the Article, alter and revise the Article, and make commercial use of the Article (including reuse and/or resale of the Article by commercial entities), provided the user gives appropriate credit (with a link to the formal publication through the relevant DOI), provides a link to the license, indicates if changes were made and the licensor is not represented as endorsing the use made of the work.