An Approach for Solving Missing Values in Data Set Using Clustering-Curve Fitting Technique

Missing values in data sets represent one of the greatest challenges in analyzing data to extract knowledge. The work in this paper presents a new approach for solving the missing-values problem by merging two different techniques: clustering (K-means and Expectation Maximization) and curve fitting. More than twenty thousand records of a real health data set collected from different Iraqi hospitals were used to create and test the proposed approach, which showed better results than the most popular techniques for estimating missing values, such as most common value, overall average, class average, and class most common value. Several software tools were used in the proposed work, including WEKA (Waikato Environment for Knowledge Analysis), Matlab, Excel, and C++.


Introduction
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as classification, association, prediction, clustering, data summarization, learning classification rules, analyzing changes, and detecting anomalies.
The analogy with the mining process is described as follows: data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is useful." Basically, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns, by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one has noticed them before.
Due to the integration of data from different sources, the overall mining process suffers from three main difficulties: (1) incompleteness, (2) inconsistency, and (3) noisy data. Incompleteness refers to the existence of what is known as missing values in attributes or tuples, which may occur for many reasons, such as human errors, instrument malfunction, and others. Inconsistency refers to different naming and representation of data, since the data are collected from different sources, each with its own data formats and platforms. Noisy data may occur due to human errors, data conversion, machine restrictions, and others.
In On-Line Transaction Processing (OLTP), the difficulties mentioned above do not represent a big obstacle in data processing, whereas they represent a big challenge in On-Line Analytical Processing (OLAP), because the data must be prepared for the analysis process rather than for the transactional approach.

Data Collection
In this research, real data were collected from Al-Sader Medical City in Najaf, Iraq. More than 61500 records were collected, of which only 20000 records have been used in the proposed solutions; Table (1) shows a sample of the collected data. One hundred records from the selected data were turned into missing-value cases by deleting the age field. The collected data were stored in FoxPro. The data were then converted into Excel format by exporting them to HTML format and importing the result into Excel.
The following characteristics were noticed in the collected data: (1) some of the collected information was not consistent with the goal of the project and the whole idea of the patient data, and is not useful for analysis purposes; (2) noisy data caused by human errors were deleted from the data set, but such records were very rare and do not affect the overall analytical process.

Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid (denoting the "average object," or average point in space for the cluster). The error of a partition into k clusters is typically measured by the square-error criterion, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
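As a concrete illustration, the following C++ sketch (not from the original paper; the ages and cluster assignments are hypothetical) computes the square-error criterion E for one-dimensional points that have already been assigned to clusters:

#include <cstdio>
#include <vector>

// Sum of squared distances from each point to its cluster mean (1-D case).
double squareError(const std::vector<double>& points,
                   const std::vector<int>& assignment, int k) {
    std::vector<double> mean(k, 0.0);
    std::vector<int> count(k, 0);
    for (size_t i = 0; i < points.size(); ++i) {   // cluster means m_i
        mean[assignment[i]] += points[i];
        count[assignment[i]] += 1;
    }
    for (int c = 0; c < k; ++c)
        if (count[c] > 0) mean[c] /= count[c];
    double e = 0.0;
    for (size_t i = 0; i < points.size(); ++i) {   // accumulate |p - m_i|^2
        double d = points[i] - mean[assignment[i]];
        e += d * d;
    }
    return e;
}

int main() {
    std::vector<double> age = {20, 22, 25, 60, 62, 65};  // hypothetical ages
    std::vector<int> cluster = {0, 0, 0, 1, 1, 1};       // assigned clusters
    printf("E = %f\n", squareError(age, cluster, 2));
    return 0;
}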

Curve Fitting Techniques
There are two general approaches to curve fitting, distinguished from each other on the basis of the amount of error associated with the data. First, where the data exhibits a significant degree of error or "noise," the strategy is to derive a single curve that represents the general trend of the data. Because any individual data point may be incorrect, no effort is made to pass through every point; rather, the curve is designed to follow the pattern of the points taken as a group. One approach of this nature is called least-squares regression.
Second, where the data is known to be very precise, the basic approach is to fit a curve or a series of curves that pass directly through each of the points. Such data usually originates from tables, for example the values of the density of water or of the heat capacity of gases as a function of temperature. The estimation of values between well-known discrete points is called interpolation.
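For instance, a minimal C++ sketch of linear interpolation between two tabulated points might look as follows (the water-density table values are illustrative only):

#include <cstdio>

// Linear interpolation between two known table points (x0,y0) and (x1,y1).
double lerp(double x0, double y0, double x1, double y1, double x) {
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

int main() {
    // Illustrative table values: density of water (kg/m^3) at 20 C and 30 C.
    double d25 = lerp(20.0, 998.2, 30.0, 995.7, 25.0);
    printf("estimated density at 25 C: %.2f\n", d25);
    return 0;
}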

Least Squares Method
The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the squared deviations (least square error) from a given set of data. Suppose that the data points are (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), where x is the independent variable and y is the dependent variable. The fitting curve f(x) has the deviation (error) d_i from each data point:

d_i = y_i - f(x_i), \quad i = 1, 2, \ldots, n \quad \text{...(3)}

According to the method of least squares, the best fitting curve has the property that

\Pi = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \text{minimum} \quad \text{...(4)}

For a second-degree polynomial,

f(x) = a + b x + c x^2 \quad \text{...(5)}

a, b, and c are unknown coefficients, while all x_i and y_i are given. To obtain the least square error, the unknown coefficients must yield zero first derivatives:

\frac{\partial \Pi}{\partial a} = -2 \sum_{i=1}^{n} [y_i - (a + b x_i + c x_i^2)] = 0
\frac{\partial \Pi}{\partial b} = -2 \sum_{i=1}^{n} x_i [y_i - (a + b x_i + c x_i^2)] = 0
\frac{\partial \Pi}{\partial c} = -2 \sum_{i=1}^{n} x_i^2 [y_i - (a + b x_i + c x_i^2)] = 0

Expanding the above equations, we have

n a + b \sum x_i + c \sum x_i^2 = \sum y_i
a \sum x_i + b \sum x_i^2 + c \sum x_i^3 = \sum x_i y_i
a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 = \sum x_i^2 y_i

Putting these into matrix form gives

\begin{pmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{pmatrix}

Since the data points (x_i, y_i) for i = 1, 2, \ldots, n are given, all the summation terms in the matrix are known, so the unknowns are a, b, and c.
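The following C++ sketch (an illustration, not the paper's original code) builds these summation terms and solves the resulting 3x3 normal equations by Cramer's rule to fit y = a + bx + cx^2; the sample data points are hypothetical:

#include <cstdio>
#include <vector>

// Fit y = a + b*x + c*x^2 by solving the 3x3 normal equations (Cramer's rule).
void quadFit(const std::vector<double>& x, const std::vector<double>& y,
             double& a, double& b, double& c) {
    double s1 = x.size(), sx = 0, sx2 = 0, sx3 = 0, sx4 = 0;
    double sy = 0, sxy = 0, sx2y = 0;
    for (size_t i = 0; i < x.size(); ++i) {          // summation terms
        double xi = x[i], x2 = xi * xi;
        sx += xi; sx2 += x2; sx3 += x2 * xi; sx4 += x2 * x2;
        sy += y[i]; sxy += xi * y[i]; sx2y += x2 * y[i];
    }
    // Coefficient matrix M and right-hand side v of the normal equations.
    double M[3][3] = {{s1, sx, sx2}, {sx, sx2, sx3}, {sx2, sx3, sx4}};
    double v[3] = {sy, sxy, sx2y};
    auto det3 = [](double m[3][3]) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    };
    double D = det3(M), sol[3];
    for (int col = 0; col < 3; ++col) {   // Cramer's rule, column by column
        double Mc[3][3];
        for (int r = 0; r < 3; ++r)
            for (int cc = 0; cc < 3; ++cc)
                Mc[r][cc] = (cc == col) ? v[r] : M[r][cc];
        sol[col] = det3(Mc) / D;
    }
    a = sol[0]; b = sol[1]; c = sol[2];
}

int main() {
    std::vector<double> x = {1, 2, 3, 4, 5};             // hypothetical data
    std::vector<double> y = {2.1, 4.2, 8.1, 13.9, 22.2};
    double a, b, c;
    quadFit(x, y, a, b, c);
    printf("y = %.3f + %.3f x + %.3f x^2\n", a, b, c);
    return 0;
}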

Dealing with Missing Values
Table (1) shows a sample of the data set that contains some missing values in the patient age attribute. These missing values must be replaced in order to prepare the data for the analysis process. Good decisions rely mainly on good data analysis reports (knowledge extracted from the data), which in turn require good-quality data, meaning complete, correct, and consistent data; missing values are therefore at the core of this research. Optimizing the algorithms using entropy and information gain is one of the research goals. The following algorithms and techniques are the most common for replacing missing values:

Most Common Values (Mode)
In some cases, missing values can be replaced by the most common value in the data set, especially when the frequency of the most common value represents a high percentage of the given data. In our case, for the patient data set given in Table (1), the most common value is shown in Table (2).
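A minimal C++ sketch of this strategy, assuming a hypothetical MISSING marker for absent ages:

#include <cstdio>
#include <map>
#include <vector>

const int MISSING = -1;   // hypothetical marker for a missing age

// Replace missing values with the most common (mode) value in the column.
void fillWithMode(std::vector<int>& age) {
    std::map<int, int> freq;
    for (int a : age)
        if (a != MISSING) ++freq[a];
    int mode = MISSING, best = 0;
    for (auto& p : freq)
        if (p.second > best) { best = p.second; mode = p.first; }
    for (int& a : age)
        if (a == MISSING) a = mode;
}

int main() {
    std::vector<int> age = {30, 45, 30, MISSING, 30, 52, MISSING};
    fillWithMode(age);
    for (int a : age) printf("%d ", a);   // missing entries become 30
    printf("\n");
    return 0;
}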

Overall Data Set Average
Numeric attributes have more alternatives and algorithms for filling missing values in the data set, one of which is the overall average of the whole data set for the given attribute. This method is used when the frequencies of the categories are close to each other. However, in some cases it may lead to results with a high standard deviation from the original. For the data set used in this research, the overall average for the age attribute is shown in Table (2).
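The corresponding sketch for the overall average (again with a hypothetical MISSING marker) is analogous to the mode sketch above:

#include <cstdio>
#include <vector>

const double MISSING = -1.0;  // hypothetical marker for a missing age

// Replace missing values with the overall average of the observed values.
void fillWithAverage(std::vector<double>& age) {
    double sum = 0; int n = 0;
    for (double a : age)
        if (a != MISSING) { sum += a; ++n; }
    double avg = (n > 0) ? sum / n : 0.0;
    for (double& a : age)
        if (a == MISSING) a = avg;
}

int main() {
    std::vector<double> age = {30, 45, MISSING, 52};
    fillWithAverage(age);
    for (double a : age) printf("%.2f ", a);  // missing entry becomes 42.33
    printf("\n");
    return 0;
}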

Class Most Common Values
As classification represents putting objects together according to predefined specifications, the objects of a specific class have a lot in common. This characteristic can be used to replace missing values. To get a better estimation of the missing values in the data set, more efficient algorithms are available, one of which is to classify the data set according to some specific attributes; in our case, the data set is classified according to the patient's gender (male and female) and the disease type. The most common values are then found for each class, and these values are used to fill in the missing values of the objects under the same class. This method can be used for all data types of the attributes, as shown in Tables (3), (4), (5) and (6).
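A sketch of the per-class variant, where the class key combines the gender and disease fields (the Patient record layout and MISSING marker are illustrative assumptions):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

const int MISSING = -1;  // hypothetical marker for a missing age

struct Patient { std::string gender, disease; int age; };

// Replace each missing age with the most common age of the same
// (gender, disease) class.
void fillWithClassMode(std::vector<Patient>& data) {
    std::map<std::pair<std::string, std::string>, std::map<int, int>> freq;
    for (const Patient& p : data)
        if (p.age != MISSING) ++freq[{p.gender, p.disease}][p.age];
    for (Patient& p : data) {
        if (p.age != MISSING) continue;
        int mode = MISSING, best = 0;
        for (auto& kv : freq[{p.gender, p.disease}])
            if (kv.second > best) { best = kv.second; mode = kv.first; }
        p.age = mode;
    }
}

int main() {
    std::vector<Patient> data = {
        {"M", "diabetes", 55}, {"M", "diabetes", 55},
        {"F", "asthma", 20},   {"M", "diabetes", MISSING},
    };
    fillWithClassMode(data);
    printf("imputed age: %d\n", data[3].age);  // prints 55
    return 0;
}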

Class Average
When the frequencies of occurrence of the objects within one class are close to each other, the overall class average can be used to replace the missing values. For numeric attributes, the class average is an effective method for filling in missing values. Dividing the original table into smaller tables according to some predefined classes leads to a better estimation of the targeted missing values, as shown in Tables (3), (4), (5) and (6).
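The class-average variant mirrors the previous sketch, replacing the per-class mode with a per-class mean (same illustrative Patient layout and MISSING marker):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

const double MISSING = -1.0;  // hypothetical marker for a missing age

struct Patient { std::string gender, disease; double age; };

// Replace each missing age with the average age of its (gender, disease) class.
void fillWithClassAverage(std::vector<Patient>& data) {
    std::map<std::pair<std::string, std::string>, std::pair<double, int>> acc;
    for (const Patient& p : data)
        if (p.age != MISSING) {
            acc[{p.gender, p.disease}].first += p.age;   // running sum
            acc[{p.gender, p.disease}].second += 1;      // running count
        }
    for (Patient& p : data)
        if (p.age == MISSING) {
            auto& s = acc[{p.gender, p.disease}];
            if (s.second > 0) p.age = s.first / s.second;
        }
}

int main() {
    std::vector<Patient> data = {
        {"F", "asthma", 24}, {"F", "asthma", 30}, {"F", "asthma", MISSING},
    };
    fillWithClassAverage(data);
    printf("imputed age: %.1f\n", data[2].age);  // prints 27.0
    return 0;
}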

Expectation Maximization (EM) Clustering Technique and K-Means Clustering
When the attributes of the data to be classified are not defined prior to the classification process, the task is known as clustering: putting patients into groups such that each group has some common characteristics (similarities) among its objects and dissimilarities with the other groups. There are many clustering techniques, but in our research we used the k-means and EM algorithms; the results are given in Tables (7) and (8).

Curve Fitting
After clustering, curve fitting is applied to each cluster: instead of taking the average or mean of each cluster to estimate the missing values, a curve is fitted to the cluster's data and the missing values are read off the fitted curve. The results are shown in Table (11).
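A minimal sketch of this combined step, using a first-degree (linear) least-squares fit per cluster for brevity; a quadratic fit would follow the same pattern using the normal equations shown earlier. The diseaseCode field and MISSING marker are illustrative assumptions:

#include <cstdio>
#include <vector>

const double MISSING = -1.0;  // hypothetical marker for a missing age

struct Record { int cluster; double diseaseCode, age; };

// For each cluster, fit age = a + b * diseaseCode by least squares over the
// complete records, then estimate each missing age from the fitted line.
void imputeByClusterCurve(std::vector<Record>& data, int k) {
    for (int c = 0; c < k; ++c) {
        double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (const Record& r : data)
            if (r.cluster == c && r.age != MISSING) {
                n += 1; sx += r.diseaseCode; sy += r.age;
                sxx += r.diseaseCode * r.diseaseCode;
                sxy += r.diseaseCode * r.age;
            }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
        double a = (sy - b * sx) / n;                          // intercept
        for (Record& r : data)
            if (r.cluster == c && r.age == MISSING)
                r.age = a + b * r.diseaseCode;
    }
}

int main() {
    std::vector<Record> data = {
        {0, 1, 20}, {0, 2, 30}, {0, 3, MISSING},   // hypothetical cluster 0
    };
    imputeByClusterCurve(data, 1);
    printf("imputed age: %.1f\n", data[2].age);    // prints 40.0
    return 0;
}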
K-means Algorithm
The classic clustering technique is called k-means. First, you specify in advance how many clusters are being sought: this is the parameter k. Then k points are chosen at random as cluster centers. All instances are assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next, the centroid, or mean, of the instances in each cluster is calculated; this is the "means" part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same forever [6,7].
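A compact C++ sketch of the algorithm for one-dimensional data; for simplicity it seeds the centers with the first k points instead of random ones, and the ages are hypothetical:

#include <cmath>
#include <cstdio>
#include <vector>

// Basic 1-D k-means: returns cluster centers; assignment holds each point's
// nearest center, recomputed until assignments stop changing.
std::vector<double> kMeans(const std::vector<double>& pts, int k,
                           std::vector<int>& assignment) {
    std::vector<double> center(pts.begin(), pts.begin() + k);  // initial centers
    assignment.assign(pts.size(), -1);
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < pts.size(); ++i) {   // assign to nearest center
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (std::fabs(pts[i] - center[c]) <
                    std::fabs(pts[i] - center[best])) best = c;
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }
        std::vector<double> sum(k, 0.0);
        std::vector<int> cnt(k, 0);
        for (size_t i = 0; i < pts.size(); ++i) {   // recompute the means
            sum[assignment[i]] += pts[i];
            cnt[assignment[i]] += 1;
        }
        for (int c = 0; c < k; ++c)
            if (cnt[c] > 0) center[c] = sum[c] / cnt[c];
    }
    return center;
}

int main() {
    std::vector<double> age = {20, 22, 25, 60, 62, 65};  // hypothetical ages
    std::vector<int> a;
    std::vector<double> c = kMeans(age, 2, a);
    printf("centers: %.1f %.1f\n", c[0], c[1]);
    return 0;
}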
The diagram shown in Figure (2) represents the k-means algorithm.

Figure (2): Steps of the k-means Algorithm

The EM (Expectation Maximization) Algorithm
The EM (Expectation-Maximization) algorithm is a popular iterative refinement algorithm that can be used for finding parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean.
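A sketch of EM for a two-component, one-dimensional Gaussian mixture (the ages and the initial parameter guesses are hypothetical, and a fixed iteration count stands in for a convergence test):

#include <cmath>
#include <cstdio>
#include <vector>

const double PI = 3.141592653589793;

// Gaussian density with mean mu and variance var.
double gauss(double x, double mu, double var) {
    return std::exp(-(x - mu) * (x - mu) / (2 * var)) / std::sqrt(2 * PI * var);
}

int main() {
    std::vector<double> x = {20, 22, 25, 60, 62, 65};  // hypothetical ages
    double mu[2] = {20, 65}, var[2] = {25, 25}, w[2] = {0.5, 0.5};
    for (int it = 0; it < 50; ++it) {
        std::vector<double> r(x.size());   // responsibility of cluster 0
        // E-step: expected membership of each point in each cluster.
        for (size_t i = 0; i < x.size(); ++i) {
            double p0 = w[0] * gauss(x[i], mu[0], var[0]);
            double p1 = w[1] * gauss(x[i], mu[1], var[1]);
            r[i] = p0 / (p0 + p1);
        }
        // M-step: re-estimate weights, means, and variances.
        for (int c = 0; c < 2; ++c) {
            double n = 0, s = 0, s2 = 0;
            for (size_t i = 0; i < x.size(); ++i) {
                double g = (c == 0) ? r[i] : 1 - r[i];
                n += g; s += g * x[i];
            }
            mu[c] = s / n;
            for (size_t i = 0; i < x.size(); ++i) {
                double g = (c == 0) ? r[i] : 1 - r[i];
                s2 += g * (x[i] - mu[c]) * (x[i] - mu[c]);
            }
            var[c] = s2 / n;
            w[c] = n / x.size();
        }
    }
    printf("means: %.1f %.1f\n", mu[0], mu[1]);
    return 0;
}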

Figure (3): Distribution of the Original Data under Different Attributes
Figure (5): Curve Fitting Using the Least Square Method on Cluster 2 of k-means, where the X-axis Represents the Disease and the Y-axis Represents the Age Attribute