Challenges of Outlier Detection in Data Mining
Outlier Detection means finding out the data objects whose properties and behavior are different from the rest of the objects in the cluster or the data sets. Outlier Detection is the process of finding the outliers from the normal objects. It is essential to perform the Outlier Detection during the data preprocessing. Outliers highly affect the performance of the classification and clustering models. There are many outlier detection methods in data mining. Some of them are as follows:
- Proximity-based methods
- Grid-based methods
- Distance-based methods
- Clustering-based methods
There are a few challenges while applying these outlier detection methods.
For more details please refer to the Types of Outliers article.
The challenges of outlier detection methods in data mining are listed below.
- Modeling normal outliers effectively: The quality of Outlier detection depends on the modeling of normal (which are not an outlier) objects. Often, building a model for finding the data normality is very challenging and maybe impossible because it is hard to determine all the behavioral properties of the normal objects. It is difficult to predict the border between normal outliers and abnormal outliers. some outlier detection methods distinguish the outliers by assigning each input data to object a label as either “normal” or “outlier” .while some other methods use the score measure as the factor to decide whether the object is an outlier. Based upon the application consistency and its data type the outlier detection method is chosen.
- Application-specific outlier detection: The relationship model is dependent on the type of application and it describes the normal data objects characteristics. Different applications require different types of data as input and require various modeling and analysis algorithms. Example: In clinical data analysis, a small deviation of data values reflects the choice of an outlier. In contrast, in marketing analysis, a larger deviation of data values is needed to justify an outlier. Choosing the Outlier detection method depends on the application type. We need to find out the outliers from a vast variety of applications data so the data types of these data sets may vary. There is no unique outlier detection method for all the applications.
- Handling the noise in outlier detection: Noise is usually present in all the data sets. Noise is present in outliers also. But there is a misassumption that noise and outliers are the same. The noise makes the quality of the data set to be imperfect. Noise often occurs when the data is collected from many resources and applications. Noise in the data sets is caused due to the duplicate tuples, missing values, and deviation of data attributes. Noise in the data sets makes the data-poor and it becomes a huge challenge to outlier detection. If noise is present in the data then it becomes difficult to retrieve the normal objects and separate the outliers from the data sets. Missing values may hide outliers and reduce the chance of detection of outliers.
- Understandability: In some cases, a client requires the condition of why a particular object has become an outlier as it may be useful for the process of applications. There must be a specific conditional criterion and justification to distinguish the normal objects from the outliers. And that justification must be well formulated. and understandable. Example: It is clear to understand the proximity outlier detection as the normal objects have nearly the proximity measures where as the outliers differ a large in their proximity measure.