Distance-Based Outlier Detection in Data Mining
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
Outliers are isolated data objects that do not follow the general trends of the data. They cause many problems in machine learning and deep learning work and can reduce the accuracy of a model, so detecting and removing outliers is very important.
Outlier detection is a natural extension of data mining techniques. Whereas data mining extracts general patterns or trends from large datasets, outlier detection discovers the data objects that deviate significantly from those patterns or trends.
For the types of outliers, see: Types of Outliers in Data Mining
Finding data objects that are significantly different from other objects is an important activity. By standing out from the crowd, outliers could represent objects that are, in some way, much better or much worse than the general trend. They may represent objects that need to be dealt with in some special manner. It is also possible that they represent erroneously entered data or even noise.
Consider the following examples of how outlier detection helps in data mining.
- Fraud detection: Fraud detection is very important in the modern world. Fraud cases, such as fraudulent credit card transactions and bank loan applications, are increasing day by day. Outlier detection helps us detect such fraud because fraudulent instances deviate from the normal trends.
- Medicine: In healthcare, detecting outliers plays an important role, i.e., unusual symptoms or test results may indicate potential health problems in patients.
There are many other applications of outlier detection in data mining.
Distance-Based Outlier Detection Methods
A distance-based outlier detection method consults the neighborhood of an object, which is defined by a given radius. An object is considered an outlier if its neighborhood does not contain enough other points. Methods that work this way are called distance-based outlier detection methods.
- Distance-based methods usually depend on a multi-dimensional index, which is used to retrieve the neighborhood of each object and check whether it contains sufficient points. If there are insufficient points, the object is termed an outlier.
- Distance-based methods scale better to multi-dimensional space and can be computed more efficiently than statistical methods. Identifying distance-based outliers is an important and useful data mining activity. The main disadvantage of distance-based methods is that detection depends on a single, global value of a user-specified parameter. This can cause significant problems if the dataset contains both dense and sparse regions, because a value suited to one region is a poor fit for the other.
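As a minimal sketch of the idea above, the following Python function counts, for each object, how many other objects lie within a radius r and flags the object when that count falls below a threshold. The function name, the `min_neighbors` threshold, and the toy data are assumptions made for illustration. The brute-force double loop does not use an index, so it is meant only to show the definition at work.

```python
import math


def distance_based_outliers(points, r, min_neighbors):
    """Return indices of points with fewer than min_neighbors other points within r."""
    outliers = []
    for i, a in enumerate(points):
        # Count the other points that fall inside the radius-r neighborhood of a.
        neighbors = sum(
            1 for j, b in enumerate(points)
            if j != i and math.dist(a, b) <= r
        )
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers


# Toy data (assumed): a dense cluster, a sparse cluster, and one isolated point.
dense = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
sparse = [(10 + x * 2.0, 10 + y * 2.0) for x in range(3) for y in range(3)]
points = dense + sparse + [(50.0, 50.0)]

# A radius chosen for the dense region versus one chosen for the sparse region.
small_r = distance_based_outliers(points, r=0.5, min_neighbors=3)
large_r = distance_based_outliers(points, r=3.0, min_neighbors=3)
print(small_r)
print(large_r)
```

With r = 0.5, the whole sparse cluster is reported along with the isolated point, while r = 3.0 reports only the isolated point. This is one way to see why a single global parameter value is hard to choose when the density of the data varies.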
Outlier detection methods can be categorized according to whether the data sample for analysis comes with expert-provided labels that can be used to build an outlier detection model. On this basis, detection methods are supervised, semi-supervised, or unsupervised. Alternatively, outlier detection methods may be organized according to their assumptions regarding normal objects versus outliers. This categorization includes statistical methods, proximity-based methods, and clustering-based methods.
Algorithms For Mining Distance-Based Outliers:
Below are some algorithms used to mine distance-based outliers more efficiently.
- Index-based algorithm: The index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for the neighbors of each object o within radius d around that object. Let M = n(1 - p) be the maximum number of objects allowed within the d-neighborhood of an outlier, where n is the number of objects in the dataset and p is the minimum fraction of objects that must lie farther than d from an outlier. Once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(k * n^2), where k is the dimensionality and n is the number of objects in the dataset.
- Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids building an index structure and tries to minimize the number of I/O operations. It divides the memory buffer into two halves and the dataset into several logical blocks.
- Cell-based algorithm: To avoid O(n^2) computational complexity, the cell-based algorithm partitions the data space into cells and was developed for memory-resident datasets. Its complexity is O(c^k + n), where c is a constant that depends on the number of cells and k is the dimensionality.
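To make the cell-based idea more concrete, here is a rough Python sketch restricted to two dimensions. It rests on several assumptions that are not stated in the text above: the function name is hypothetical, the threshold `m` (the maximum number of neighbors an outlier may have) is passed in directly, a neighbor count excludes the object itself, and the cell side is set to d / (2 * sqrt(2)). With that side length, any two objects in the same or adjacent cells are within d of each other, and objects more than three cells away are farther than d. The sketch illustrates the pruning rules only and makes no claim about matching the complexity quoted above.

```python
import math
from collections import defaultdict


def cell_based_outliers(points, d, m):
    """Indices of 2-D points with at most m other points within distance d (sketch)."""
    side = d / (2 * math.sqrt(2))  # cell diagonal is d / 2

    # Hash every point into its grid cell.
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(math.floor(x / side), math.floor(y / side))].append(i)

    def members(cx, cy, lo, hi):
        """Point indices in cells lo..hi cells away (Chebyshev) from (cx, cy)."""
        found = []
        for dx in range(-hi, hi + 1):
            for dy in range(-hi, hi + 1):
                if max(abs(dx), abs(dy)) >= lo:
                    found.extend(cells.get((cx + dx, cy + dy), []))
        return found

    outliers = []
    for (cx, cy), own in cells.items():
        # Same-cell and adjacent-cell objects are all guaranteed to be within d.
        near = len(own) - 1 + len(members(cx, cy, 1, 1))
        if near > m:
            continue  # no object in this cell can be an outlier
        layer2 = members(cx, cy, 2, 3)  # the only other candidates within d
        if near + len(layer2) <= m:
            outliers.extend(own)  # every object in this cell is an outlier
            continue
        # Undecided cell: check each object against the layer-2 objects only.
        for i in own:
            count = near
            for j in layer2:
                if math.dist(points[i], points[j]) <= d:
                    count += 1
                    if count > m:
                        break  # more than m neighbors found, so not an outlier
            if count <= m:
                outliers.append(i)
    return sorted(outliers)
```

The early `break` mirrors the stopping rule used by the index-based and nested-loop algorithms: as soon as an object has more than m neighbors, it cannot be an outlier, so the remaining distance computations are skipped.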