Entity Identification Problem in Data Mining
Data mining is now used almost everywhere large amounts of data are stored and processed. Data integration, one of the major tasks of data preprocessing, combines multiple databases or data files into a single, consistent data store. It is usually performed to build data sets for machine learning algorithms and to derive statistical information from the data during data mining. The data may come from many sources, such as banking transactions, invoices, customer records, Twitter, blog postings, image, audio, or video data, electronic data interchange (EDI) files, spreadsheets, and sensor data.
Data mining often requires data integration, which merges data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. A number of issues must be considered during data integration, such as schema integration and object matching.
Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set, which in turn can improve the accuracy and speed of the subsequent data mining process. The semantic heterogeneity and structure of data pose great challenges in data integration. How can schemas and objects from different sources be matched? In other words, how can equivalent real-world entities from multiple data sources be matched up? This is known as the entity identification problem.
Data is usually collected from multiple sources into a coherent store, and it can differ in dimensions and data types. The same information may also be represented differently, or recorded on different scales, in different sources.
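To make the point about differing representations and scales concrete, here is a minimal sketch. It assumes two hypothetical sources: one stores weights in kilograms with ISO dates, the other stores weights in pounds with MM/DD/YYYY dates. All field names (`id`, `item_no`, `weight_lb`, and so on) are invented for illustration and do not come from any particular system.

```python
from datetime import date, datetime

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def normalize_a(rec: dict) -> dict:
    """Source A (hypothetical): weight in kilograms, ISO-formatted dates."""
    return {
        "item_id": int(rec["id"]),
        "weight_kg": float(rec["weight_kg"]),
        "shipped": date.fromisoformat(rec["date"]),
    }


def normalize_b(rec: dict) -> dict:
    """Source B (hypothetical): weight in pounds, MM/DD/YYYY dates."""
    return {
        "item_id": int(rec["item_no"]),
        # Convert pounds to kilograms so both sources share one scale.
        "weight_kg": round(float(rec["weight_lb"]) * LB_TO_KG, 3),
        "shipped": datetime.strptime(rec["shipped"], "%m/%d/%Y").date(),
    }


# Illustrative records only; the values are made up.
source_a = [{"id": "1", "weight_kg": "2.5", "date": "2024-03-14"}]
source_b = [{"item_no": "7", "weight_lb": "10", "shipped": "03/15/2024"}]

# After normalization, every record has the same fields, types, and units.
integrated = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
for row in integrated:
    print(row)
```

Writing one normalizer per source keeps each source's conventions in a single place, so the integrated store only ever sees one representation.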
Issues in Data Integration:
- Data redundancy: Redundant data often arises when data from multiple databases is merged. If it is not removed, data analysis can produce incorrect results. Redundancy arises for the following reasons:
  - Object identification: The same attribute or object may have different names in different databases.
  - Derivable data: An attribute in one table may be a “derived” attribute in another table, i.e., one that can be computed from other attributes, e.g., annual revenue.
  - Duplicate data attributes: The information in one attribute may already be contained in one or more other attributes.
- Irrelevant attributes: Some attributes are not important for the data mining task at hand and are not considered when performing it, so there is no use in keeping them in the data. For example, a student’s ID is often irrelevant to the task of predicting the student’s GPA.
- Entity Identification Problem: This is the problem of matching up equivalent real-world entities from multiple data sources. It arises during data integration: when data from multiple sources is combined, some items from different sources refer to the same thing and become redundant if they are integrated without being matched. For example, consider A.cust-id = B.cust-number, where A and B are two different database tables, cust-id is an attribute of table A, and cust-number is an attribute of table B. No relationship is declared between the two tables, yet cust-id and cust-number take the same values, so they represent the same attribute. Metadata can be used to avoid errors in such schema integration. It also helps ensure that functional dependencies and referential constraints in the source system match those in the target system. Resolving the entity identification problem also helps in detecting and resolving data value conflicts.
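The cust-id / cust-number example can be sketched in code. The snippet below is only one possible heuristic, not a standard algorithm from the text: it proposes attribute pairs whose value sets overlap strongly, measured with the Jaccard coefficient. The function names, the 0.5 threshold, and the sample values are all assumptions made for illustration. In practice a candidate match should still be confirmed against metadata.

```python
def jaccard(values_a, values_b) -> float:
    """Share of distinct values that the two attributes have in common."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_attributes(table_a: dict, table_b: dict, threshold: float = 0.5):
    """Return (attr_a, attr_b, score) for pairs with overlap >= threshold.

    Each table is a dict mapping an attribute name to its list of values.
    The threshold is an arbitrary illustrative choice.
    """
    matches = []
    for attr_a, values_a in table_a.items():
        for attr_b, values_b in table_b.items():
            score = jaccard(values_a, values_b)
            if score >= threshold:
                matches.append((attr_a, attr_b, score))
    # Strongest candidates first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Made-up sample data: the two tables declare no relationship,
# but cust-id and cust-number share most of their values.
table_a = {
    "cust-id": [101, 102, 103, 104],
    "name": ["Ann", "Bob", "Cy", "Di"],
}
table_b = {
    "cust-number": [102, 103, 104, 105],
    "city": ["Oslo", "Rome", "Lima", "Pune"],
}

candidates = match_attributes(table_a, table_b)
for attr_a, attr_b, score in candidates:
    print(f"A.{attr_a} ~ B.{attr_b} (overlap {score:.2f})")
```

Value overlap alone can mislead, since two unrelated numeric columns may share values by coincidence, which is why the result is treated here as a list of candidates rather than a final mapping.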
Data integration techniques:
- Manual Integration
- Middleware Integration
- Application-based Integration
- Uniform Access Integration
- Data Warehousing