Data Integration in Data Mining
Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
A data integration approach is formally defined as a triple <G, S, M> where,
G stands for the global schema,
S stands for the set of heterogeneous source schemas,
M stands for the mapping between queries over the source schemas and queries over the global schema.
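The triple can be pictured as a small data structure. The sketch below is only an illustration: the class name `IntegrationSystem`, the source names `crm` and `billing`, and all attribute names are hypothetical, and the mapping M is simplified to attribute-level correspondences rather than full query mappings.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IntegrationSystem:
    """Hypothetical container for the triple <G, S, M>."""

    global_schema: set[str]              # G: attributes exposed to the user
    source_schemas: dict[str, set[str]]  # S: attributes of each source
    # M (simplified): global attribute -> {source name: source attribute}
    mapping: dict[str, dict[str, str]]

    def sources_for(self, global_attr: str) -> dict[str, str]:
        """Return where a global attribute lives in each source."""
        return self.mapping.get(global_attr, {})


system = IntegrationSystem(
    global_schema={"customer_id", "name"},
    source_schemas={
        "crm": {"cust_id", "full_name"},
        "billing": {"customer_no"},
    },
    mapping={
        "customer_id": {"crm": "cust_id", "billing": "customer_no"},
        "name": {"crm": "full_name"},
    },
)

print(system.sources_for("customer_id"))
```

A real system would usually express M as views or query rewritings, not as a plain dictionary.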
There are two major approaches to data integration: the “tight coupling approach” and the “loose coupling approach”.
Tight Coupling:
- Here, a data warehouse is treated as an information retrieval component.
- In this coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation, and Loading.
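The ETL flow of the tight coupling approach can be sketched as follows. This is a minimal illustration under stated assumptions: the two in-memory sources, their field names, and the `customer` table are invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources with different layouts (assumed for illustration).
crm_rows = [{"cust_id": 1, "full_name": "Ada Lovelace"}]
billing_rows = [{"customer_no": 2, "name": "alan turing"}]


def extract():
    """Extraction: read raw records and tag each with its source."""
    for row in crm_rows:
        yield "crm", row
    for row in billing_rows:
        yield "billing", row


def transform(source, row):
    """Transformation: map each source layout to (customer_id, name)."""
    if source == "crm":
        return row["cust_id"], row["full_name"].title()
    return row["customer_no"], row["name"].title()


def load(conn, records):
    """Loading: write the unified records into one physical store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, [transform(source, row) for source, row in extract()])

result = warehouse.execute(
    "SELECT customer_id, name FROM customer ORDER BY customer_id"
).fetchall()
print(result)
```

After loading, queries run against the warehouse alone; the sources are not contacted at query time.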
Loose Coupling:
- Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends it directly to the source databases to obtain the result.
- The data remains only in the actual source databases.
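The loose coupling approach can be sketched as a small mediator that rewrites one user-level request into a query per source and merges the answers. Everything named here is an assumption made for illustration: the helper `make_source`, the tables `clients` and `accounts`, and the function `customers_with_id_at_least`. In-memory SQLite databases stand in for the real sources.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create a stand-in source database (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


crm = make_source(
    "CREATE TABLE clients (cust_id INTEGER, full_name TEXT)",
    "INSERT INTO clients VALUES (?, ?)",
    [(1, "Ada Lovelace")],
)
billing = make_source(
    "CREATE TABLE accounts (customer_no INTEGER, name TEXT)",
    "INSERT INTO accounts VALUES (?, ?)",
    [(2, "Alan Turing")],
)

# The same global request, rewritten in each source's own vocabulary.
SOURCES = [
    (crm, "SELECT cust_id, full_name FROM clients WHERE cust_id >= ?"),
    (billing, "SELECT customer_no, name FROM accounts WHERE customer_no >= ?"),
]


def customers_with_id_at_least(min_id):
    """Send the rewritten query to every source and merge the results."""
    results = []
    for conn, sql in SOURCES:
        results.extend(conn.execute(sql, (min_id,)).fetchall())
    return sorted(results)


print(customers_with_id_at_least(1))
```

No data is copied into a central store; each call reads the sources as they are at query time.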
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
- Integrate metadata from different sources.
- Matching real-world entities across multiple sources is referred to as the entity identification problem.
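A simple way to picture schema integration and entity identification is to rename source attributes to global ones using metadata, then match records on a shared key. The sketch below assumes a hypothetical `ATTRIBUTE_MAP` and that both sources carry a reliable customer identifier; real entity identification often needs fuzzier matching.

```python
# Hypothetical metadata: source attribute -> global attribute.
ATTRIBUTE_MAP = {
    "crm": {"cust_id": "customer_id", "full_name": "name"},
    "billing": {"customer_no": "customer_id", "name": "name"},
}


def to_global(source, record):
    """Rename a record's attributes to the global schema."""
    names = ATTRIBUTE_MAP[source]
    return {names[k]: v for k, v in record.items() if k in names}


def match_entities(records_a, records_b, key="customer_id"):
    """Pair records from two sources that share the same key value."""
    index = {r[key]: r for r in records_a}
    return [(index[r[key]], r) for r in records_b if r[key] in index]


crm = [to_global("crm", {"cust_id": 7, "full_name": "Ada Lovelace"})]
billing = [
    to_global("billing", {"customer_no": 7, "name": "A. Lovelace"}),
    to_global("billing", {"customer_no": 9, "name": "Alan Turing"}),
]

matches = match_entities(crm, billing)
print(matches)
```

Here `cust_id` and `customer_no` are treated as the same attribute only because the metadata says so; without that metadata the match could not be made safely.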
2. Redundancy:
- An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
- Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
- Some redundancies can be detected by correlation analysis.
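For numeric attributes, correlation analysis can be sketched with the Pearson correlation coefficient: a value close to +1 or -1 suggests that one attribute may be derivable from the other. The attribute names, the sample values, and the 0.9 threshold below are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical integrated data set.
annual_salary = [12000.0, 24000.0, 36000.0, 60000.0]
monthly_salary = [a / 12 for a in annual_salary]  # derived attribute
age = [52, 23, 41, 35]

THRESHOLD = 0.9  # assumed cut-off for flagging a candidate redundancy

r_salary = pearson(annual_salary, monthly_salary)
r_age = pearson(annual_salary, age)

print("monthly_salary redundant?", abs(r_salary) >= THRESHOLD)
print("age redundant?", abs(r_age) >= THRESHOLD)
```

A high correlation only flags a candidate; whether the attribute is actually dropped remains a modelling decision.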
3. Detection and resolution of data value conflicts:
- This is the third important issue in data integration.
- Attribute values from different sources may differ for the same real-world entity.
- An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
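Both kinds of conflict can be sketched in a few lines. The examples are assumptions, not taken from the text above: a weight stored in kilograms in one source and pounds in another illustrates differing values, and sales recorded per branch in one source and per region in another illustrates differing levels of abstraction. The names `weight_in_kg`, `roll_up`, and `BRANCH_TO_REGION` are hypothetical.

```python
LB_PER_KG = 2.20462  # approximate pounds per kilogram


def weight_in_kg(value, unit):
    """Resolve a unit conflict by converting every weight to kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical hierarchy linking the finer level (branch) to the coarser (region).
BRANCH_TO_REGION = {"branch_1": "north", "branch_2": "north", "branch_3": "south"}


def roll_up(branch_sales):
    """Aggregate branch-level sales to region level so both sources agree."""
    totals = {}
    for branch, amount in branch_sales.items():
        region = BRANCH_TO_REGION[branch]
        totals[region] = totals.get(region, 0) + amount
    return totals


region_sales = roll_up({"branch_1": 100, "branch_2": 50, "branch_3": 70})
print(region_sales)
```

Rolling up to the coarser level is usually the safer direction, since the finer level cannot be recovered from aggregated values.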