Why to Use Surrogate Keys and Slowly Changing Dimensions in Star Schema?
The surrogate key is a unique identifier of a row/record in a relational table. It is added to each dimension table which does not include a single unique column to support star modeling. Examples of a surrogate key in dimension table:
Dimension: Customer Attributes: Customer_id (Primary key and surrogate key) Name Age Gender Address Email Marital Status city
Here, “Customer_id” is the surrogate key. It is not a piece of useful information about individual customers but gives a unique identification to every individual customer when stored in the databases.
Use of surrogate keys in the Star Schema:
- Business keys usually have a business meaning in OLTP systems, such as Social Security Number for employees. Hence, these are tied to the business settings and requirements. If this type of business key would change then all the tables which are using these business keys will also need to be updated. Hence, we avoid this problem and just add surrogate keys to each dimension table. Surrogate keys are just unique integers and will never have to change.
- Compare to surrogate keys, business keys are usually bigger in size which will result in big indexes and slow down index traversal and thus query execution time increases. Therefore, to make them efficient and faster systems, we avoid business keys and use instead surrogate keys.
- Business keys are often re-used over longer periods of time. Thus, the same business key will not uniquely identify a record in the table.
- Surrogate keys can also be successfully used to deal with slowly changing dimensions.
Therefore, we don’t use unique business keys as the primary keys. We prefer surrogate keys as the primary keys in each dimension table.
Slowly Changing Dimensions (SCD):
- These are the attributes of dimension tables that slowly change over time, not frequently nor periodically.
- SCD contains both current and historical data.
- Marital Status of the customer, Email, and mobile number of the customer.
In OLTP systems, overwriting of records is allowed. But in data warehouse systems historical data is stored for analysis purposes. So, this data can not be overwritten.
Types of SCDs:
Type-1 SCD: In Type -1 SCD Overwriting is allowed. Here, New data overwrites the existing data. Existing data is lost and not stored anywhere else. This is the default type of dimension we create. No need to specify any additional information to create them. Example: Email and Marital_status of the customer dimension. Marital status does not change frequently. It changes very slowly over time once or twice.
The dimensions will always contain the current values for each attribute and historical data is lost.
Type-2 SCD: It creates another dimension record. It retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active. Three extra columns should be added to the dimension table: 1. Start_Date. 2. End_Date, 3. Active_Flag. Example: Salary of the Employees Dimension table. If salary changes overtime then it is not overwritten rather an active time period is added to old salary information.
Type-3 SCD: Creating a current value field. Stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any selected attribute changes then the current value is stored in previous_column and the new value is stored in current_value_column. Example: City of the customer dimension. If a customer changes his city over time, then we neither overwrite the city name nor keep the active flag for that. We simply create two rows stating what was the old city and what is the new city.
Note: SCD-1 approach is commonly used for supplementary columns such as email id, mobile number of customers. SCD-2 is not a good approach for Rapidly Changing Dimensions. Hence, separate rapidly changing attributes by implementing junk dimensions.
Junk dimension: The junk dimension can be defined as a place to store junk attributes.The collection of random transactional codes, flags, etc. that are not related to any particular dimensions. Attributes such as delivered, shipped, received, packed, and returned can have indicator flags like Yes/No, Y/N, True/False. It reduces the size of the fact table.
Factless table: A fact table that does not contain any fact or measure. It contains only the primary key from its dimensional tables. It is used to resolve many-to-many cardinality issues.