Monday, 31 October 2022

What is meant by SCD, and what are the types of SCD?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Tracking the history of dimension records is considered one of the most critical ETL tasks.

SCD Type 0 : 

This refers to attributes that never change and will not be updated in the data warehouse. 
e.g. date of birth or an employee's start date with a company.

SCD Type 1 : 

History is not maintained, and the table always shows the most recent data. This effectively means that such a dimension table is always updated in place whenever there is a change, and because of this update we lose the previous values.

SCD Type 2 : 

A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys. 
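A minimal sketch of this in Python with sqlite3 standing in for the warehouse database (the `dim_customer` table, its columns, and the sample values are all hypothetical): the current row is expired and a new row is inserted under a fresh surrogate key, so both versions survive.

```python
import sqlite3

# Hypothetical dim_customer table: surrogate key, natural (business) key,
# one tracked attribute, and a flag marking the current row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk  INTEGER PRIMARY KEY,  -- surrogate key
        customer_id  TEXT,                 -- natural key from the source
        city         TEXT,                 -- tracked attribute
        is_current   INTEGER               -- 1 = current row, 0 = history
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'C001', 'Chennai', 1)")

def apply_scd2(conn, customer_id, new_city):
    """Expire the current row, then insert a new row with a new surrogate key."""
    conn.execute(
        "UPDATE dim_customer SET is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, is_current) VALUES (?, ?, 1)",
        (customer_id, new_city),
    )

apply_scd2(conn, "C001", "Bangalore")
rows = conn.execute(
    "SELECT customer_sk, city, is_current FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(rows)  # [(1, 'Chennai', 0), (2, 'Bangalore', 1)] -> history is retained
```

In practice the current-row flag is often accompanied by effective-from/effective-to dates, but the expire-and-insert pattern is the same.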

SCD Type 3 : 

A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally.
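A small sketch of the column-based approach, again with sqlite3 and a hypothetical `dim_customer` table: on change, the current value is shifted into a "previous" column and then overwritten, so only one level of history is kept.

```python
import sqlite3

# Hypothetical type 3 customer dimension: history lives in an extra
# "previous_city" column rather than in extra rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id   TEXT PRIMARY KEY,
        current_city  TEXT,
        previous_city TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES ('C001', 'Chennai', NULL)")

# On change, shift the current value into the history column, then overwrite it
conn.execute("""
    UPDATE dim_customer
    SET previous_city = current_city,
        current_city  = ?
    WHERE customer_id = ?
""", ("Bangalore", "C001"))

row = conn.execute("SELECT * FROM dim_customer").fetchone()
print(row)  # ('C001', 'Bangalore', 'Chennai')
```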

SCD Type 4 : 

Provides a solution to handle rapid changes in dimension tables. The concept lies in creating a mini-dimension, a small dimension table holding all the possible values of the rapidly changing attributes of the dimension.

SCD Type 6 : 

It is a hybrid of types 1, 2 and 3 (1+2+3). It acts very much like type 2, but adds one extra column to denote which record is the current record.



What is a junk dimension?

A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables.

The nature of these attributes is usually text or various flags.

e.g. non-generic comments or simple yes/no or true/false indicators.


ETL Test Scenarios

The following scenarios are validated as part of ETL testing :

Table structure verification :

The column name, data type and data length of the target table are verified against the requirement.

Constraint check :

We ensure that all required constraints are in place.

Index check :

We ensure that the indexes are created on the required columns.

Source data validation :

Record the source table count and ensure that no junk or bad data exists.

Data count check :

Compare the target data count against the source data count, taking the business logic into account.
Ensure that the source data moved correctly to the target table by comparing the data.
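As an illustration (sqlite3 standing in for the warehouse, with hypothetical `src_orders`/`tgt_orders` tables), the count check plus a MINUS/EXCEPT-style content comparison might look like:

```python
import sqlite3

# Toy source and target tables for the count and content comparison
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, status TEXT);
    CREATE TABLE tgt_orders (order_id INTEGER, status TEXT);
    INSERT INTO src_orders VALUES (1,'OPEN'), (2,'CLOSED'), (3,'OPEN');
    INSERT INTO tgt_orders VALUES (1,'OPEN'), (2,'CLOSED'), (3,'OPEN');
""")

# Data count check: target count must match the source count
src_count = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
tgt_count = conn.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]
assert src_count == tgt_count, f"count mismatch: {src_count} vs {tgt_count}"

# Content check: rows present in source but missing in target
missing = conn.execute("""
    SELECT order_id, status FROM src_orders
    EXCEPT
    SELECT order_id, status FROM tgt_orders
""").fetchall()
assert missing == [], f"rows not moved correctly: {missing}"
print("data count and content checks passed")
```

Where a transformation changes the row count by design, the expected count is derived from the business logic rather than a straight source count.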

Duplicate data validation :

Ensure that duplicate records are not present in the source and target tables.
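The classic duplicate check is a GROUP BY on the business key with a HAVING clause. A sketch against a hypothetical `tgt_customer` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tgt_customer (customer_id TEXT, name TEXT);
    INSERT INTO tgt_customer VALUES ('C001','Asha'), ('C002','Ravi'), ('C001','Asha');
""")

# Group on the business key and flag any key appearing more than once
dups = conn.execute("""
    SELECT customer_id, COUNT(*) AS cnt
    FROM tgt_customer
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # [('C001', 2)] -> a duplicate the load should not have produced
```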

Data with primary key and foreign key check :

Test the primary key and foreign key relationship with different test data for the parent and child tables.

Null check :

Inject NULL data into a NOT NULL column and verify that the data is rejected.
Also check for data that is present in the source but NULL in the target.
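The rejection half of this check can be sketched as follows (sqlite3 again as a stand-in; the `tgt_employee` table is hypothetical): the database constraint itself should refuse the NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgt_employee (emp_id INTEGER NOT NULL, name TEXT)")

# Injecting NULL into a NOT NULL column should be rejected by the constraint
try:
    conn.execute("INSERT INTO tgt_employee VALUES (NULL, 'Kumar')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("NULL rejected:", rejected)
```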

Data precision check :

Create test data in the source table with different precisions and ensure the loaded data has precision as per requirement.

Date format check :

Verify that all date columns are loaded in the defined date format.
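One way to sketch this check in Python, assuming the defined format is YYYY-MM-DD (the sample values are invented): parse each loaded value with the required format and collect any that fail.

```python
from datetime import datetime

# Hypothetical loaded values; the required format is assumed to be YYYY-MM-DD
loaded_dates = ["2022-10-31", "2022-01-05", "31-10-2022"]

def matches_format(value, fmt="%Y-%m-%d"):
    """Return True if the value parses in the defined date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

bad = [d for d in loaded_dates if not matches_format(d)]
print(bad)  # ['31-10-2022'] -> a value loaded in the wrong format
```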

ETL Testing Lifecycle/Phases

The different phases of ETL testing are as follows :

Requirement analysis :

The inputs to start testing are the data model and the mapping document. We need to make sure that the source tables or files provided by the developer are correct as per the stated requirements.

Test planning :

We need to prepare a test plan based on the requirements. Every company has its own template for the test plan.

Test Design :

Test cases are prepared based on the mapping document and the data model.
In this stage we analyze the source data and the mapping documents (data types, data lengths and relationships) and prepare the test queries.

Test Execution :

Once the environment setup is done, the data is available for testing and all the required test scripts are ready, the first round of execution can be performed with the bulk-load jobs, and all the stages from source to target are tested one by one. If you observe any issue during testing, log a defect.

Defect retesting :

Fixed defects are re-tested in this phase. Based on impact analysis, the relevant test cases are executed as part of the defect fix.

Test Closure/Sign off :

Based on the exit criteria of test execution, a sign-off mail is sent to the stakeholders so that the code can be pushed to the next level.

Types of load in ETL?

Full load (bulk load)

The data-loading process performed the very first time. It can be referred to as a bulk load or fresh load.


Incremental load

After the initial full load, only the new or modified data is loaded into the target. The changes are captured by comparing the created or modified date against the last run date of the job.
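A minimal sketch of the change-capture step (sqlite3 as the stand-in database; the `src_product`/`tgt_product` tables and the job-control date are hypothetical): only rows with a modified date after the last run are picked up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_product (product_id INTEGER, name TEXT, modified_date TEXT);
    CREATE TABLE tgt_product (product_id INTEGER, name TEXT, modified_date TEXT);
    INSERT INTO src_product VALUES
        (1, 'Pen',    '2022-10-01'),
        (2, 'Pencil', '2022-10-20'),
        (3, 'Eraser', '2022-10-30');
""")

last_run_date = "2022-10-15"  # assumed to come from a job-control table

# Incremental load: pick up only the rows modified after the last run
conn.execute("""
    INSERT INTO tgt_product
    SELECT product_id, name, modified_date
    FROM src_product
    WHERE modified_date > ?
""", (last_run_date,))

loaded = conn.execute(
    "SELECT product_id FROM tgt_product ORDER BY product_id"
).fetchall()
print(loaded)  # [(2,), (3,)] -> only the changed rows moved to the target
```

ISO-formatted date strings are used here so that string comparison matches date order; a real job would compare proper date/timestamp columns.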


What is meant by Dimensions?

A Dimension is descriptive data that describes the key.

The Dimensions are organized in tables called Dimension tables.

Dimension tables are De-normalized.

There is a one-to-many relationship between a dimension table and the fact table.

Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

e.g. Customer, Territory, Employee, Product, etc.

What is meant by Dimensional Data Modeling?

The dimensional data model is most often used in data warehousing systems.

In a data warehouse, a schema defines the way the system is organized, with all the database entities (fact and dimension tables) and their logical associations.

In designing data models for data warehouses / data marts, the most commonly used schema types are as follows :

Star Schema :

A fact table in the center surrounded by multiple dimension tables.

Snowflake Schema :

A fact table in the center surrounded by multiple hierarchies of dimension tables.

Galaxy Schema / Fact Constellation Schema :

Multiple fact tables share the same dimension tables. The arrangement of fact tables and dimension tables looks like a collection of stars in the Galaxy schema model.

Star Cluster Schema :

Combination of Star and Snowflake schema.

Whether one uses a star or a snowflake largely depends on business needs.



What are the types of Data Model?

Conceptual Data Model: 

This Data Model defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose is to organize, scope and define business concepts and rules.

Logical Data Model: 

Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.

Physical Data Model: 

This Data Model describes HOW the system will be implemented using a specific DBMS. This model is typically created by DBAs and developers. The purpose is the actual implementation of the database.

Why Data Model?

Ensures that all data objects required by the database are accurately represented.

A data model helps design the database at the conceptual, physical and logical levels.

It helps to define the relational tables, primary and foreign keys and stored procedures.

It provides a clear picture of the base data and can be used by database developers to create a physical database.

What is Data Model?

A Data Model is defined as an abstract model that organizes the description, semantics, and consistency constraints of data.

The data model emphasizes what data is needed and how it should be organized, rather than what operations will be performed on the data.

What is ETL testing?

ETL testing is done to ensure that the data that has been loaded from a source to the destination after business transformation is accurate. It also involves the verification of data at various middle stages that are being used between source and destination.

Testing can also help identify and prevent issues with data quality during the ETL process, such as duplicate data or data loss.






What is a Data Warehouse?

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing.

It includes historical data derived from transaction data from single and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.

The key features of Datawarehouse are : 

- Subject-oriented

- Integrated

- Time variant

- Non-volatile


Types of Database

Centralized Database 
Distributed Database 
Relational Database 
NoSQL Database 
Cloud Database 
Hierarchical Database 
Network Database 
Personal Database 
Operational Database 
Enterprise Database

What is Database?

A database is an organized collection of data, so that it can be easily accessed and managed. The main purpose of the database is to operate a large amount of information by storing, retrieving, and managing data. 

 There are many databases available like MySQL, Sybase, Oracle, MongoDB, PostgreSQL, SQL Server, AWS (Amazon Web Services), Snowflake Computing, etc. 

 SQL is a standard language for storing, manipulating, and retrieving data in databases.

How to perform regression testing in ETL?

Regression testing in ETL (Extract, Transform, Load) ensures that changes or updates to the ETL process do not negatively impact existing fu...