What is dbt?
dbt is an open-source transformation tool that aims to simplify the work of the analytics engineer in the data pipeline workflow. Specifically, it handles only the T in the ELT process. Its biggest feature is that it brings software engineering principles, long common in the software industry, into the analytics world.
- Model: a SQL SELECT statement that takes tables as input, transforms them, and materializes the result as a table or view.
- Package: a dbt project that bundles a reusable set of SQL models.
- Source: a table in the data warehouse where the transformation begins.
- Exposure: a representation of a downstream use of the dbt project, such as a dashboard.
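To make the model concept concrete, here is a minimal sketch of a dbt model; the source name `shop` and table `raw_orders` are illustrative, not from the original text:

```sql
-- models/staging/stg_orders.sql
-- A model is just a SELECT statement; dbt materializes the result
-- as a table or view depending on its configuration.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('shop', 'raw_orders') }}  -- hypothetical source declared in YAML
```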
In most data pipelines, too little attention is paid to version control, testing, and documentation, even though these are important factors in maintaining a pipeline. Significant time savings are seen when an error in a dashboard can be quickly traced through the tests and the data lineage graph instead of wasting hours finding the problem. Tests are the first to inform developers that something has failed, instead of the client discovering through the dashboard that something is broken.
- dbt comes in two versions: dbt Cloud and the dbt Command Line Interface (CLI). While dbt Cloud has a UI that takes care of some of the configuration and offers a preview of the transformed data, the same commands as in the CLI still need to be used. Because of this limited UI, dbt is not aimed at business analysts who prefer point-and-click tools and want to write as little code as possible. dbt is a code-first tool focused entirely on transformation.
- dbt is very useful for larger data teams, as it is built for a specific role in the data pipeline process: the analytics engineer. This is a relatively new role that mostly appears in larger data teams. dbt can be used to meet all the needs of an analytics engineer, while other roles such as the data engineer and the data analyst take care of data lake management, extraction and loading of data into the warehouse, and business intelligence.
1. Documentation:
Documentation is essential when the data pipeline is extended with a new feature or when another developer takes over the project. dbt extracts metadata from the different tables and their relations and summarizes it in a clear overview.
- The different models contain metadata about specific details such as type and package, a description of the table, the different columns with their names, types, and definitions, and which tests apply. Dependencies on other tables, and references from other tables, can also be found here. Both the source code and the compiled code are stored.
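As a sketch, this per-model documentation lives in a YAML file alongside the models; the model and column names below are hypothetical:

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "Staged orders, one row per order"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Foreign key to the customer"
```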
- Different packages can be included for additional functionality. Note: the open-source community on dbt Hub already offers solutions for many analytics problems as packages.
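Packages are declared in a `packages.yml` file and installed with `dbt deps`; `dbt_utils` is one widely used package from dbt Hub (the version pin below is illustrative):

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1  # illustrative version pin
```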
- Sources, from which the first staging models are built, can be specified and documented. Such a source could, for example, be a Delta table loaded with Databricks.
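A source like that Delta table can be declared and documented in YAML, for example (the schema and table names are assumptions):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: shop
    schema: raw            # schema where Databricks lands the Delta table
    tables:
      - name: raw_orders
        description: "Raw orders, loaded as a Delta table by Databricks"
```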
- The different dashboards use the fact and dimension tables, and these uses can be specified and described in exposures. Note: this is only a representation; the data will not automatically flow to the dashboard.
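An exposure declaration might look like the following sketch; the dashboard name and owner details are hypothetical:

```yaml
# models/marts/exposures.yml
version: 2

exposures:
  - name: sales_dashboard
    type: dashboard
    owner:
      name: Analytics team
      email: analytics@example.com   # hypothetical owner
    depends_on:
      - ref('fct_orders')      # fact table used by the dashboard
      - ref('dim_customers')   # dimension table used by the dashboard
```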
2. Data Lineage:
The data lineage graph is an important factor in showing where your data comes from and where it is going. Many data teams use a visual lineage graph to follow the data flow from the source tables in the data warehouse, through the various transformed tables, to the dashboards that bring insight into the data.
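dbt derives this lineage graph from the `ref()` and `source()` calls inside the models; a sketch of a downstream model (the model names are illustrative):

```sql
-- models/marts/fct_orders.sql
-- dbt infers the lineage edges stg_orders -> fct_orders and
-- stg_payments -> fct_orders from these ref() calls.
select
    o.order_id,
    o.customer_id,
    sum(p.amount) as total_amount
from {{ ref('stg_orders') }} o
left join {{ ref('stg_payments') }} p
    on o.order_id = p.order_id
group by o.order_id, o.customer_id
```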
3. Version Control:
Version control is done with Git, using a chosen Git repository to store the code. Models, tests, exposures, sources, project configurations, and the packages used are all versioned.
4. Incremental transformations:
- Incrementally transforming the data is necessary for handling big data.
- A model can be configured as incremental, and some Jinja code can be added so that a filter is applied only during an incremental run.
- To avoid duplicate rows, a unique key can be set. To avoid losing too much data with a strict cutoff, a look-back window can be used; however, data that arrives later than the window will still be missed.
Things to remember when using an incremental table:
- A column representing time, or an ascending serial key.
- A column with a unique key, or a combination of columns that is unique, to avoid duplicate rows.
- When rows are updated, a column representing the time at which the row was updated, to use as the cutoff point from which the data needs to be transformed.
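A sketch of an incremental model combining these points; the three-day look-back window and the column names are assumptions, and the interval syntax varies per warehouse:

```sql
-- models/marts/fct_orders_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='order_id'      -- avoids duplicate rows when late data is re-processed
) }}

select
    order_id,
    customer_id,
    amount,
    updated_at                 -- time column used as the cutoff point
from {{ source('shop', 'raw_orders') }}

{% if is_incremental() %}
-- Applied only on incremental runs; the look-back window catches
-- late-arriving data, but rows arriving later than 3 days are still missed.
where updated_at > (select max(updated_at) - interval '3 days' from {{ this }})
{% endif %}
```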
5. Testing:
Testing is important for ensuring data quality and early detection of problems. dbt provides ways to integrate testing into the data pipeline. Some tests ship with dbt, and others can be found in open-source packages on dbt Hub.
There are two types of tests: generic tests and bespoke (singular) tests.
- Generic tests: parameterized tests that can be reused across different models. Note: the number of failing rows allowed before an error is raised can be configured, and failing rows can also be written to a table for later reference.
- Bespoke tests: SQL queries written against specific models; if no rows are returned, the test passes. Any number of tables can be joined to test a specific model.
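A sketch of both kinds of test (the model and column names are hypothetical). The generic test is parameterized and reusable across models; the bespoke test is a plain SQL file that passes when it returns no rows:

```sql
-- tests/generic/positive_values.sql: a reusable generic test,
-- applied to any model/column from a schema.yml file.
{% test positive_values(model, column_name) %}
select * from {{ model }} where {{ column_name }} < 0
{% endtest %}

-- tests/assert_no_negative_totals.sql: a bespoke (singular) test;
-- it passes when the query returns no rows.
select order_id, sum(amount) as total
from {{ ref('stg_payments') }}
group by order_id
having sum(amount) < 0
```

The number of allowed failing rows can be tuned with the `severity`, `warn_if`, and `error_if` test configurations, and `store_failures: true` writes the failing rows to a table for reference.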