engineering

Celebrating dbt

Eric Thanenthiran·3 November 2025·7 min read

In two short months, Fivetran has acquired two of the biggest names in the data world: dbt Labs and Tobiko Data. dbt in particular, since its release in March 2016, has become an integral (some might say de facto) part of the modern data stack. It rapidly gained market share from more established players such as Alteryx, Matillion and Informatica because it brought modern software engineering principles to the data world and was released under an open-source licence.

Now, despite both companies' press releases committing to keep dbt under an open-source licence, Fivetran's preference for a proprietary model and its aggressive pricing strategy have left a lot of us fearful of losing dbt. So much so that Oliver Laslett, CTO at Lightdash, dubbed the surge in forks of the dbt-core repo in the wake of the merger announcement the dbt Fear Index. dbt is so prolific in modern data teams that I don't think he's wrong. Personally, I love dbt, but I'm also excited to see what comes next. The world now is very different to the one dbt was originally built in, and I'm sure there are more than a few projects out there that will vie to take the crown from the fallen king.

Rather than be sad, I'd like to celebrate what dbt got right, and why it was and still remains such an amazing tool. Most of the features below fall under the banner of bringing modern software engineering practices to the data world. By doing this, dbt introduced a level of rigour and professionalism to even small teams. It was opinionated about what constituted good practice and built a community around the tool. Whoever vies to take the crown from dbt needs to recognise that these features are the baseline from which they start.

SQL First, YAML Second, Jinja Third

Despite all the buzz around Python and Rust, SQL is still the most popular language for working with data. Data folks are SQL natives, and dbt recognised this by letting them write data transformations in pure SQL, with Jinja templating layered on top for more advanced logic. dbt allows you to write transformations as SQL SELECT statements, which it then compiles and runs in your data warehouse. This design decision meant that all transformations ran in the warehouse layer. All transformations are expressed as data models and stored in separate files, which made it possible to enforce some structure on data projects and made every transformation visible. No more hunting through a huge list of stored procedures to understand how a particular table was generated. All configuration, documentation and most testing is specified through YAML files. Though not the most elegant format, YAML lowers the barrier to configuring, documenting and testing, and makes it easy to keep these up to date as the platform grows.
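As a minimal sketch (the file path and column names here are invented for illustration), a dbt model is just a SELECT statement saved as a .sql file; dbt wraps it in the DDL needed to create the view or table for you:

-- models/staging/stg_customers.sql
-- dbt compiles this SELECT and creates a relation named after the file
-- (a view by default), so there is no hand-written CREATE statement.
select
    id as customer_id,
    lower(email) as email,
    created_at
from raw.crm.customers

The raw table is hardcoded here only to keep the example small; the next section shows the source() function dbt encourages you to use instead.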

Data Lineage

Relationships between models and between the different stages of a data processing pipeline are established using the ref() and source() functions. This is the core of dbt's lineage tracking. Instead of hardcoding table names in your SQL, you use {{ ref('model_name') }}. When dbt compiles your project, it resolves each ref() to the actual table or view name in your warehouse, records the dependency relationship between models, and builds a directed acyclic graph (DAG) of how models relate to each other.

The source() function is similar: {{ source('source_name', 'table_name') }} defines dependencies on raw data tables. This captures where your transformations begin and tracks lineage from source systems through to final models. dbt parses all of your model files and constructs the dependency graph automatically, which determines the correct order in which to run models (upstream models run first).
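A hedged sketch of what this looks like in a project (the source, model and column names are made up): raw tables are declared once in a YAML file, and models then reference sources and each other rather than physical table names.

# models/staging/sources.yml
sources:
  - name: shop
    schema: raw_shop
    tables:
      - name: orders

A downstream model can then pull from both the declared source and another model:

-- models/marts/customer_orders.sql
-- dbt resolves ref() and source() at compile time and records both
-- dependencies in the project DAG.
select
    c.customer_id,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ source('shop', 'orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id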

Even better, all of this information is exposed to users through the manifest.json file and can be served up visually through the documentation website.

Testing

Test-driven development has become the de facto standard in the software engineering world, and dbt's testing framework made it simple for data folks to adopt similar practices in their projects. By providing a simple, flexible framework driven through YAML files and custom SQL tests, it let users write tests easily without having to learn yet another testing tool. On every run, users could execute simple tests that alerted on data quality issues and failures. By making this a native feature, dbt reduced the barrier to entry, even providing four basic tests out of the box: unique, not_null, accepted_values and relationships.

Adding data quality tests to your platform was as simple as adding the config below to your project.

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'pending']
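Custom SQL tests are almost as lightweight. As a sketch (the file and model names here are hypothetical), a singular test is just a query saved under tests/; dbt treats the test as failed if the query returns any rows:

-- tests/assert_no_future_orders.sql
-- dbt runs this query and fails the test if any rows come back
select *
from {{ ref('orders') }}
where order_date > current_date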

Materialisation

dbt separates the materialisation of your data from your modelling logic. This means that the modelling of data is independent of how you choose to persist that data in the warehouse. You can turn a view into a table with a simple configuration change, without worrying about writing DDL statements. You could use one of the four native materialisations (ephemeral, view, table or incremental) out of the box, but if you had niche requirements you could also create your own.

In particular, the incremental materialisation greatly simplified the processing of large tables. No longer did you need to write complex logic just to process the latest data; configuration in the config block told dbt which columns to use to identify new records. This helps keep costs down when processing large tables. Ultimately, materialisations give the user the flexibility to optimise each model individually based on data volumes, query patterns and freshness requirements.

{{ config(
    materialized='table'
) }}
 
select *
from {{ ref('source_model') }}
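And a hedged sketch of an incremental model (the model and column names are invented; the pattern itself, using unique_key and the is_incremental() macro, is standard dbt):

{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select *
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than those already loaded;
  -- the "this" variable refers to the existing target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}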

DevOps

What DevOps did for software engineering, it has also done for dbt: DevOps practices maintain trust by protecting production datasets from developers. dbt projects are code repositories. All changes to business logic, tests and documentation are tracked in version control and can be reviewed by peers through standard code review processes. Once the code is in version control, managing deployments to different environments is the logical next step.

CI/CD pipelines allowed automated and dependable runs. Each change to the codebase could pass through a series of formatting and verification steps to ensure it met minimum standards. All of these features make dbt projects manageable at scale, with proper testing, deployment automation and collaborative workflows.
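As a rough sketch of what such a pipeline might run (the linter and the slim-CI flags are common choices rather than requirements, and the state path and target name are placeholders):

# install packages declared in packages.yml
dbt deps

# lint SQL style (sqlfluff is a popular, separate tool)
sqlfluff lint models/

# build and test only the models changed on this branch ("slim CI"),
# comparing against the manifest from the last production run
dbt build --select state:modified+ --defer --state ./prod-artifacts --target ci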

Documentation

All of the previous features could have been locked in a silo without the strong documentation features dbt provides. A simple but powerful documentation website can be generated with two commands, making the modelled business logic transparent and discoverable. The site includes an interactive data lineage visualisation of the whole project; a searchable catalogue of all models, sources and macros; detailed information about each data model; column-level details including data types; and the compiled SQL for each model. This documentation very quickly becomes a central knowledge base for the data team (and even for stakeholders if it is hosted externally). It greatly reduces onboarding time and makes it easier for non-technical stakeholders to understand the available data assets. It's particularly valuable because it follows the same change management process as the actual code: as models change, the documentation is updated alongside them.
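For reference, the two commands are part of the standard dbt CLI:

# compile the project and generate the catalog and manifest files
dbt docs generate

# serve the documentation website locally
dbt docs serve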

Extensibility

dbt's packaging features enable code reuse and collaboration across projects and teams. As in software engineering, packages let data teams build on top of work done by their peers. For example, rather than writing every test by hand, a package like dbt-expectations lets you set up quite detailed tests just by importing it into your project.
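As a sketch (package coordinates and versions change over time, so check the dbt package hub for the current ones, and the model and column names are hypothetical), a dependency is declared in packages.yml, installed with dbt deps, and its tests are then used like any built-in test:

# packages.yml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

# models/schema.yml
models:
  - name: orders
    columns:
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0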

What next?

Going through this list makes me realise how successful a project dbt was. It solved so many issues that data teams had wrestled with and tried to solve individually. It erred towards simplicity and ease of use, and it introduced so many great software engineering practices to data teams. No matter what happens in the future, whatever new technology we move to can't help but be inspired by dbt, and to succeed it will need to provide many of the same features: it's what people expect.

The king is dead,

Long live the king.

dataplatform