dbt Primer #
A bite sized chunk of knowledge for using dbt (data build tool).
Installation #
First up, install dbt and dbt tools from the command line.
pip install dbt
pip install dbt-core
We’ll need to also specific adapters for the type of data platform (BigQuery, Snowflake, Spark, etc.):
Check the documentation for the types of supported data platforms.
Let’s say we’re using BigQuery:
pip install dbt-bigquery
Check version with:
dbt --version
Profile #
The profile defines how to connect a data platform.
This exists in the home directory’s .dbt/ directory in file profiles.yml. The top line should match the project name (from above).
(As always, check out the documentation for more details).
For a BigQuery implementation, the profile might look something like:
<project name>:
target: test
outputs:
test:
type: bigquery
method: service-account
keyfile: <path>.json
project: <project id>
dataset: dbt_<name>
threads: 1
timeout_seconds: 300
location: US
priority: interactive # or interactive
dev:
...
prod:
...
Where ^ project name would relate to the GCP project name you’ve set up, dataset is the BigQuery dataset name (within the database).
To check that the connection works, run dbt debug. If everything works, the tests should pass.
Projects #
To create a dbt project:
dbt init <project name>
You’ll be asked to select from a list of choices the data platform you want to use (based on the adapters that you previously installed.)
Once it gets created, there will be a skeleton full of files and directories.
The skeleton usually starts off with the following folders:
-
analysis -
data -
macros- user-defined functions that be be re-used -
models- core foundation of dbt for data models -
snapshots -
tests
As well as a dbt_project.yml file.
dbt_project.yml - every project needs one. This file dictates how the project works.
This ^ will include a chunk that handles models (see below). The + is recursive downstream.
models:
<project>:
<models sub-directory>:
+materialized: view # or table
<models sub-sub-directory, assuming there's no + in the parent dir specification>:
materialized: table
Models #
Basically a bunch of SQL files. Each file is a model (table).
models folder in the project contains all the SQL files. This is truly the heart of dbt.
Common formats int he SQL files are:
with <something> as (
select ...
)
select *
from <something>
Models can reference other models. For instance, if there’s something called model_one.sql, another sql file could include something like:
select *
from {{ ref(model_one) }}
where id = <some unique identifier>
Staging #
Run #
dbt run commmand line run executes models in the models directory.
For a full refresh:
dbt run --full-refresh
For select updates of particular models:
dbt run --select <model name(s)>
To run everything that builds up to a particular target model:
dbt run --seclect +<model_name>
Materialization #
Specification of how something gets created, such as as a table.
Jinja #
Macros galore.
Testing #
Validate that the models are working correctly.
schema.yml – version: 2 at the top
Beneath that,
models:
- name: <model/table name>
columns:
- name: <name>
tests:
- unique
- not_null
Run tests with:
dbt test
Documentation #
To build dbt documentation:
dbt docs generate
Launch it locally with:
dbt docs serve
There’s a handy graph to show the relationship between the models.
Debug #
Look for errors from within the dbt project directory:
dbt debug
This will raise errors if any are found.