In the previous post I described how to load Parquet exports into a local DuckDB database — a fast, free, file-based data warehouse you can have running in an afternoon. The raw tables are queryable straight away. But to turn them into something reliable, documented, and ready for an analytics environment, you need a transformation layer. That is what dbt Core provides.
What dbt does
dbt (data build tool) takes the messy work of SQL transformations — renaming columns, joining tables, applying business logic, calculating metrics — and turns it into a versioned, testable, documented codebase. Each transformation is a plain .sql file. dbt compiles them, resolves dependencies, and runs them in the right order. The result is a set of clean, analysis-ready tables and views in your database.
The open-source version, dbt Core, is free and runs entirely from the command line. A DuckDB adapter connects it to your .duckdb file with minimal configuration.
Installing dbt Core and the DuckDB adapter
Install both into your Python environment (Conda recommended):
pip install dbt-core dbt-duckdb
Verify the installation:
dbt --version
Initialising the project
Navigate to the folder where you want the dbt project to live — not inside the DWH folder — and run:
dbt init your_project_name
The wizard will ask for a project name (lowercase letters, digits, and underscores only) and a database type — select duckdb. This creates the project folder structure and writes a starter connection profile to ~/.dbt/profiles.yml.
Configuring the connection
Open ~/.dbt/profiles.yml and replace the contents with:
your_project_name:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /absolute/path/to/your/dwh.duckdb
The path must be the full absolute path — relative paths will not work reliably across terminal sessions. Test the connection with:
dbt debug
All checks should be green: profiles.yml valid, dbt_project.yml valid, git found, connection successful.
Project folder structure
After initialisation, the project looks like this:
your_project_name/
├── models/
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── tests/
├── macros/
├── seeds/
├── snapshots/
└── dbt_project.yml
The models/ directory is where all transformation logic lives. Everything else supports it.
The three-layer architecture
dbt projects conventionally follow a three-tier model structure. The tiers are not enforced by dbt itself — they are a well-established pattern that keeps transformations clean, maintainable, and easy to reason about.
1. Staging (models/staging/)
One model per source table. The only job of a staging model is to clean up the raw data: rename columns to a consistent snake_case convention, cast data types where needed, and make the data structurally consistent. No business logic belongs here.
Before writing staging models, register your raw tables in a sources.yml file. This tells dbt where to find the raw data and enables the `{{ source() }}` macro, which powers lineage tracking:
# models/staging/sources.yml
version: 2

sources:
  - name: main
    description: "Raw DWH tables loaded from Parquet exports"
    tables:
      - name: customers
      - name: contacts
      - name: orders
      - name: invoices
      - name: products
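Source definitions can optionally declare freshness expectations as well. A sketch, assuming the raw orders table carries the "updatedAt" timestamp column shown in the staging model, and that a week-old export should only warn while a month-old one should error — both thresholds are illustrative:

```yaml
# models/staging/sources.yml (optional freshness block, illustrative thresholds)
sources:
  - name: main
    tables:
      - name: orders
        loaded_at_field: '"updatedAt"'  # quoted because the raw column is camelCase
        freshness:
          warn_after: {count: 7, period: day}
          error_after: {count: 30, period: day}
```

Running `dbt source freshness` then reports how stale each configured source table is.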
A staging model then looks like this:
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('main', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        "customerId" AS customer_id,
        date AS order_date,
        status,
        "totalAmount" AS total_amount,
        "createdAt" AS created_at,
        "updatedAt" AS updated_at
    FROM source
)

SELECT * FROM renamed
The `{{ source() }}` macro is more than a table reference — it creates an explicit dependency that dbt uses to build the lineage graph and surface data freshness information.
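To see exactly what the macro renders to, run `dbt compile` and inspect the output under target/compiled/. For this project the compiled staging model should look roughly like the following — the database name is taken from your .duckdb file, so "dwh" here is an assumption:

```sql
-- target/compiled/.../stg_orders.sql (sketch; database name depends on your file)
SELECT * FROM "dwh"."main"."orders"
```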
2. Intermediate (models/intermediate/)
Intermediate models sit between staging and marts. They handle joins, enrichments, and any logic that is reusable across multiple final outputs but not meaningful enough to expose directly to consumers. Think of them as named building blocks.
-- models/intermediate/int_orders_with_customers.sql
SELECT
    o.order_id,
    o.order_date,
    o.status,
    o.total_amount,
    c.customer_name,
    c.customer_type
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c
    ON c.customer_id = o.customer_id
Like `{{ source() }}`, the `{{ ref() }}` macro creates tracked dependencies: dbt will always run referenced models before the models that depend on them.
3. Marts (models/marts/)
Marts are the final, analysis-ready outputs. They are designed for consumption: each mart answers a specific analytical question cleanly, at the right grain, with meaningful column names. These are what BI tools, notebooks, or other analytics environments connect to.
Marts are typically materialised as tables (not views) so that query performance in the consuming tool is fast and predictable:
-- models/marts/revenue_by_month.sql
SELECT
    DATE_TRUNC('month', order_date) AS month,
    customer_type,
    COUNT(order_id) AS order_count,
    ROUND(SUM(total_amount), 2) AS revenue
FROM {{ ref('int_orders_with_customers') }}
WHERE status = 'paid'
GROUP BY 1, 2
ORDER BY 1, 2
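Materialisation is configured per folder in dbt_project.yml rather than in each model file. A sketch, assuming the project name chosen at dbt init was your_project_name:

```yaml
# dbt_project.yml (excerpt)
models:
  your_project_name:
    staging:
      +materialized: view     # cheap to rebuild, always fresh
    intermediate:
      +materialized: view
    marts:
      +materialized: table    # precomputed for fast queries from BI tools
```

Individual models can override this with a `{{ config(materialized='table') }}` block at the top of the .sql file.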
Running models
dbt run # build all models
dbt run -s staging # build only the staging layer
dbt run -s +marts # build marts and all their upstream dependencies
Adding tests
dbt has a built-in testing framework. Column-level tests are defined in a schema.yml file alongside the models:
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
Run all tests with:
dbt test
Failed tests show exactly which rows violated which constraints — invaluable for catching data quality issues before they reach your analytics layer.
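Beyond these generic column tests, dbt also supports singular tests: any .sql file in the tests/ folder that selects rows which should not exist. If the query returns zero rows, the test passes. A sketch using the stg_orders columns from this project — the business rule itself is a hypothetical example:

```sql
-- tests/assert_no_negative_order_totals.sql
-- Fails if any order has a negative total amount
SELECT
    order_id,
    total_amount
FROM {{ ref('stg_orders') }}
WHERE total_amount < 0
```

dbt test picks these up automatically alongside the schema.yml tests.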
Generating documentation
dbt generates a full documentation site from your models, sources, and schema definitions:
dbt docs generate
dbt docs serve
The docs site includes a lineage graph that shows every model, source, and mart as nodes, with edges showing how data flows between them. It is the clearest visual representation of what the data warehouse actually does — and it stays up to date automatically as models change.
Working in Positron with dbt Power User
Positron is a data science IDE from Posit, built on the VS Code engine. The dbt Power User extension (by Altimate Inc.) integrates deeply with dbt Core and makes the day-to-day experience noticeably better:
- A lineage panel shows the upstream and downstream dependencies of whatever model you have open
- Model previews let you run a model and inspect the output directly in the IDE
- Autocomplete for `{{ source() }}` and `{{ ref() }}` macros
- Run and test buttons so you rarely need to switch to the terminal
Open the dbt project folder in Positron (File → Open Folder) and the extension auto-detects dbt_project.yml. The setup wizard walks through selecting your Python interpreter (the environment where dbt is installed) and validating the connection.
Version control with Git
dbt init creates a .gitignore that already excludes the right folders (target/, dbt_packages/, logs/). Initialise the repository and make an initial commit:
git init
git add .
git commit -m "Initial dbt project setup"
After each meaningful change — a new model, a new test, updated documentation — commit via the terminal or Positron’s built-in Source Control panel. Keeping the history clean makes it easy to see what changed and why, and to roll back if something breaks.
One important note: never commit profiles.yml to Git. It lives in ~/.dbt/ outside the project folder and contains local file paths. It should stay local.
Refreshing the data
When new Parquet exports arrive, a full refresh takes three steps:
./setup_dwh.sh # rebuild the raw DWH from new parquet files
dbt run # rebuild all transformed models
dbt test # verify data quality
The entire cycle — from fresh exports to tested, analysis-ready mart tables — runs in seconds for datasets of typical small-to-medium size. No cloud credits, no orchestration infrastructure, no maintenance overhead. For any project where data arrives periodically as flat-file exports, this stack is hard to beat.