This is what I call Spaghetti Analytics. It works for a team of one, but it breaks for a team of ten.
Working with raw SQL scripts is like trying to navigate Hogwarts on your first day. The stairs move, doors lead to different places depending on the day, and nothing is where it seems. Every time a column name changes upstream, your old path leads into a solid wall. dbt is the Marauder's Map. Because of the ref() function, it knows exactly where every room is at all times. If a room moves, the map updates automatically. And like a sentry, dbt's tests flag bad data before it reaches the business users.
To build a modern, scalable data stack, you need to stop treating SQL like a scripting language and start treating it like software engineering. That is what dbt (data build tool) enables: SQL transformations developed using software engineering practices like modularity, testing, documentation, and Git-based version control.
The Layer Cake Architecture
The biggest shift in dbt is moving away from single, monolithic queries toward modular models. Instead of one query that goes from raw data to final dashboard, dbt encourages splitting logic into distinct layers, which improves readability, testing, and reuse.
Staging
Clean and standardize source tables. Avoid joins and aggregations at this layer. Focus entirely on cleaning. This creates a reliable version of each source table that all downstream models can trust.
```sql
-- models/staging/stg_stripe__payments.sql
with source as (
    select * from {{ source('stripe', 'charges') }}
),

cleaned as (
    select
        id as payment_id,
        order_id,
        amount / 100.0 as amount_usd,
        created as created_at,
        status
    from source
)

select * from cleaned
```

This model fixes naming, types, and units once, so downstream models don't have to repeat that logic.
Intermediate
Apply joins and business logic. This is where relationships between entities are defined.
```sql
select *
from {{ ref('stg_stripe__payments') }} payments
left join {{ ref('stg_orders') }} orders
    on payments.order_id = orders.order_id
```

The ref() function is one of dbt's most important features. It tells dbt how models depend on each other and automatically builds models in the correct order, creating a dependency graph behind the scenes. Intermediate models are typically not exposed directly to BI tools; instead, they feed downstream models.
Marts
Create wide, denormalized tables organized by business concept: customers, orders, subscriptions. These are the tables analysts actually query.
```sql
select
    customer_id,
    count(order_id) as total_orders,
    sum(amount_usd) as lifetime_value,
    max(created_at) as last_order_date
from {{ ref('int_customer_orders') }}
group by customer_id
```

These models act as the foundation for dashboards and analytics. Instead of forcing analysts to join five tables, the logic is already centralized.
Tests for Data, Not Just Code
In software engineering, tests protect against bad logic. In data engineering, tests protect against bad data. dbt allows you to define tests that run as part of your pipeline:
```yaml
version: 2

models:
  - name: stg_stripe__payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['succeeded', 'failed', 'refunded']
```

These tests ensure your assumptions remain true. If a test fails, you know immediately, which prevents bad data from silently propagating into dashboards and models.
Snapshots: Tracking Historical Changes
Most application databases only store the current state. If a customer moves from New York to London, the old value is overwritten, which breaks historical reporting. dbt provides built-in support for Slowly Changing Dimension Type 2 using snapshots. The result looks like this:
| customer_id | city | valid_from | valid_to |
|---|---|---|---|
| 101 | New York | Jan 1 | Feb 15 |
| 101 | London | Feb 15 | NULL |
Both rows are preserved, so you can always answer "What city was this customer in on January 15th?", a question that's impossible to answer if you only store the current state.
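A snapshot that produces a table like the one above might look like the following sketch; the source model name, target schema, and watched column are assumptions, and dbt names the validity columns dbt_valid_from and dbt_valid_to rather than valid_from and valid_to.

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='check',
      check_cols=['city']
    )
}}

-- Each snapshot run compares current rows against the previous
-- snapshot and closes out rows whose watched columns changed.
select customer_id, city
from {{ ref('stg_customers') }}

{% endsnapshot %}
```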
Lineage: Understanding Dependencies
Because dbt models reference each other using ref(), dbt builds a full dependency graph: a live map showing every connection, every dependency, every footprint. You can answer "If I change this model, what breaks?" before you touch a single file.
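In practice, you can query that graph from the command line using dbt's node selection syntax. A quick sketch, reusing the staging model from the examples above:

```shell
# List everything downstream of the staging model --
# i.e., what could break if it changes
dbt ls --select stg_stripe__payments+

# Rebuild the model and all of its downstream dependents
dbt run --select stg_stripe__payments+

# Generate the docs site, which includes the interactive lineage graph
dbt docs generate && dbt docs serve
```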
Materializations: Controlling How Models Are Built
Materializations define how dbt stores your models in the warehouse. This is a core concept that affects both performance and cost.
```sql
{{ config(materialized='table') }}

select * from {{ ref('stg_stripe__payments') }}
```

Common materializations include:
- view: Always reflects the latest data, rebuilt on every query
- table: Physically stored and fully rebuilt on each dbt run
- incremental: Processes only new or changed records, essential for large datasets
```sql
{{ config(materialized='incremental') }}

select *
from raw.events

{% if is_incremental() %}
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```

Incremental models allow pipelines to scale efficiently: instead of rebuilding everything, dbt only processes new records.
Real World Example: The One Big File Problem
The most common anti-pattern I see when working with dbt and BigQuery is the "One Big File." A single SQL file handles cleaning, joins, aggregation, and formatting, making debugging extremely difficult. When something breaks, everything breaks. You can't isolate the problem.
Enforcing layers solves this. Problems become isolated: cleaning issues live in staging, logic issues live in intermediate, aggregation issues live in marts. This dramatically improves both reliability and developer velocity.
Engineering Over Scripting
If your business logic lives in disconnected SQL scripts, your data platform will eventually break. dbt gives you a different way to build it:
- Modular models
- Reusable logic
- Automated testing
- Version-controlled transformations
In other words: software engineering. This is the difference between a collection of SQL queries and a real data platform.
