Most companies starting with AI don't have a modeling problem. They have a data pipeline problem. The pattern is familiar: an engineer writes a Python script to pull data from an API, does some light cleaning, and exports the result for analysis or training. It works at first. But over time, these scripts accumulate, each one solving a narrow problem, none of them designed as part of a cohesive system. Eventually, no one fully trusts the data, and no one wants to touch the scripts.
This is what I call the spaghetti script problem, and it's one of the biggest blockers to building reliable AI systems.
Fragile Ingestion: Where Problems Begin
Early pipelines often start as simple scripts like this:
import requests
import pandas as pd
url = "https://api.stripe.com/v1/customers"
response = requests.get(url, headers={"Authorization": "Bearer ..."})
data = response.json()["data"]
df = pd.DataFrame(data)
df.to_csv("customers.csv", index=False)
This works, until something changes. Maybe the API introduces pagination, authentication expires, or the script fails silently during execution. Now your dataset is incomplete, and your model is training on partial data.
The core issue isn't code quality. It's architecture. These scripts aren't designed for reliability, monitoring, or recovery. This is why many teams adopt dedicated ingestion tools like Fivetran or Airbyte; they handle schema changes, retries, and synchronization automatically, loading data into a central warehouse like BigQuery or Snowflake. The goal is simple: ensure the warehouse contains a complete, reliable copy of the source systems. Without that foundation, everything downstream becomes questionable.
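A more resilient ingestion loop handles pagination and retries explicitly. Here is a minimal pure-Python sketch of that pattern; the `fetch_page` callable, its cursor convention, and the retry and backoff numbers are illustrative assumptions, not any vendor's actual API:

```python
import time

def fetch_all(fetch_page, max_retries=3):
    """Pull every record from a paginated source, retrying transient
    failures instead of silently returning partial data.

    `fetch_page(cursor)` is an assumed callable returning
    (records, next_cursor), with next_cursor None on the last page.
    """
    records, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, cursor = fetch_page(cursor)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # surface the failure instead of hiding it
                time.sleep(0.1 * 2 ** attempt)  # exponential backoff
        records.extend(page)
        if cursor is None:
            return records
```

This is exactly the undifferentiated work that managed connectors take off your plate; the point of the sketch is how much machinery even a "simple" reliable pull requires.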
Raw Data Isn't Model-Ready
Once data lands in the warehouse, it still isn't ready for machine learning. Application databases are designed for transactional efficiency, not learning; they rely heavily on IDs and normalized schemas. For example, a raw orders table might look like this:
select
order_id,
customer_id,
status_code,
created_at,
total_amount
from raw.orders
On its own, this table doesn't tell you much. What kind of customer is this? Is this their first order or their fiftieth? Are they active or dormant? Machine learning models learn from patterns in behavior, not isolated events. To extract those patterns, you need transformation.
Transformation: Turning Events into Entities
This is where tools like dbt come in. Instead of working with fragmented tables, you build models that represent meaningful business entities. For example:
-- models/marts/customer_orders.sql
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('stg_customers') }}
)
select
customers.customer_id,
customers.email,
count(orders.order_id) as total_orders,
sum(orders.total_amount) as lifetime_value,
max(orders.created_at) as last_order_date,
date_diff(current_date, max(orders.created_at), day)
as days_since_last_order
from customers
left join orders
on customers.customer_id = orders.customer_id
group by 1, 2
This model transforms raw transactions into behavioral features. Instead of individual orders, you now have customer-level summaries, and this is the level where machine learning becomes effective.
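To make the shape of that output concrete, the same rollup can be sketched in plain Python. The field names follow the SQL above; the list-of-dicts input shape is an assumption for illustration, not production code:

```python
from collections import defaultdict
from datetime import date

def customer_orders(customers, orders, today):
    """Aggregate raw orders into one behavioral row per customer,
    mirroring the dbt model above. Inputs are lists of dicts."""
    by_customer = defaultdict(list)
    for o in orders:
        by_customer[o["customer_id"]].append(o)
    rows = []
    for c in customers:
        cust_orders = by_customer.get(c["customer_id"], [])
        last = max((o["created_at"] for o in cust_orders), default=None)
        rows.append({
            "customer_id": c["customer_id"],
            "email": c["email"],
            "total_orders": len(cust_orders),
            "lifetime_value": sum(o["total_amount"] for o in cust_orders),
            "last_order_date": last,
            "days_since_last_order": (today - last).days if last else None,
        })
    return rows
```

Note the left-join semantics: a customer with no orders still gets a row, with zero totals, rather than silently disappearing from the training set.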
Feature Engineering
One of the biggest misconceptions about AI is that performance comes primarily from model selection. In reality, feature quality usually matters more. A raw dataset might include customer_id and order_timestamp, but a useful feature is orders_last_30_days:
select
customer_id,
count(*) as orders_last_30_days
from orders
where order_timestamp >= current_timestamp - interval 30 day
group by customer_id
This feature captures behavior over time, the kind of behavioral signal that allows models to make accurate predictions. Without it, even advanced models have limited signal to learn from.
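A minimal Python sketch of the same windowed count, assuming orders arrive as dicts with `customer_id` and `order_timestamp` keys:

```python
from collections import Counter
from datetime import datetime, timedelta

def orders_last_30_days(orders, now):
    """Count orders per customer inside a trailing 30-day window,
    mirroring the SQL above."""
    cutoff = now - timedelta(days=30)
    return Counter(
        o["customer_id"] for o in orders if o["order_timestamp"] >= cutoff
    )
```

The important design choice is the explicit `now` parameter: passing the reference time in (rather than calling the clock inside the function) keeps the feature reproducible, so you can recompute it for any historical training date.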
Standardization
If every forge stamps a different mark on every shield, the commanding officers can't tell which units are armed and which aren't. The same problem shows up in your data when different teams use different strings for the same concept.
A common example is inconsistent categorical data, with the same concept appearing as:
- Paid Social
- paid social
- PS
- pdsoc
These values represent the same concept but appear differently in the raw data. A transformation model can standardize them:
case
when lower(channel) in ('paid social', 'ps', 'pdsoc')
then 'Paid Social'
when lower(channel) in ('organic search', 'seo')
then 'Organic Search'
else 'Other'
end as channel_group
Without this step, your model treats each variation as an unrelated concept, a silent error that corrupts every analysis downstream.
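The same normalization is easy to sketch in Python as a lookup table; the mapping below mirrors the CASE expression above and is deliberately not exhaustive:

```python
# Canonical label for each known raw variant (illustrative, not exhaustive).
CHANNEL_GROUPS = {
    "paid social": "Paid Social",
    "ps": "Paid Social",
    "pdsoc": "Paid Social",
    "organic search": "Organic Search",
    "seo": "Organic Search",
}

def normalize_channel(raw):
    """Map a raw channel string onto one canonical label,
    falling back to 'Other' for anything unrecognized."""
    return CHANNEL_GROUPS.get(raw.strip().lower(), "Other")
```

Keeping the mapping in one place, whether a dict like this or a seed table in dbt, means new variants get added once instead of being patched ad hoc in every downstream query.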
Data Quality Requires Enforcement, Not Assumptions
Even well-designed pipelines degrade over time. New values appear, null rates increase, and upstream systems change. Modern pipelines include tests to detect these issues early. For example, a dbt test might enforce uniqueness:
models:
- name: customer_orders
columns:
- name: customer_id
tests:
- unique
- not_null
These tests act as guardrails; they catch problems early, before they affect models or analytics.
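Outside the warehouse, the same contract can be checked in plain Python. This sketch mirrors the `unique` and `not_null` tests above; the function name and the rows-as-dicts shape are illustrative assumptions:

```python
def check_unique_not_null(rows, column):
    """Return human-readable failures for a column that should be
    unique and non-null, the same contract the dbt tests enforce."""
    failures = []
    values = [r.get(column) for r in rows]
    if any(v is None for v in values):
        failures.append(f"{column} contains nulls")
    non_null = [v for v in values if v is not None]
    if len(non_null) != len(set(non_null)):
        failures.append(f"{column} contains duplicates")
    return failures
```

Run at the end of a pipeline step, a non-empty result should fail the run loudly, which is the whole point: a pipeline that stops is recoverable, while one that quietly ships duplicates is not.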
What AI-Ready Data Actually Means
AI-ready data isn't defined by volume. It's defined by structure and reliability. Before training a model, you should be able to answer yes to three questions:
- Is the data reliably ingested? Can you trust that it's complete?
- Is the data organized around meaningful entities? Customers, sessions, products, not just events.
- Does the dataset include behavioral features? Aggregations, recency, and trends, not just raw logs.
If the answer is no, the bottleneck isn't your model. It's your pipeline.
AI systems don't begin with modeling. They begin with data engineering. The teams that actually ship reliable AI products aren't the ones chasing the newest model releases. They're the ones who spent six weeks upfront building boring, reliable ingestion and transformation before writing a single line of training code.
Your model can only learn what you show it. Spaghetti scripts don't teach. They confuse.
