Data systems don’t collapse overnight. They grow slowly, query by query.
“What breaks when I change a table?”
A dashboard needs a new metric, so someone writes a quick SQL query. Another team needs a slightly different version of the same dataset, so they copy the query and modify it. A scheduled job appears. A stored procedure is added. Someone creates a derived table directly in the warehouse.
Months later, the system looks nothing like the simple set of transformations it once was.
Business logic is scattered across scripts, dashboards, and scheduled queries. Nobody is entirely sure which datasets depend on which transformations. Making even a small change feels risky. A handful of engineers become the only ones who truly understand how the system works because there is no documentation.
Many organizations eventually find themselves trapped in what can only be described as a SQL jungle.
In this article we explore how systems end up in this state, how to recognize the warning signs, and how to bring structure back to analytical transformations. We’ll look at the principles behind a well-managed transformation layer, how it fits into a modern data platform, and common anti-patterns to avoid.
To understand the “SQL jungle” we first need to look at how modern data architectures evolved.
Historically, data engineers built pipelines that followed an ETL structure:
Extract --> Transform --> Load
Data was extracted from operational systems, transformed using pipeline tools, and then loaded into a data warehouse. Transformations were implemented in tools such as SSIS, Spark or Python pipelines.
Because these pipelines were complex and infrastructure-heavy, analysts depended heavily on data engineers to create new datasets or transformations.
Modern architectures have largely flipped this model:
Extract --> Load --> Transform
Instead of transforming data before loading it, organizations now load raw data directly into the warehouse, and transformations happen there. This architecture dramatically simplifies ingestion and enables analysts to work directly with SQL in the warehouse.
It also introduced an unintended side effect.
In the ELT architecture, analysts can transform data themselves. This unlocked much faster iteration but also introduced a new challenge. The dependency on data engineers disappeared, but so did the structure that engineering pipelines provided.
Transformations can now be created by anyone (analysts, data scientists, engineers) in any place (BI tools, notebooks, warehouse tables, SQL jobs).
Over time, business logic grew organically inside the warehouse. Transformations accumulated as scripts, stored procedures, triggers and scheduled jobs. Before long, the system turned into a dense jungle of SQL logic and a lot of manual (re-)work.
In summary:
ETL centralized transformation logic in engineering pipelines.
ELT democratized transformations by moving them into the warehouse.
Without structure, transformations grow unmanaged, resulting in a system that becomes undocumented, fragile and inconsistent. A system in which different dashboards may compute the same metric in different ways and business logic becomes duplicated across queries, reports, and tables.
In this article we use a transformation layer to manage transformations inside the warehouse effectively. This layer combines the engineering discipline of ETL pipelines with the speed and flexibility of the ELT architecture:
The transformation layer brings engineering discipline to analytical transformations.
When implemented successfully, the transformation layer becomes the single place where business logic is defined and maintained. It acts as the semantic backbone of the data platform, bridging the gap between raw operational data and business-facing analytical models.
Without the transformation layer, organizations often accumulate large amounts of data but struggle to turn it into reliable information. The reason is that business logic tends to spread across the platform: metrics get redefined in dashboards, notebooks, ad-hoc queries, and more.
Over time this leads to one of the most common problems in analytics: multiple conflicting definitions of the same metric.
If the core problem is unmanaged transformations, the next logical question is:
What would well-managed transformations look like?
Analytical transformations should follow the same engineering principles we expect in software systems, going from ad-hoc scripts scattered across databases to “transformations as maintainable software components”.
In this section, we discuss the requirements a transformation layer must meet in order to properly manage transformations and, in doing so, tame the SQL jungle.
Instead of large SQL scripts or stored procedures, transformations are broken up into small, composable models.
To be clear: a model is just a SQL query stored as a file. The query defines how one dataset is built from other datasets.
The examples below show how the data transformation and modeling tool dbt creates models. Each tool has its own conventions; the principle of turning scripts into components matters more than the specific implementation.
Examples:
-- models/staging/stg_orders.sql
select
order_id,
customer_id,
amount,
order_date
from raw.orders
When executed, this query materializes as a table (staging.stg_orders) or view in your warehouse. Models can then build on top of one another via references:
-- models/intermediate/int_customer_orders.sql
select
customer_id,
sum(amount) as total_spent
from {{ ref('stg_orders') }}
group by customer_id
And:
-- models/marts/customer_revenue.sql
select
c.customer_id,
c.name,
o.total_spent
from {{ ref('int_customer_orders') }} o
join {{ ref('stg_customers') }} c using (customer_id)
This creates a dependency graph:
stg_orders
↓
int_customer_orders
↓
customer_revenue
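To make the chain concrete, the same dependency graph can be sketched end-to-end with plain SQL views, here using Python’s built-in sqlite3 module. The table names and sample rows are invented for illustration; in dbt each view below would live in its own model file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-ins for the raw.orders and raw.customers source tables
cur.execute("create table raw_orders (order_id int, customer_id int, amount real, order_date text)")
cur.execute("create table raw_customers (customer_id int, name text)")
cur.executemany("insert into raw_orders values (?, ?, ?, ?)", [
    (1, 10, 50.0, "2024-01-01"),
    (2, 10, 25.0, "2024-01-02"),
    (3, 20, 80.0, "2024-01-03"),
])
cur.executemany("insert into raw_customers values (?, ?)", [(10, "Ada"), (20, "Grace")])

# Each "model" is a view built on the one before it, mirroring the graph
cur.execute("create view stg_orders as select order_id, customer_id, amount, order_date from raw_orders")
cur.execute("create view stg_customers as select customer_id, name from raw_customers")
cur.execute("""
    create view int_customer_orders as
    select customer_id, sum(amount) as total_spent
    from stg_orders group by customer_id
""")
cur.execute("""
    create view customer_revenue as
    select customer_id, c.name, o.total_spent
    from int_customer_orders o
    join stg_customers c using (customer_id)
""")

print(cur.execute("select * from customer_revenue order by customer_id").fetchall())
# → [(10, 'Ada', 75.0), (20, 'Grace', 80.0)]
```

Notice that changing the cleanup logic in stg_orders automatically flows through to customer_revenue: each layer only needs to know about the layer directly below it.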
Each model has a single responsibility and builds upon other models by referencing them (e.g. ref('stg_orders')). This approach has major advantages: the resulting system of transformations is easier to read, understand, maintain, and evolve.
A managed system stores transformations in version-controlled code repositories. Think of this as a project that contains SQL files instead of SQL being stored in a database. It’s similar to how a software project contains source code.
This enables practices that are familiar in software engineering but historically rare in data pipelines:
Instead of editing SQL directly in production databases, engineers and analysts work in a controlled development workflow, even being able to experiment in branches.
Another key capability a managed transformation system should provide is the ability to define and run data tests.
Typical examples include checking that primary keys are unique, that required columns are never null, and that references between tables resolve correctly.
These tests validate assumptions about the data and help catch issues early. Without them, pipelines often fail silently: incorrect results propagate downstream until someone notices a broken dashboard.
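In dbt, such tests are declared next to the models in a YAML file. A minimal sketch, using the columns from the earlier examples:

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:          # every order must belong to a known customer
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` then checks these assumptions against the warehouse and fails loudly when they no longer hold.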
A managed transformation framework also provides visibility into the data system itself.
This typically includes:
This dramatically reduces reliance on tribal knowledge. New team members can explore the system rather than relying on a single person who “knows how everything works.”
Another common pattern introduced by managed transformation frameworks is the ability to separate transformation layers.
For example, you might utilize the following layers:
raw
staging
intermediate
marts
These layers are often implemented as separate schemas in the warehouse.
Each layer has a specific purpose:
This layered approach prevents analytical logic from becoming tightly coupled to raw ingestion tables.
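In a dbt-style project, these layers typically map directly to folders, which makes the structure visible in the repository itself. A sketch, following the model names used earlier:

```text
models/
├── staging/        # 1:1 cleanup of raw sources (stg_orders, stg_customers)
├── intermediate/   # reusable business logic (int_customer_orders)
└── marts/          # business-facing models (customer_revenue)
```

Raw data itself usually lives in its own warehouse schema, outside the project, since very little transformation should happen there.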
With the previous sections in mind, it becomes clear where a managed transformation framework fits within a broader data architecture.
A simplified modern data platform often looks like this:
Operational systems / APIs
↓
1. Data ingestion
↓
2. Raw data
↓
3. Transformation layer
↓
4. Analytics layer
Each layer has a distinct responsibility.
Responsibility: moving data into the warehouse with minimal transformation. Tools typically include custom ingestion scripts, Kafka, or Airbyte.
Responsibility: storing data as close as possible to the source system. This layer prioritizes completeness, reproducibility, and traceability. Very little transformation should happen here.
This is where the main modelling work happens.
This layer converts raw datasets into structured, reusable analytical models. Typical tasks consist of cleaning and standardizing data, joining datasets, defining business logic, creating aggregated tables and defining metrics.
This is the layer where frameworks like dbt or SQLMesh operate. Their role is to ensure these transformations are modular, tested, documented, and version controlled.
Without this layer, transformation logic tends to fragment across queries, dashboards, and scripts.
This layer consumes the modeled datasets. Typical consumers include BI tools like Tableau or PowerBI, data science workflows, machine learning pipelines and internal data applications.
These tools can rely on consistent definitions of business metrics since transformations are centralized in the modelling layer.
Several tools attempt to address the challenge of the transformation layer. Two well-known examples are dbt and SQLMesh. These tools make it easy to get started applying structure to your transformations.
Just remember that these tools are not the architecture itself; they are frameworks that help implement the architectural layer we need.
Even when organizations adopt modern data warehouses, the same problems often reappear if transformations remain unmanaged.
Below are common anti-patterns that may seem harmless individually, but together create the conditions for the SQL jungle. When business logic is fragmented, pipelines are fragile, and dependencies are undocumented, onboarding new engineers is slow and systems become difficult to maintain and evolve.
One of the most common problems is business logic moving into the BI layer. Think about “calculating revenue in a Tableau dashboard”.
At first this seems convenient since analysts can quickly build calculations without waiting for engineering support. In the long run, however, this leads to several issues:
Instead of being centralized, business logic becomes fragmented across visualization tools. A healthy architecture keeps business logic in the transformation layer, not in dashboards.
Another common anti-pattern is writing extremely large SQL queries that perform many transformations at once: cleaning, joining, applying business rules, and aggregating, all in a single statement.
These queries quickly become difficult to read, debug, reuse, and maintain. Each model should ideally have a single responsibility: break transformations into small, composable models to increase maintainability.
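A first refactoring step, even before splitting into separate model files, is to break a monolithic query into named CTEs with one responsibility each. A sketch reusing the order and customer examples from earlier:

```sql
with orders as (              -- cleanup: later becomes stg_orders
    select order_id, customer_id, amount
    from raw.orders
),
customer_totals as (          -- aggregation: later becomes int_customer_orders
    select customer_id, sum(amount) as total_spent
    from orders
    group by customer_id
)
select c.customer_id, c.name, t.total_spent   -- presentation: later becomes customer_revenue
from customer_totals t
join raw.customers c using (customer_id)
```

Each CTE can then be promoted to its own model when it needs to be reused or tested independently.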
Avoid mixing transformation responsibilities, such as cleaning, business logic, and aggregation, within the same model.
Without separation between layers, pipelines become tightly coupled to raw source structures. To remedy this, introduce clear layers such as raw, staging, intermediate, and marts, as discussed earlier.
This helps isolate responsibilities and keeps transformations easier to evolve.
In many systems, data transformations run without any form of validation. Pipelines execute successfully even when the resulting data is incorrect.
Introducing automated data tests helps detect issues like duplicate primary keys, unexpected null values and broken relationships between tables before they propagate into reports and dashboards.
One of the most fragile patterns is modifying SQL directly inside the production warehouse. This causes many problems: changes are unreviewed, untested, and leave no history to roll back to.
In a good transformation layer, transformations are treated as version-controlled code, allowing changes to be reviewed and tested before deployment.
Not every data platform needs a fully structured transformation framework from day one. In small systems, a handful of SQL queries may be perfectly manageable.
However, as the number of datasets and transformations grows, unmanaged SQL logic tends to accumulate. At some point the system becomes difficult to understand, maintain, and evolve.
There are several signs that your organization may be reaching this point: metrics that differ between dashboards, changes that feel risky, dependencies nobody can trace, and knowledge concentrated in a handful of engineers.
When these symptoms begin to appear, it is usually time to introduce a structured transformation layer. Frameworks like dbt or SQLMesh are designed to help teams introduce this structure while preserving the flexibility that modern data warehouses provide.
Modern data warehouses have made working with data faster and more accessible by shifting from ETL to ELT. Analysts can now transform data directly in the warehouse using SQL, which greatly improves iteration speed and reduces dependence on complex engineering pipelines.
But this flexibility comes with a risk. Without structure, transformations quickly become fragmented across scripts, dashboards, notebooks, and scheduled queries. Over time this leads to duplicated business logic, unclear dependencies, and systems that are difficult to maintain: the SQL jungle.
The solution is to introduce engineering discipline into the transformation layer. By treating SQL transformations as maintainable software components — version controlled, modular, tested, and documented — organizations can build data platforms that remain understandable as they grow.
Frameworks like dbt or SQLMesh can help implement this structure, but the most important change is adopting the underlying principle: managing analytical transformations with the same discipline we apply to software systems.
With this we can create a data platform where business logic is transparent, metrics are consistent, and the system remains understandable even as it grows. When that happens, the SQL jungle turns into something far more valuable: a structured foundation that the entire organization can trust.
I hope this article was as clear as I intended it to be, but if not, please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.
Happy coding!
— Mike