Every enterprise with an AI programme has heard the same advice: get your data foundation for AI in order. Clean it up. Govern it.
McKinsey reported this month that eight in ten companies cite data limitations as a roadblock to scaling agentic AI. That number does not surprise anyone who works in enterprise data. What surprises people is where, specifically, the foundation is breaking.
It is not ingestion. Getting data from source systems into a cloud platform is a solved problem. Fivetran, Airbyte, Kafka, native platform connectors. The plumbing between source and landing zone works. The consumption layer is in good shape too. Dashboards, semantic layers, BI tools, copilots have improved more in two years than in the previous ten.
The weakness is in the middle. The logic that takes distributed, multi-source records and turns them into an integrated, governed, historically complete view of the enterprise. That layer, the one that connects raw ingestion to usable data products, is where AI foundations fail.
What the transformation layer actually is
In a Medallion architecture, the transformation layer sits at Silver, between Bronze (raw ingestion) and Gold (data products, metrics, KPIs). Its job is integration: resolving conflicting definitions across source systems, enforcing consistent business key logic, maintaining historical records, and producing a single governed representation of the enterprise's data.
For organisations running Data Vault 2.0 at Silver, this means hubs, links, and satellites: a methodology designed to absorb new sources without redesigning existing structures, where historisation is the default rather than an afterthought.
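As a rough illustration of why this structure absorbs new sources cheaply, here is a minimal sketch of a hub and satellite in plain Python. It assumes the common Data Vault 2.0 convention of hashing a normalised business key; all table and source names are hypothetical, and a real implementation would of course live in SQL or a generation framework, not in dictionaries.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Deterministic hub hash key from normalised business key parts
    (trimmed, upper-cased, pipe-delimited) - a common DV 2.0 convention."""
    normalised = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode()).hexdigest()

# Hub: one row per unique business key, regardless of source system.
hub_customer = {}   # hash_key -> {"customer_no": ..., "record_source": ...}
# Satellite: descriptive attributes, historised by load timestamp + hashdiff.
sat_customer = []   # rows: (hash_key, load_ts, hashdiff, attributes)

def load_customer(customer_no: str, attrs: dict, record_source: str) -> None:
    hk = hash_key(customer_no)
    # The hub row is created once; later sources attach to the same key.
    hub_customer.setdefault(hk, {"customer_no": customer_no.strip().upper(),
                                 "record_source": record_source})
    hashdiff = hashlib.md5(repr(sorted(attrs.items())).encode()).hexdigest()
    latest = next((r for r in reversed(sat_customer) if r[0] == hk), None)
    if latest is None or latest[2] != hashdiff:   # insert only real changes
        sat_customer.append((hk, datetime.now(timezone.utc), hashdiff, attrs))

# "Customer" in the CRM and "client" in the policy system resolve to one hub:
load_customer("C-1001", {"name": "Acme"}, record_source="crm")
load_customer(" c-1001", {"name": "Acme Ltd"}, record_source="policy_admin")
```

Because the key logic is deterministic, a second source system lands in the existing hub without touching the first source's structures; history accumulates in the satellite as new hashdiffs arrive.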
The transformation layer is where business meaning meets technical metadata. Where someone decides that "customer" in the CRM and "client" in the policy system are the same entity, or different ones. Where a team defines how a Guidewire claim maps to the enterprise reporting structure. Where historisation rules determine whether you can answer a question about what was true six months ago.
When that layer is well-built, everything downstream works. AI agents have traceable lineage. Business users get consistent answers. Auditors can follow the logic from source to report. When it is built from hand-coded scripts that nobody fully documents, the entire data foundation is structurally compromised.
Why AI exposes the problem
AI did not create the transformation layer problem. It made the problem impossible to ignore.
Before AI, the consequences of poorly governed transformation logic were absorbed by human analysts. A finance team member would know, from experience, that the revenue figures in the operational report used a different definition than the ones in the board pack. They would adjust. They would add context. Tribal knowledge compensated for the structural gap.
AI does not carry tribal knowledge. Point an AI agent at a raw data vault schema and ask it to build a customer profitability report. It will find columns that look like customer identifiers and pick the first plausible match. It will join tables by inference, sometimes writing speculative transformation code to bridge gaps it does not understand. It will deliver an answer with complete confidence.
It will not check whether two source systems conflict on what "customer" means, or whether "region" refers to billing address or sales territory, or whether finance calculates profitability differently from operations.
The answer will look right. And nobody will know it is wrong until someone who carries the tribal knowledge happens to check.
This is not a hallucination problem. It is a context problem. The AI is doing exactly what you asked, reasoning over the data it can see. The problem is that what it can see is a set of tables and columns with no machine-readable record of what they mean, how they were derived, or which source wins when two sources disagree.
With AI, the cost of a poorly governed transformation layer stops being an operational drag and becomes a multiplier of errors at machine speed. An AI agent on a weak foundation does not produce slowly wrong answers. It produces rapidly wrong answers with high confidence. And it produces them at scale, across every team that touches the data, simultaneously. Each new AI use case builds on the same fragile foundation and discovers the gaps independently, because there is no shared record of what was decided and why.
Agentic AI makes this worse, not better. Agents that act autonomously across data sources need governed context even more than copilots that assist a human. Without it, they do not just suggest wrong answers. They execute on them.
That is transformation debt. And it compounds faster than any other form of technical debt, because it sits at the intersection of every business change, every platform migration, and every new regulatory requirement.
What data foundations for AI actually require
The phrase "AI-ready data" gets used loosely. It usually means clean, accessible, well-catalogued. Those are necessary conditions. They are not sufficient.
For the transformation layer to support AI reliably, four things need to be true.
Deterministic lineage
Every transformation from source to business product must be traceable, not reconstructed by parsing code after the fact, but derived from an explicit model that captures what was intended. When an AI agent asks "where did this number come from?", the answer should be a chain of governed steps, not a manual code inspection exercise.
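To make "a chain of governed steps" concrete, here is a minimal sketch of model-derived lineage as a queryable structure. The object names are invented for illustration; the point is only that the answer to "where did this come from" is a lookup, not a code archaeology project.

```python
# Hypothetical lineage records, derived from an explicit transformation model:
# each entry says which governed objects produced a given target.
LINEAGE = {
    "gold.customer_profitability": ["silver.sat_customer", "silver.sat_claim"],
    "silver.sat_customer":         ["bronze.crm_customers"],
    "silver.sat_claim":            ["bronze.guidewire_claims"],
}

def trace_upstream(obj: str) -> list:
    """Walk the lineage back to the sources, breadth-first,
    returning each governed step as 'source -> target'."""
    chain, frontier = [], [obj]
    while frontier:
        current = frontier.pop(0)
        for source in LINEAGE.get(current, []):
            chain.append(f"{source} -> {current}")
            frontier.append(source)
    return chain

for step in trace_upstream("gold.customer_profitability"):
    print(step)
```

An agent (or an auditor) querying this structure gets the full source-to-report chain deterministically, because the chain is stored, not inferred from whatever SQL happens to exist.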
Machine-readable business semantics
This is the one most teams think they have covered, and almost none actually do. AI agents do not read documentation pages. They need structured, queryable metadata: what each business entity means, how it relates to other entities, which source systems contribute to it, how conflicts between sources are resolved. YAML descriptions and markdown files are human-readable. They are not structured enough for a machine to reason over at enterprise scale.
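The difference between a documentation page and machine-readable semantics can be shown in a few lines. The sketch below is illustrative only; the entity definitions, source names, and resolution rule are invented, and a real system would hold this in a governed metadata store rather than a Python dictionary.

```python
# A structured, queryable entity definition - the kind of record an agent
# can reason over, in contrast to a prose documentation page.
SEMANTICS = {
    "customer": {
        "definition": "A party holding at least one active policy.",
        "sources": {"crm": "customers", "policy_admin": "clients"},
        "conflict_resolution": "policy_admin wins on legal name; crm on contact",
        "relates_to": {"claim": "one-to-many via customer_no"},
    },
}

def describe(entity: str) -> dict:
    """What an agent queries instead of guessing from column names."""
    meta = SEMANTICS[entity]
    return {
        "entity": entity,
        "definition": meta["definition"],
        "contributing_sources": sorted(meta["sources"]),
        "resolution_rule": meta["conflict_resolution"],
    }
```

An agent that can call `describe("customer")` no longer has to pick the first plausible identifier column; it is told which systems contribute and which one wins when they disagree.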
Source-change traceability
Enterprise data is not static. Source schemas change as ERPs get migrated, SaaS platforms update their APIs, and acquisitions bring in new systems. When a source changes, every downstream transformation, data product, and AI query that depends on it must be identifiable. If that traceability does not exist, a single source change can silently invalidate months of AI-generated analysis.
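Traceability of this kind is, mechanically, a graph traversal over explicit dependencies. A minimal sketch, with invented object names, of the impact query that should run the moment a source schema changes:

```python
# Forward dependency graph (who consumes what), illustrative names only.
CONSUMERS = {
    "bronze.crm_customers": ["silver.hub_customer", "silver.sat_customer"],
    "silver.hub_customer":  ["gold.customer_360"],
    "silver.sat_customer":  ["gold.customer_360", "gold.churn_features"],
    "gold.customer_360":    ["ai.profitability_agent"],
}

def impacted_by(changed_source: str) -> set:
    """Everything downstream that a schema change can silently invalidate."""
    impacted, frontier = set(), [changed_source]
    while frontier:
        for consumer in CONSUMERS.get(frontier.pop(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                frontier.append(consumer)
    return impacted

print(sorted(impacted_by("bronze.crm_customers")))
```

The traversal is trivial; the hard part, and the point of this section, is that the dependency edges have to exist as governed metadata in the first place. If they only exist implicitly in hand-written SQL, nobody can run this query.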
Governed integration logic
When two source systems describe the same business concept differently (different keys, different naming conventions, different update cadences), the resolution logic must be explicit and governed. Who decided which source wins? When was that decision made? Is it still valid? If the integration logic is buried in hand-written SQL that one person understands, the AI is building on sand. This is the requirement that makes people uncomfortable, because the honest answer in most enterprises is "we don't know."
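What "explicit and governed" integration logic means in practice is a survivorship rule that carries its own decision record. The sketch below is hypothetical (the attributes, owners, and dates are invented), but it shows the shape: resolving a conflict returns both the winning value and the provenance of the decision.

```python
from datetime import date

# An explicit, reviewable survivorship rule: which source wins per attribute,
# who decided, and when. All names and dates here are hypothetical.
SURVIVORSHIP = {
    ("customer", "legal_name"): {"winner": "policy_admin",
                                 "decided_by": "data-governance-board",
                                 "decided_on": date(2024, 3, 12)},
    ("customer", "email"):      {"winner": "crm",
                                 "decided_by": "data-governance-board",
                                 "decided_on": date(2024, 3, 12)},
}

def resolve(entity: str, attribute: str, candidates: dict):
    """Pick the governed winner, returning the value AND the decision record,
    so 'who decided, and when' is answerable alongside the value itself."""
    rule = SURVIVORSHIP[(entity, attribute)]
    return candidates[rule["winner"]], rule

value, rule = resolve("customer", "legal_name",
                      {"crm": "Acme", "policy_admin": "Acme Holdings Ltd"})
```

The same resolution buried in a hand-written `CASE` expression produces the same value, but the decision record (owner, date, validity) is exactly what gets lost, and exactly what an AI agent or an auditor needs.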
Where existing approaches fall short
Most of the content written about data foundations for AI focuses on infrastructure: cloud platforms, data lakes, governance tools, ingestion pipelines. That framing misses the structural issue.
Hand-built pipelines are the primary source of the problem. They create transformation debt by design. Business logic gets hard-coded into SQL scripts, stored procedures, and ETL jobs. Documentation drifts or never gets created. Tribal knowledge concentrates in a few people. Each new source system adds a layer nobody fully maps.
dbt has professionalised transformation code. Version control, testing, modularity, lineage: real improvements. But dbt is code-first. Logic lives in hand-written SQL. Business meaning is manually maintained in YAML files. When a source schema changes, every affected model must be manually identified and rewritten. dbt's lineage reflects what was written, not what was intended. For AI agents that need to understand business semantics, code-derived lineage is not enough.
Platform-native features (Snowflake Dynamic Tables, Databricks Delta Live Tables, Fabric Dataflows) reduce infrastructure complexity. They solve the "how do I run this" problem. They do not solve the "how do I know what this means, where it came from, and what happens when a source changes" problem. And they create platform lock-in for transformation logic that should be portable.
Data catalogs (Collibra, Alation, Atlan) govern and discover. They are essential. But catalogs document what exists. They do not design or generate the transformation logic. A catalog can tell you that a table exists and who owns it. It cannot tell you whether the transformation that produced it still reflects the current business rules, or whether a source change last month silently broke the lineage.
None of these approaches address the structural cause: transformation logic that is hand-coded, undocumented, person-dependent, and resistant to change.
The model-first alternative
There is a different architectural approach. Instead of writing transformation code and then trying to document and govern it after the fact, you start from an explicit model of the business (entities, relationships, mappings) and generate the code from that model.
Business meaning is captured in a structured, machine-readable conceptual model. Source metadata is harvested and mapped to business concepts. Transformation code, whether native SQL, Spark SQL, or dbt models, is generated from the model rather than written independently of it.
This inverts the relationship between logic and code. The model becomes the system of record. The code is a derived artifact, regenerable and auditable from the model that produced it.
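A deliberately toy sketch of the inversion: a mapping model as data, and SQL derived from it. The model structure, names, and generated dialect here are invented for illustration; real model-driven tools generate full Data Vault loading patterns, not a single INSERT.

```python
# A toy mapping model; in a model-first setup this, not the SQL,
# is the system of record. All names are illustrative.
MODEL = {
    "target": "silver.sat_customer",
    "source": "bronze.crm_customers",
    "columns": {"customer_no": "cust_id",      # target -> source
                "legal_name":  "name",
                "country":     "country_code"},
}

def generate_sql(model: dict, dialect: str = "snowflake") -> str:
    """Derive loading SQL from the model. The 'dialect' parameter is where
    platform-specific syntax would branch; retargeting the platform means
    regenerating, not rewriting."""
    cols = ",\n  ".join(f"{src} AS {tgt}"
                        for tgt, src in model["columns"].items())
    return (f"INSERT INTO {model['target']}\n"
            f"SELECT\n  {cols}\nFROM {model['source']};")

print(generate_sql(MODEL))
```

Because the SQL is a pure function of the model, a source rename is a one-line model change followed by regeneration, and the model itself remains the auditable record of intent.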
The trade-off is real: you have to build the model. That is not a trivial exercise. Mapping business entities across source systems, defining integration rules, getting domain experts and architects to agree on what "customer" means for the enterprise as a whole. This takes effort upfront that a code-first approach defers. The difference is that deferred effort comes back as transformation debt. The upfront effort becomes a persistent, compounding asset.
Here is what that means in practice. When a source schema changes, you do not discover the impact when downstream models break. The model detects the change, surfaces every affected object, and regenerates. When the target platform changes (Snowflake to Databricks, on-prem to cloud), the model stays the same and the code regenerates for the new platform. Migration becomes a controlled regeneration exercise, not a multi-year rewrite.
For AI, this changes the data foundation at its core. Lineage is deterministic and model-derived, reflecting what the business intended rather than what the code happens to do. Business semantics are structured and queryable by construction. Source changes propagate through the model automatically. And the accumulated context of years of modelling decisions becomes available to AI agents as a knowledge graph they can reason over, rather than tribal knowledge locked in a few people's heads.
The practical difference shows up in cost and quality simultaneously. We tested this directly: same data set, same AI analytics queries. First pass, raw tables with no lineage and no vault. The AI wrote speculative Python, joined large tables by inference, consumed heavy compute credits, and returned answers that looked correct but could not be verified. Second pass, the same data vaulted and backed by a governed semantic layer explaining how the data came to be. The AI returned the answer directly, from known definitions and confirmed join paths, with full traceability. Fewer tokens, fewer credits, verifiable results. Structured context is not just more accurate. It is cheaper to run.
What this looks like at enterprise scale
A large insurance company manages 36 source systems and approximately 2,500 data objects with a core team of four people using model-driven automation. In a hand-built environment, that scope typically requires 15 to 20 specialists. To put that concretely: four people are doing the work that would otherwise need a department. Not because they work harder, but because the model captures every mapping, every business key definition, every integration rule. That metadata is the foundation the team uses to onboard new sources, absorb regulatory changes, and maintain complete lineage. When regulators ask how a number was derived, the answer is already in the model.
A financial services organisation manages 17 source systems with a team of six: a complex multi-source landscape with frequent regulatory reporting obligations, where the model layer does the structural work that humans should not be doing by hand at that scale.
A global enterprise with 100+ business systems, continuous M&A activity, seven enterprise domains, and 11 live data products is running a two-year transformation programme. On schedule. Each acquisition's data gets onboarded through the model rather than hand-coded from scratch. That is the only reason the timeline holds. In a hand-built environment, the third acquisition would have broken the programme.
This last case is the one that tends to shift the conversation. Most data leaders have lived through an M&A integration that went sideways because nobody could map the acquired company's data into the existing warehouse fast enough. The model-first approach does not make M&A easy. It makes it tractable.
The model-first approach also extends into the two transitions that have historically been manual: source analysis before the transformation layer (profiling systems, identifying business keys, resolving naming conflicts) and governed data products after it (semantic layers, metrics, data contracts). AI agents grounded in the knowledge graph can propose initial mappings upstream and generate semantic definitions downstream, with human review and sign-off at every step. Production code generation stays deterministic. No language model in the code path.
The question to ask before your next AI initiative
Most enterprises evaluating AI on their data stack start with the model: which LLM, which agent framework, which copilot.
The better starting point is the transformation layer. How many people on your team can explain how your transformation logic works end to end? What happens when one of them leaves? When a source schema changed last quarter, how long did it take to trace the downstream impact? When your AI copilot gives a confident answer, can you verify the lineage from source to output?
If the answers are uncomfortable, the problem is not the AI. It is the data foundation underneath it. And the specific part of that foundation that matters most is the transformation layer.

