What if your team of ten could run a data vault like the best in the industry?
The architecture review meeting is one of the most predictable scenes in enterprise data. The data engineering lead has done his homework. He wants a proper integration layer: historisation, auditability, parallel loading, the ability to bring on new sources without rewriting everything that already exists. He is right on the merits. Nobody in the room disputes that.
The head of analytics is sceptical. She tried to hire a specialist architect last quarter and the recruiter came back with three candidates, two of them unavailable and the third expensive enough to make the CFO flinch. The CTO has a team of ten and is thinking about what happens in month four when the person who designed the model goes on paternity leave and nobody else can read the satellite splits.
The meeting ends without a decision. Everyone agrees on what the platform should do, but nobody is confident the team can pull it off. The decision gets pushed to next month, and next month it gets pushed again, and eventually someone suggests they just use dbt and figure it out as they go.
The architecture itself was never the hard part. Data Vault, star schema, and lakehouse patterns each serve a different purpose in the stack, and experienced teams know how to combine them. The hard part is having a team that can design, build, and operate a proper integration layer at enterprise scale. That is where most organisations stall: not on which architecture to pick, but on whether they have the people to execute it.
Some teams skip the debate entirely. They build pipelines with as little integration as possible, precisely to avoid the complexity and maintenance overhead that a formal modelling approach requires. On the surface that looks pragmatic. In practice, it pushes the integration problem downstream to the business consumer: the analyst, the reporting team, the person building the dashboard who now has to reconcile conflicting definitions of "customer" or "revenue" across six different source extracts. The engineering team avoids the hard work upfront, and the people least equipped to deal with it inherit the mess.
Both paths lead to the same underlying problem. The architecture that would serve these organisations best is the one they are afraid to adopt, because the expertise required to build and operate it has historically been concentrated in a small number of very large, very well-funded organisations.
Where the expertise actually lives
Think about who runs a well-governed integration layer today at genuine scale. Banks with hundreds of source systems and dedicated centres of excellence. Insurance companies with regulatory requirements that make historisation mandatory. Global enterprises where the data platform underpins products consumed by millions of professionals, designed by architects who have spent decades refining how enterprise metadata should be structured, integrated, and governed.
Those teams have something that the ten-person team in the architecture review does not: accumulated context. Thousands of modelling decisions made over years, encoded in patterns that the team carries collectively. Which entity types work best for which source patterns. How to handle slowly changing dimensions in a satellite split. Where multi-active satellites make sense and where they create more problems than they solve. When to use a link versus a same-as link. How to structure bridge tables for query performance without sacrificing the audit trail.
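For readers who have not lived inside these structures, here is a minimal sketch of the three core raw vault objects that the patterns above are built from: a hub, a satellite, and a link. The table and column names are illustrative assumptions, not prescribed conventions.

```sql
-- Hub: one row per business key, insert-only, no descriptive attributes.
CREATE TABLE hub_customer (
    hub_customer_hkey  CHAR(32)     NOT NULL,   -- hash of the business key
    customer_bk        VARCHAR(50)  NOT NULL,   -- the business key itself
    load_dts           TIMESTAMP    NOT NULL,
    record_source      VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hkey)
);

-- Satellite: the historised, auditable attributes hanging off the hub.
CREATE TABLE sat_customer_details (
    hub_customer_hkey  CHAR(32)     NOT NULL,
    load_dts           TIMESTAMP    NOT NULL,
    hash_diff          CHAR(32)     NOT NULL,   -- change detection across the attribute set
    customer_name      VARCHAR(200),
    customer_segment   VARCHAR(50),
    customer_status    VARCHAR(20),
    record_source      VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hkey, load_dts)
);

-- Link: an insert-only relationship between hubs.
CREATE TABLE lnk_customer_account (
    lnk_customer_account_hkey  CHAR(32)     NOT NULL,
    hub_customer_hkey          CHAR(32)     NOT NULL,
    hub_account_hkey           CHAR(32)     NOT NULL,
    load_dts                   TIMESTAMP    NOT NULL,
    record_source              VARCHAR(100) NOT NULL,
    PRIMARY KEY (lnk_customer_account_hkey)
);
```

Every pattern listed above, satellite splits, multi-active satellites, same-as links, bridge tables, is a variation on these three shapes; the judgement lies in knowing which variation fits which source.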
That knowledge has been built up across hundreds of enterprise deployments worldwide, by the teams who did the hardest implementations in the most demanding environments. The problem is that until now, it stayed with those teams. You could read books and attend training courses, but the real expertise, the pattern recognition that comes from having seen it go wrong in forty different ways across forty different customers, lived in the heads of a few hundred senior practitioners scattered across the industry.
When the balance shifts
That is what AI changes. Not the architecture. The economics of who can execute it.
A platform that has processed the metadata of hundreds of enterprise deployments carries accumulated context that no individual team could build on their own. Encoded in that metadata are the patterns, the edge cases, the resolutions that worked and the ones that did not. When that context becomes available to an agentic system, the balance between the sophistication required and the size of the team shifts fundamentally.
The ten-person team does not need to have seen forty different implementations. They do not need to know from experience that a particular source pattern works better as a non-historised link than as a standard satellite. That knowledge exists in the accumulated metadata, and agents that reason from it can surface it at exactly the moment the team needs it.
The team still makes the decisions. The agent proposes, the architect reviews, adjusts, and approves. But the conversation in the room is different. Instead of starting from "do we even have the skills to attempt this?", it starts from "here is what the system suggests based on what has worked across similar deployments. Do we agree, or do we want to change it?"
What this looks like in practice
The natural question is: what does an agent-assisted workflow actually look like? Not in a pitch deck, but in the kind of decisions data teams face every week.
Consider the design phase. A team is modelling a new source and needs to decide which entities qualify as business entities; in a Data Vault those become hubs with satellites, grouped into hub groups for master data. That decision depends on the taxonomy: does the implementation happen at the Party level, or further down at Person and Organisation? The answer is different for every customer and every source, and getting it wrong means restructuring the vault later.
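As a hedged illustration of that taxonomy choice (the names are hypothetical, not a recommendation): integrating at the Party level means one hub and one business key definition, while integrating at Person and Organisation means separate hubs, each with its own satellites, typically grouped under a Party hub group.

```sql
-- Option A: integrate at the Party level; the subtype lives in a satellite attribute.
CREATE TABLE hub_party (
    hub_party_hkey  CHAR(32)     NOT NULL,
    party_bk        VARCHAR(50)  NOT NULL,
    load_dts        TIMESTAMP    NOT NULL,
    record_source   VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_party_hkey)
);

-- Option B: integrate one level down; Person and Organisation each get a hub
-- (and their own satellites), grouped into a Party hub group for master data.
CREATE TABLE hub_person (
    hub_person_hkey  CHAR(32)     NOT NULL,
    person_bk        VARCHAR(50)  NOT NULL,
    load_dts         TIMESTAMP    NOT NULL,
    record_source    VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_person_hkey)
);

CREATE TABLE hub_organisation (
    hub_organisation_hkey  CHAR(32)     NOT NULL,
    organisation_bk        VARCHAR(50)  NOT NULL,
    load_dts               TIMESTAMP    NOT NULL,
    record_source          VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_organisation_hkey)
);
```

Both are valid structures; which one holds up depends on how the sources identify people and organisations, and on how the business wants to master them.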
Then there are multi-active satellites. Sometimes they are the right answer, for genuinely multi-active data where an entity legitimately has multiple concurrent versions of the same attribute set. Other times they are a symptom of bad modelling, and a pre-staging step would solve the problem more cleanly. Knowing which situation you are in requires experience across many implementations, because the symptoms look similar and the consequences diverge sharply.
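A minimal sketch of the multi-active pattern, assuming a customer who legitimately holds several contact numbers at once; the extra key column in the primary key is what distinguishes it from a standard satellite.

```sql
-- Multi-active satellite: several concurrent rows per hub key and load date,
-- distinguished by a multi-active key (here: the contact type).
CREATE TABLE msat_customer_contact (
    hub_customer_hkey  CHAR(32)     NOT NULL,
    contact_type       VARCHAR(20)  NOT NULL,   -- 'mobile', 'work', 'home', ...
    load_dts           TIMESTAMP    NOT NULL,
    hash_diff          CHAR(32)     NOT NULL,
    phone_number       VARCHAR(30),
    record_source      VARCHAR(100) NOT NULL,
    PRIMARY KEY (hub_customer_hkey, contact_type, load_dts)
);
```

If the concurrency is an artefact of a messy extract rather than of the business, the cleaner fix is usually to deduplicate or pivot in pre-staging and keep a standard satellite.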
Or take the same-as link in the Business Vault: the mechanism for connecting matched customer records across sources with a matching probability. Getting this right requires understanding both the technical pattern and the business rules for identity resolution, which vary by industry, by regulator, and by how much ambiguity the organisation is willing to tolerate.
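A sketch of what that structure might look like, with illustrative names; the probability and the rule that produced it are carried on the link itself, so the resolution stays auditable.

```sql
-- Same-as link: connects two keys of the same hub that the matching logic
-- believes refer to the same real-world customer.
CREATE TABLE sal_customer (
    sal_customer_hkey      CHAR(32)      NOT NULL,
    hub_customer_hkey      CHAR(32)      NOT NULL,   -- surviving / master key
    hub_customer_dup_hkey  CHAR(32)      NOT NULL,   -- matched duplicate key
    match_probability      DECIMAL(5,4)  NOT NULL,   -- e.g. 0.9725
    match_rule             VARCHAR(100)  NOT NULL,   -- which identity-resolution rule fired
    load_dts               TIMESTAMP     NOT NULL,
    record_source          VARCHAR(100)  NOT NULL,
    PRIMARY KEY (sal_customer_hkey)
);
```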
These are the decisions that separate a good Data Vault from a fragile one. They are also exactly the kind of decisions where accumulated context across many deployments makes the difference. An agent that has seen how hundreds of teams resolved the same patterns can surface the relevant precedents at the moment the architect needs them. It does not make the call. It shows what has worked before, flags the tradeoffs, and lets the team decide.
On the downstream side, the same dynamic plays out. A data product owner describes what they need in business terms: specific attributes, specific filters, specific aggregation logic. The agent maps the request against the business model, traces which entities and physical columns are involved, flags where the request requires derived logic rather than a direct attribute pull, and generates the semantic view definition. The team reviews a structured proposal instead of building one from scratch.
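To make that concrete, here is the kind of artefact such a proposal might boil down to: a hedged sketch of a generated data product view over the hypothetical customer tables sketched earlier, counting active customers per segment. The filter and the aggregation are the derived logic the agent would flag for review.

```sql
CREATE VIEW dp_active_customers_by_segment AS
SELECT
    s.customer_segment,
    COUNT(*) AS active_customer_count            -- derived: an aggregation, not a stored attribute
FROM hub_customer h
JOIN sat_customer_details s
  ON  s.hub_customer_hkey = h.hub_customer_hkey
  AND s.load_dts = (SELECT MAX(s2.load_dts)      -- current satellite version only
                      FROM sat_customer_details s2
                     WHERE s2.hub_customer_hkey = s.hub_customer_hkey)
WHERE s.customer_status = 'ACTIVE'               -- the requested business filter
GROUP BY s.customer_segment;
```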
The humans still make every judgement call. What changes is that the assembly work, the cross-referencing, and the precedent research that used to consume most of the time now happen in minutes rather than weeks.
One of our customers is already doing something close to this on the downstream side. Their team took the metadata from our platform and started feeding it to an LLM to generate custom transformation templates for their data consumers. A use case that would have taken weeks of manual coding was built in two days. Not because the LLM was unusually capable, but because the metadata was rich enough to give it something real to reason from. The business keys, the entity relationships, the integration patterns: all accumulated context, all structured, all immediately useful to an AI that needed to understand what the data meant before it could write code that was correct.
What "standing on shoulders" actually means
The best data platform implementations in the world were not built in a week. They were built over years, by some of the strongest data architects in the industry, making thousands of decisions that refined their models into systems that serve the most demanding data environments on the planet.
We have seen what this looks like in practice, working with data teams at large insurance companies, global financial services firms, manufacturing enterprises, and media organisations. The patterns that worked, the decisions that had to be revised, the edge cases that only surface at scale: all of it feeds into the context that AI agents now carry into every new project. A ten-person team with the right agent-assisted platform can make modelling decisions informed by the same depth of experience that a fifty-person centre of excellence at a global bank took a decade to build.
That changes the calculation in the architecture review. The question is no longer "do we have the people who know how to do this?" It becomes "do we have the people who can review what the agent proposes and make good judgement calls?" That is a much more accessible bar. You still need smart engineers who understand data. You do not need them to have spent ten years exclusively inside Data Vault implementations.
A board advisor of ours put it well recently. He said the mistake is defining your target market by company size. What matters is IT sophistication: the team's ability to evaluate recommendations and make informed decisions, even if they have never built the thing themselves. A ten-person data team at a mid-sized insurer can be highly sophisticated. A hundred-person team at a multinational can be surprisingly unsophisticated if they have been running on manual processes and tribal knowledge for twenty years. The agent does not care how many people are on the team. It cares whether someone can look at a proposed satellite split and say "yes, that is right" or "no, change it, and here is why."
Why structure pays off twice
Methodologies that emphasise metadata, structured integration patterns, explicit business key definitions, and auditable historisation create precisely the kind of rich, governed context that AI agents need to perform well. A well-structured integration layer is, in effect, a knowledge graph waiting to be activated. The metadata is already there, the business decisions are already encoded, and the lineage is already traceable. All it needs is an agent layer that can read it, reason from it, and extend it.
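What "activating" that metadata can look like in practice: a hedged sketch of the kind of query an agent layer might run against a governed metadata store before proposing a change. The table and column names here are hypothetical, not any particular repository schema.

```sql
-- Everything the agent needs to reason about the Customer entity in one place:
-- which source columns feed which vault objects, which columns are business keys,
-- and which transformation rules (encoded business decisions) already exist.
SELECT
    m.business_entity,        -- e.g. 'Customer'
    m.target_object,          -- e.g. 'sat_customer_details'
    m.source_system,
    m.source_column,
    m.is_business_key,
    m.transformation_rule
FROM metadata_column_mapping m
WHERE m.business_entity = 'Customer'
ORDER BY m.target_object, m.source_column;
```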
Methodologies that skip the metadata, that favour speed over structure, produce schemas that are harder for agents to reason about. Less context, more ambiguity. The AI still works, but from a thinner foundation, which means more errors, more human intervention, more time spent fixing what the agent got wrong. And the teams that skipped integration altogether, building minimal pipelines and pushing reconciliation to the business users? They have the thinnest foundation of all. The AI has almost nothing to reason from, because nobody encoded the decisions in the first place.
The very thing that made a proper integration layer feel heavyweight (the upfront investment in structure and metadata) turns out to be what makes it the best foundation for AI-assisted data engineering. The discipline pays off twice: once in the quality of the platform, and again in the quality of the agent-assisted work that builds on top of it.
Recent research from Stanford's IRIS Lab (Lee et al., "Meta-Harness," March 2026) puts a number on this. They measured what happens when you change only the scaffolding around an AI model (the code that decides what context to store, what to retrieve, and how to present it) while keeping the model itself identical. The performance difference was 6x on the same benchmark. Not a marginal improvement from a better prompt. A structural gap driven entirely by the quality of the context.
For data vault programmes, the implication is straightforward. Two teams using the same underlying language model will get fundamentally different results depending on what context their agents can draw from. A team working from a governed knowledge graph that encodes business entity classifications, integration strategies, and domain boundaries will consistently outperform a team that points the same model at raw table structures and column names.
A reasonable question: can a good data engineer with access to a large language model build something equivalent? Partially, for a while. Source profiling, proposing a mapping, generating SQL for a handful of tables: an engineer with Claude or GPT can do this today. But scaling that across dozens of source systems, with governed audit trails, deterministic lineage, confidence scoring on every mapping, and human sign-off workflows, is a different problem entirely. The DIY approach starts from zero every time. A platform that carries the accumulated decisions from hundreds of enterprise deployments does not.
Back to the architecture review
The ten-person team. The sceptical analytics lead. The CTO doing the maths.
The answer is different now. You do not need a fifty-person centre of excellence or three specialist architects at a premium. You need a platform that carries the accumulated context of the organisations that already did the hard work, and agents that make that context available to your team, on your schedule, at your scale. What used to require rare, expensive expertise is becoming accessible to any team with the sophistication to review and decide.
The model-driven approach makes that possible, and the data products that follow are where the business sees the return.
VaultSpeed is running an early access programme for the agentic framework with selected enterprise customers across financial services, insurance, telecom, and manufacturing. If your organisation has an existing data vault investment and wants to explore what the agentic layer could deliver, we would welcome the conversation.

