DataPartner365

Jouw partner voor datagedreven groei en inzichten

Data Vault 2.0: De Moderne Enterprise Data Warehouse Architectuur

Gepubliceerd: 20 december 2024
Leestijd: 14 minuten
Data Vault 2.0, Data Modeling, Enterprise Data Warehouse, Hubs Links Satellites, Agile Data Warehousing
Expert niveau: Intermediate tot Advanced

Een diepgaande gids voor Data Vault 2.0 methodology, ontwikkeld door Dan Linstedt. Leer hoe je enterprise data warehouses bouwt die schaalbaar, flexibel en agile zijn, met volledig historie beheer en audittrail.

De Evolutie van Data Warehouse Architecturen

Data Vault 2.0 is een modelleertechniek en methodology specifiek ontworpen voor grote, complexe enterprise data warehouses. Het lost de beperkingen op van traditionele modellen door schaalbaarheid, flexibiliteit en audittrail centraal te stellen.

Waarom Data Vault 2.0?

Traditionele data warehouses worden vaak inflexibel en duur om te onderhouden bij veranderingen. Data Vault 2.0 biedt een oplossing door:

  • Agile development: Parallelle ontwikkeling mogelijk
  • Incrementeel laden: Geen full reloads nodig
  • Traceability: Volledige audittrail van data
  • Scalability: Lineaire schaalbaarheid met data groei
  • Resilience: Fouttolerant door schema-on-write

De Data Vault 2.0 Architecture Lagen

Staging Area (Landing Zone)

Functie: Raw data van bronnen

Kenmerken: 1:1 met bron, geen transformaties

Retentie: 30-90 dagen (compliance)

Formaat: Meestal CSV, JSON, Parquet

Raw Data Vault (Core)

Functie: Historische data opslag

Kenmerken: Hubs, Links, Satellites

Retentie: Volledige historie

Formaat: Genormaliseerde tabellen

Business Vault & Information Marts

Functie: Business transformaties

Kenmerken: Business rules, PIT/Bridge

Retentie: Business requirements

Formaat: Dimensioneel (star schema)

Data Vault vs Andere Architecturen

Aspect Data Vault 2.0 Kimball (Dimensioneel) Inmon (3NF) Recommendatie
Primary Goal Agility & Scalability Query Performance Data Integration Afhankelijk van requirements
Development Speed ⭐⭐⭐⭐⭐ (Agile) ⭐⭐⭐ (Iteratief) ⭐⭐ (Waterfall) Data Vault voor snelle wijzigingen
Query Performance ⭐⭐ (via Business Vault) ⭐⭐⭐⭐⭐ (Excellent) ⭐⭐⭐ (Goed) Kimball voor directe queries
Historisch Beheer ⭐⭐⭐⭐⭐ (Ingebouwd) ⭐⭐ (Type 2 SCD) ⭐⭐⭐ (Goed) Data Vault wint op historie
Traceability ⭐⭐⭐⭐⭐ (Volledig) ⭐⭐ (Beperkt) ⭐⭐⭐ (Redelijk) Data Vault voor compliance
Enterprise Scale ⭐⭐⭐⭐⭐ (Lineair) ⭐⭐⭐ (Middel) ⭐⭐⭐⭐ (Goed) Data Vault voor grote EDW

De Drie Kern Componenten van Data Vault

Hubs: De Business Keys

1

Definitie & Purpose

  • Bevatten unieke business keys
  • Één hub per business entity
  • Voorbeelden: Customer, Product, Order
  • Geen descriptieve attributen
2

Structure & SQL Voorbeeld


CREATE TABLE hub_customer (
    customer_hash_key VARCHAR(32) PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,  -- Business key
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL,
    UNIQUE (customer_id)
);

-- Hash key berekening (best practice)
-- MD5 of SHA van business key
customer_hash_key = MD5('CUST123')
                  
3

Design Principles

  • Gebruik hash keys (niet sequentiële)
  • Business keys zijn immutable
  • Één hub per business concept
  • Keep it simple: alleen keys

Links: De Relaties

1

Definitie & Purpose

  • Modelleert many-to-many relaties
  • Koppelt hubs aan elkaar
  • Voorbeelden: Order-Product, Customer-Account
  • Transaction context
2

Structure & SQL Voorbeeld


CREATE TABLE link_order_product (
    order_product_hash_key VARCHAR(32) PRIMARY KEY,
    order_hash_key VARCHAR(32) NOT NULL,
    product_hash_key VARCHAR(32) NOT NULL,
    load_date TIMESTAMP NOT NULL,
    record_source VARCHAR(100) NOT NULL,
    FOREIGN KEY (order_hash_key) REFERENCES hub_order(order_hash_key),
    FOREIGN KEY (product_hash_key) REFERENCES hub_product(product_hash_key),
    UNIQUE (order_hash_key, product_hash_key)
);

-- Many-to-many: één order, meerdere products
-- Één link per unieke combinatie
                  
3

Design Principles

  • Links hebben geen business betekenis
  • Gebruik composite hash keys
  • Één link per business transaction
  • Modelleer alle mogelijke relaties

Satellites: De Attributen

1

Definitie & Purpose

  • Bevatten descriptieve attributen
  • Historische veranderingen
  • Voorbeelden: Customer details, Product info
  • Slowly Changing Dimensions Type 2
2

Structure & SQL Voorbeeld


CREATE TABLE sat_customer_details (
    customer_hash_key VARCHAR(32) NOT NULL,
    load_date TIMESTAMP NOT NULL,
    hash_diff VARCHAR(32) NOT NULL,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20),
    city VARCHAR(50),
    country VARCHAR(50),
    record_source VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hash_key, load_date),
    FOREIGN KEY (customer_hash_key) REFERENCES hub_customer(customer_hash_key)
);

-- Hash_diff = MD5 van alle attributen
-- Nieuwe rij alleen bij wijzigingen
                  
3

Design Principles

  • Split op change frequency
  • Gebruik hash_diff voor change detection
  • Één satellite per subject area
  • Keep descriptive data together

Geavanceerde Data Vault Constructies

Point-in-Time (PIT) Tables

PIT tables optimaliseren query performance door de historische satellite data te 'flatten' naar snapshot intervals.


CREATE TABLE pit_customer (
    customer_hash_key VARCHAR(32) NOT NULL,
    snapshot_date DATE NOT NULL,
    sat_customer_details_load_date TIMESTAMP,
    sat_customer_address_load_date TIMESTAMP,
    sat_customer_segment_load_date TIMESTAMP,
    PRIMARY KEY (customer_hash_key, snapshot_date)
);

-- Gebruik in queries:
SELECT 
    c.customer_name,
    a.city,
    s.segment_name
FROM pit_customer p
JOIN sat_customer_details c 
    ON p.customer_hash_key = c.customer_hash_key 
    AND p.sat_customer_details_load_date = c.load_date
JOIN sat_customer_address a
    ON p.customer_hash_key = a.customer_hash_key 
    AND p.sat_customer_address_load_date = a.load_date
WHERE p.snapshot_date = '2024-12-20';
          

Bridge Tables voor Performance

Bridge tables denormaliseren complexe many-to-many relaties voor betere query performance in business vault.


CREATE TABLE bridge_customer_products (
    customer_hash_key VARCHAR(32),
    product_hash_key VARCHAR(32),
    first_purchase_date DATE,
    last_purchase_date DATE,
    total_purchases INT,
    total_amount DECIMAL(10,2),
    PRIMARY KEY (customer_hash_key, product_hash_key)
);

-- Populatie vanuit Raw Vault:
INSERT INTO bridge_customer_products
SELECT 
    l.customer_hash_key,
    l.product_hash_key,
    MIN(s.load_date) AS first_purchase_date,
    MAX(s.load_date) AS last_purchase_date,
    COUNT(*) AS total_purchases,
    SUM(s.sales_amount) AS total_amount
FROM link_customer_order l
JOIN sat_order_details s ON l.order_hash_key = s.order_hash_key
GROUP BY l.customer_hash_key, l.product_hash_key;
          

Data Vault 2.0 ETL Patterns

Staging Load Pattern

-- 1. Truncate staging (afhankelijk van strategie)
TRUNCATE TABLE stg_customers;

-- 2. Load from source (CDC of full)
INSERT INTO stg_customers
SELECT 
    customer_id,
    customer_name,
    email,
    city,
    CURRENT_TIMESTAMP AS load_date,
    'CRM_SYSTEM' AS record_source
FROM source_crm.customers
WHERE last_modified > @last_load_date;

-- 3. Log metadata
INSERT INTO load_audit 
VALUES ('CUSTOMERS', @row_count, CURRENT_TIMESTAMP);
            
Hub Load Pattern

-- 1. Calculate hash keys
WITH hashed_data AS (
    SELECT 
        MD5(customer_id) AS customer_hash_key,
        customer_id,
        load_date,
        record_source
    FROM stg_customers
)
-- 2. Insert new hubs
INSERT INTO hub_customer (
    customer_hash_key,
    customer_id,
    load_date,
    record_source
)
SELECT 
    h.customer_hash_key,
    h.customer_id,
    h.load_date,
    h.record_source
FROM hashed_data h
LEFT JOIN hub_customer hub 
    ON h.customer_hash_key = hub.customer_hash_key
WHERE hub.customer_hash_key IS NULL;
            
Satellite Load Pattern

-- 1. Calculate hash_diff
WITH staged_data AS (
    SELECT 
        MD5(customer_id) AS customer_hash_key,
        MD5(CONCAT(customer_name, email, city)) AS hash_diff,
        customer_name,
        email,
        city,
        load_date,
        record_source
    FROM stg_customers
),
-- 2. Find latest satellite data
latest_satellite AS (
    SELECT 
        customer_hash_key,
        hash_diff,
        load_date
    FROM sat_customer_details
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_hash_key 
        ORDER BY load_date DESC
    ) = 1
)
-- 3. Insert only changed records
INSERT INTO sat_customer_details
SELECT 
    s.customer_hash_key,
    s.load_date,
    s.hash_diff,
    s.customer_name,
    s.email,
    s.city,
    s.record_source
FROM staged_data s
LEFT JOIN latest_satellite ls 
    ON s.customer_hash_key = ls.customer_hash_key
WHERE ls.hash_diff IS NULL 
    OR s.hash_diff != ls.hash_diff;
            

Business Vault & Information Marts

Retail Use Case: Van Raw Vault naar Data Mart

Business Vraag: "Wat zijn de dagelijkse verkopen per productcategorie per regio?"

1
Business Vault Layer (Transformations)

CREATE TABLE bv_daily_sales AS
SELECT 
    DATE(s.load_date) AS sales_date,
    p.product_hash_key,
    c.customer_hash_key,
    s.sales_amount,
    s.quantity,
    s.discount_amount
FROM sat_order_details s
JOIN link_order_product lop ON s.order_hash_key = lop.order_hash_key
JOIN hub_product p ON lop.product_hash_key = p.product_hash_key
JOIN link_customer_order lco ON s.order_hash_key = lco.order_hash_key
JOIN hub_customer c ON lco.customer_hash_key = c.customer_hash_key
WHERE s.load_date >= DATEADD('day', -30, CURRENT_DATE);
                  
2
Information Mart Layer (Star Schema)

-- Dimension: Date
CREATE TABLE dim_date AS
SELECT DISTINCT
    sales_date AS date_key,
    EXTRACT(YEAR FROM sales_date) AS year,
    EXTRACT(QUARTER FROM sales_date) AS quarter,
    EXTRACT(MONTH FROM sales_date) AS month,
    EXTRACT(DAY FROM sales_date) AS day,
    DAYNAME(sales_date) AS day_name
FROM bv_daily_sales;

-- Dimension: Product (conformed)
CREATE TABLE dim_product AS
SELECT 
    p.product_hash_key AS product_key,
    satp.product_name,
    satp.category,
    satp.brand,
    satp.unit_price
FROM hub_product p
JOIN sat_product_details satp ON p.product_hash_key = satp.product_hash_key
WHERE satp.load_date = (
    SELECT MAX(load_date) 
    FROM sat_product_details sp 
    WHERE sp.product_hash_key = p.product_hash_key
);

-- Fact: Daily Sales
CREATE TABLE fact_daily_sales AS
SELECT 
    d.date_key,
    p.product_key,
    c.customer_key,
    SUM(s.sales_amount) AS total_sales,
    SUM(s.quantity) AS total_quantity,
    COUNT(*) AS transaction_count
FROM bv_daily_sales s
JOIN dim_date d ON s.sales_date = d.date_key
JOIN dim_product p ON s.product_hash_key = p.product_key
JOIN dim_customer c ON s.customer_hash_key = c.customer_key
GROUP BY d.date_key, p.product_key, c.customer_key;
                  

Scalability & Performance Optimizations

Horizontale Schaalbaarheid
  • Hash-based distribution: Gelijkmatige data spread
  • Parallelle loads: Multiple sources tegelijk
  • Data partitioning: Op load_date of hash key
  • Sharding strategy: Per business unit of regio
  • Elastic compute: Auto-scaling resources
Query Performance
  • PIT tables: Snelle historische queries
  • Bridge tables: Pre-joined relationships
  • Materialized views: Voor common aggregaties
  • Columnstore indexes: Voor analytical queries
  • Query federation: Direct query op raw vault
Operationele Efficiëntie
  • Automated metadata: Schema discovery
  • Template-based ETL: Generieke load patterns
  • Data lineage: Automatische documentatie
  • Error handling: Robuuste foutafhandeling
  • Monitoring: Real-time performance metrics

Performance Metrics Benchmark

Scenario Data Volume Traditional EDW Data Vault 2.0 Improvement
Schema Change Add new attribute 2-3 dagen (impact analysis) 2-3 uren (add satellite) 90% sneller
New Source Integration 50+ tabellen 3-4 maanden 4-6 weken 60% sneller
Historical Query 5 jaar historie 45 seconden 8 seconden (via PIT) 82% sneller
Data Lineage Trace End-to-end Handmatig (dagen) Automatisch (minuten) 99% sneller

Data Vault 2.0 Implementatie Roadmap

Fase 1: Foundation & Pilot (Maand 1-3)

1
Team Setup & Training
  • Train team in Data Vault principles
  • Selecteer pilot use case
  • Defineer tools en standards
  • Stel governance framework op
2
Pilot Implementation
  • Implementeer staging area
  • Bouw eerste hubs, links, satellites
  • Maak eenvoudige business vault
  • Test met kleine dataset
3
Lessons Learned
  • Documenteer challenges
  • Refine templates en patterns
  • Stel performance baselines op
  • Plan scaling strategy

Fase 2: Scaling & Automation (Maand 4-9)

1
Enterprise Rollout
  • Voeg belangrijke bronnen toe
  • Implementeer conformed dimensions
  • Bouw enterprise business vault
  • Integreer data quality framework
2
Automation & DevOps
  • Implementeer CI/CD pipeline
  • Automatiseer metadata management
  • Stel monitoring en alerting op
  • Implementeer data lineage
3
Performance Optimization
  • Implementeer PIT en bridge tables
  • Optimaliseer query performance
  • Implementeer caching strategy
  • Scale infrastructure

Fase 3: Maturity & Innovation (Maand 10+)

1
Advanced Capabilities
  • Implementeer real-time streaming
  • Voeg machine learning features toe
  • Implementeer data marketplace
  • Enable self-service analytics
2
Continuous Improvement
  • Monitor en optimaliseer kosten
  • Implementeer FinOps practices
  • Regular architecture reviews
  • Stay current met nieuwe features

Data Vault 2.0 Tooling & Platform Support

Modeling Tools
  • ER/Studio: Native Data Vault support
  • SQLDBM: Cloud-based modeling
  • Visual Paradigm: Enterprise edition
  • SA Power Designer: Met add-ons
  • Custom templates: Visio, draw.io
Platform Support
  • Snowflake: Excellent (via streams)
  • Databricks: Good (Delta Lake)
  • Azure Synapse: Good (PolyBase)
  • Amazon Redshift: Good (spectrum)
  • Google BigQuery: Limited (schema evolution)
Automation Frameworks
  • dbtvault: Open source voor dbt
  • WhereScape: Commercial automation
  • Matillion: Cloud ETL met templates
  • Custom scripts: Python, SQL templates
  • Data Vault Automation: C#/.NET framework

Conclusie: Is Data Vault 2.0 Geschikt voor Jou?

Wanneer Data Vault 2.0 Wel/Niet Gebruiken

✅ Gebruik Data Vault 2.0 als...
  • Enterprise scale (> 100 TB data)
  • Veel bronnen (> 10 systemen)
  • Frequente schema wijzigingen
  • Audit en compliance requirements
  • Agile development nodig
  • Historisch beheer belangrijk
❌ Vermijd Data Vault 2.0 als...
  • Kleine dataset (< 1 TB)
  • Enkele bron systemen
  • Stabiele requirements
  • Directe query performance prioriteit
  • Beperkte resources beschikbaar
  • Geen historie beheer nodig
⚡ Hybrid Approach
  • Data Vault voor integration layer
  • Kimball voor presentation layer
  • Data Lake voor raw storage
  • Modern tools voor automation
  • Cloud-native architecture

Praktische Aanbevelingen

1

Start met een Pilot

  • Kies een beheersbare use case
  • Focus op leren, niet perfectie
  • Investeer in team training
  • Documenteer lessons learned
2

Investeer in Automation

  • Generieke load patterns
  • Metadata-driven approach
  • CI/CD voor data pipelines
  • Self-service data discovery
3

Plan voor Performance

  • PIT tables vanaf begin
  • Regelmatige performance tuning
  • Monitor en optimaliseer kosten
  • Architect voor scale

Finale gedachte: Data Vault 2.0 is geen silver bullet, maar een krachtige methodology voor complexe enterprise data warehouses. Het vereist investering in training, tooling en automatisering, maar betaalt zich terug in agility, scalability en maintainability op de lange termijn.

Data Vault Experts Nodig?

Vind ervaren Data Vault Architects en Engineers voor je enterprise data warehouse project

Data Vault Expert

Over de auteur

DataPartner365 Architecture Team - Onze data architects zijn gecertificeerd in Data Vault 2.0 en hebben uitgebreide ervaring met enterprise data warehouse implementaties. We helpen organisaties met het ontwerpen en implementeren van schaalbare, agile data architecturen volgens best practices.