The Evolution of Data Warehouse Architectures
Data Vault 2.0 is a modeling technique and methodology designed specifically for large, complex enterprise data warehouses. It addresses the limitations of traditional models by putting scalability, flexibility, and auditability at the center.
Why Data Vault 2.0?
Traditional data warehouses often become inflexible and expensive to maintain as requirements change. Data Vault 2.0 offers a solution through:
- Agile development: enables parallel development of independent model parts
- Incremental loading: no full reloads needed
- Traceability: a complete audit trail of all data
- Scalability: near-linear scaling as data volumes grow
- Resilience: fault-tolerant thanks to insert-only, restartable loads
The Data Vault 2.0 Architecture Layers
Staging Area (Landing Zone)
Function: raw data from source systems
Characteristics: 1:1 with the source, no transformations
Retention: 30-90 days (compliance)
Format: usually CSV, JSON, Parquet
Raw Data Vault (Core)
Function: historical data storage
Characteristics: hubs, links, satellites
Retention: full history
Format: normalized tables
Business Vault & Information Marts
Function: business transformations
Characteristics: business rules, PIT/bridge tables
Retention: driven by business requirements
Format: dimensional (star schema)
Data Vault vs Other Architectures
| Aspect | Data Vault 2.0 | Kimball (Dimensional) | Inmon (3NF) | Recommendation |
|---|---|---|---|---|
| Primary Goal | Agility & Scalability | Query Performance | Data Integration | Depends on requirements |
| Development Speed | ⭐⭐⭐⭐⭐ (Agile) | ⭐⭐⭐ (Iterative) | ⭐⭐ (Waterfall) | Data Vault for frequent changes |
| Query Performance | ⭐⭐ (via Business Vault) | ⭐⭐⭐⭐⭐ (Excellent) | ⭐⭐⭐ (Good) | Kimball for direct queries |
| Historical Management | ⭐⭐⭐⭐⭐ (Built-in) | ⭐⭐ (Type 2 SCD) | ⭐⭐⭐ (Good) | Data Vault wins on history |
| Traceability | ⭐⭐⭐⭐⭐ (Complete) | ⭐⭐ (Limited) | ⭐⭐⭐ (Reasonable) | Data Vault for compliance |
| Enterprise Scale | ⭐⭐⭐⭐⭐ (Linear) | ⭐⭐⭐ (Medium) | ⭐⭐⭐⭐ (Good) | Data Vault for large EDWs |
The Three Core Components of Data Vault
Hubs: The Business Keys
Definition & Purpose
- Contain unique business keys
- One hub per business entity
- Examples: Customer, Product, Order
- No descriptive attributes
Structure & SQL Example
CREATE TABLE hub_customer (
customer_hash_key VARCHAR(32) PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL, -- Business key
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,
UNIQUE (customer_id)
);
-- Hash key calculation (best practice):
-- an MD5 or SHA hash of the business key, e.g.
-- customer_hash_key = MD5('CUST123')
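The hash-key calculation above can be sketched in Python. This is a minimal illustration, not part of the SQL example; the normalization step (trimming and upper-casing the business key before hashing) is a common Data Vault convention assumed here, not something the DDL above enforces:

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """MD5 hash key over a normalized business key (common DV convention)."""
    normalized = business_key.strip().upper()  # normalize before hashing
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Same business key -> same hash key, regardless of casing or padding
print(hub_hash_key("CUST123"))
```

Because the key is normalized first, `' cust123 '` and `'CUST123'` land on the same hub row.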
Design Principles
- Use hash keys (not sequences)
- Business keys are immutable
- One hub per business concept
- Keep it simple: keys only
Links: The Relationships
Definition & Purpose
- Model many-to-many relationships
- Connect hubs to each other
- Examples: Order-Product, Customer-Account
- Provide transaction context
Structure & SQL Example
CREATE TABLE link_order_product (
order_product_hash_key VARCHAR(32) PRIMARY KEY,
order_hash_key VARCHAR(32) NOT NULL,
product_hash_key VARCHAR(32) NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,
FOREIGN KEY (order_hash_key) REFERENCES hub_order(order_hash_key),
FOREIGN KEY (product_hash_key) REFERENCES hub_product(product_hash_key),
UNIQUE (order_hash_key, product_hash_key)
);
-- Many-to-many: one order, multiple products
-- One link row per unique combination
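The composite hash key of a link can be sketched the same way. This is an illustrative Python snippet (not from the SQL above); the `'|'` delimiter is a common convention that prevents two different key combinations from concatenating to the same string:

```python
import hashlib

def link_hash_key(*parent_hash_keys: str) -> str:
    """Composite link hash key over the parent hub keys. The delimiter
    ensures e.g. ('AB','C') and ('A','BC') cannot collide."""
    return hashlib.md5("|".join(parent_hash_keys).encode("utf-8")).hexdigest()

order_hk = hashlib.md5(b"ORD42").hexdigest()    # hypothetical hub keys
product_hk = hashlib.md5(b"PROD7").hexdigest()
print(link_hash_key(order_hk, product_hk))
```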
Design Principles
- Links carry no business meaning of their own
- Use composite hash keys
- One link per business transaction type
- Model all possible relationships
Satellites: The Attributes
Definition & Purpose
- Contain descriptive attributes
- Track historical changes
- Examples: customer details, product info
- Comparable to Slowly Changing Dimensions Type 2
Structure & SQL Example
CREATE TABLE sat_customer_details (
customer_hash_key VARCHAR(32) NOT NULL,
load_date TIMESTAMP NOT NULL,
hash_diff VARCHAR(32) NOT NULL,
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
city VARCHAR(50),
country VARCHAR(50),
record_source VARCHAR(100) NOT NULL,
PRIMARY KEY (customer_hash_key, load_date),
FOREIGN KEY (customer_hash_key) REFERENCES hub_customer(customer_hash_key)
);
-- hash_diff = MD5 over all descriptive attributes
-- A new row is inserted only when something changed
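The hash_diff mechanism can be sketched in Python. This is an illustration with hypothetical attribute values; the delimiter and the explicit NULL marker are robustness conventions assumed here (plain concatenation would let `('ab','c')` and `('a','bc')` collide):

```python
import hashlib

def hash_diff(*attributes) -> str:
    """hash_diff over all descriptive attributes; delimiter and NULL
    marker prevent concatenation collisions."""
    parts = ["<NULL>" if a is None else str(a).strip() for a in attributes]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

old = hash_diff("Alice", "alice@example.com", "Amsterdam")
new = hash_diff("Alice", "alice@example.com", "Utrecht")
print(old != new)  # True: a changed attribute yields a new satellite row
```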
Design Principles
- Split satellites by change frequency
- Use hash_diff for change detection
- One satellite per subject area
- Keep descriptive data together
Advanced Data Vault Constructs
Point-in-Time (PIT) Tables
PIT tables optimize query performance by 'flattening' historical satellite data into snapshot intervals: for each snapshot date, the PIT row records which version of each satellite was current.
CREATE TABLE pit_customer (
customer_hash_key VARCHAR(32) NOT NULL,
snapshot_date DATE NOT NULL,
sat_customer_details_load_date TIMESTAMP,
sat_customer_address_load_date TIMESTAMP,
sat_customer_segment_load_date TIMESTAMP,
PRIMARY KEY (customer_hash_key, snapshot_date)
);
-- Usage in queries:
SELECT
c.customer_name,
a.city,
s.segment_name
FROM pit_customer p
JOIN sat_customer_details c
ON p.customer_hash_key = c.customer_hash_key
AND p.sat_customer_details_load_date = c.load_date
JOIN sat_customer_address a
ON p.customer_hash_key = a.customer_hash_key
AND p.sat_customer_address_load_date = a.load_date
WHERE p.snapshot_date = '2024-12-20';
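The core of PIT construction is picking, per hub key and per satellite, the latest load_date at or before the snapshot date. A minimal Python sketch of that selection logic (the load dates are made up for illustration):

```python
from datetime import date

def pit_entry(satellite_load_dates, snapshot_date):
    """For one hub key and one satellite: the latest load_date <= snapshot_date
    (the value stored in the PIT row), or None if no version exists yet."""
    candidates = [d for d in satellite_load_dates if d <= snapshot_date]
    return max(candidates) if candidates else None

loads = [date(2024, 1, 5), date(2024, 6, 1), date(2024, 12, 25)]
print(pit_entry(loads, date(2024, 12, 20)))  # 2024-06-01: current version at the snapshot
```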
Bridge Tables for Performance
Bridge tables denormalize complex many-to-many relationships for better query performance in the Business Vault.
CREATE TABLE bridge_customer_products (
customer_hash_key VARCHAR(32),
product_hash_key VARCHAR(32),
first_purchase_date DATE,
last_purchase_date DATE,
total_purchases INT,
total_amount DECIMAL(10,2),
PRIMARY KEY (customer_hash_key, product_hash_key)
);
-- Population from the Raw Vault (both links are needed to
-- connect customers to products via their shared orders):
INSERT INTO bridge_customer_products
SELECT
lco.customer_hash_key,
lop.product_hash_key,
MIN(s.load_date) AS first_purchase_date,
MAX(s.load_date) AS last_purchase_date,
COUNT(*) AS total_purchases,
SUM(s.sales_amount) AS total_amount
FROM link_customer_order lco
JOIN link_order_product lop ON lco.order_hash_key = lop.order_hash_key
JOIN sat_order_details s ON lco.order_hash_key = s.order_hash_key
GROUP BY lco.customer_hash_key, lop.product_hash_key;
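The aggregation behind the bridge table can be sketched in Python. The rows below are hypothetical (customer, product, load_date, sales_amount) tuples as they would come out of joining the links with the order satellite:

```python
# Hypothetical joined rows: (customer_hash_key, product_hash_key, load_date, amount)
rows = [
    ("c1", "p1", "2024-01-05", 10.0),
    ("c1", "p1", "2024-03-02", 25.0),
    ("c1", "p2", "2024-02-10", 7.5),
]

bridge = {}
for cust, prod, load_date, amount in rows:
    key = (cust, prod)
    if key not in bridge:
        bridge[key] = {"first": load_date, "last": load_date, "count": 0, "total": 0.0}
    b = bridge[key]
    b["first"] = min(b["first"], load_date)   # MIN(load_date)
    b["last"] = max(b["last"], load_date)     # MAX(load_date)
    b["count"] += 1                            # COUNT(*)
    b["total"] += amount                       # SUM(sales_amount)

print(bridge[("c1", "p1")])  # {'first': '2024-01-05', 'last': '2024-03-02', 'count': 2, 'total': 35.0}
```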
Data Vault 2.0 ETL Patterns
Staging Load Pattern
-- 1. Truncate staging (depending on the load strategy)
TRUNCATE TABLE stg_customers;
-- 2. Load from source (CDC or full load)
INSERT INTO stg_customers
SELECT
customer_id,
customer_name,
email,
city,
CURRENT_TIMESTAMP AS load_date,
'CRM_SYSTEM' AS record_source
FROM source_crm.customers
WHERE last_modified > @last_load_date;
-- 3. Log metadata
INSERT INTO load_audit
VALUES ('CUSTOMERS', @row_count, CURRENT_TIMESTAMP);
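The incremental (CDC-style) filter in step 2 can be sketched in Python. All names here are illustrative, mirroring the SQL above, not a real connector API:

```python
from datetime import datetime

def incremental_batch(source_rows, last_load_date):
    """Pick only rows modified since the last load, stamping load_date
    and record_source (values illustrative, matching the SQL example)."""
    now = datetime.now()
    return [{**row, "load_date": now, "record_source": "CRM_SYSTEM"}
            for row in source_rows
            if row["last_modified"] > last_load_date]

rows = [{"customer_id": "C1", "last_modified": datetime(2024, 1, 1)},
        {"customer_id": "C2", "last_modified": datetime(2024, 6, 1)}]
batch = incremental_batch(rows, datetime(2024, 3, 1))
print([r["customer_id"] for r in batch])  # ['C2']
```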
Hub Load Pattern
-- 1. Calculate hash keys
WITH hashed_data AS (
SELECT
MD5(customer_id) AS customer_hash_key,
customer_id,
load_date,
record_source
FROM stg_customers
)
-- 2. Insert new hubs
INSERT INTO hub_customer (
customer_hash_key,
customer_id,
load_date,
record_source
)
SELECT
h.customer_hash_key,
h.customer_id,
h.load_date,
h.record_source
FROM hashed_data h
LEFT JOIN hub_customer hub
ON h.customer_hash_key = hub.customer_hash_key
WHERE hub.customer_hash_key IS NULL;
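The insert-only hub load above is an anti-join: keep staged rows whose hash key is not yet in the hub. A minimal Python sketch of that logic (which also deduplicates within the batch, something the SQL relies on distinct staging data for):

```python
def load_hub(existing_keys: set, staged_rows: list) -> list:
    """Insert-only hub load: the LEFT JOIN ... IS NULL anti-join,
    plus in-batch deduplication on the hash key."""
    new_rows, seen = [], set(existing_keys)
    for row in staged_rows:
        hk = row["customer_hash_key"]
        if hk not in seen:
            seen.add(hk)
            new_rows.append(row)
    return new_rows

staged = [{"customer_hash_key": "h1"},
          {"customer_hash_key": "h2"},
          {"customer_hash_key": "h1"}]  # duplicate within the batch
print(load_hub({"h1"}, staged))  # only h2 is new
```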
Satellite Load Pattern
-- 1. Calculate hash_diff
WITH staged_data AS (
SELECT
MD5(customer_id) AS customer_hash_key,
MD5(CONCAT_WS('|', COALESCE(customer_name, ''), COALESCE(email, ''), COALESCE(city, ''))) AS hash_diff, -- delimiter + COALESCE prevent concat collisions and NULL errors
customer_name,
email,
city,
load_date,
record_source
FROM stg_customers
),
-- 2. Find latest satellite data
latest_satellite AS (
SELECT
customer_hash_key,
hash_diff,
load_date
FROM sat_customer_details
QUALIFY ROW_NUMBER() OVER ( -- QUALIFY is Snowflake/Teradata syntax; use a subquery on other engines
PARTITION BY customer_hash_key
ORDER BY load_date DESC
) = 1
)
-- 3. Insert only changed records
INSERT INTO sat_customer_details
SELECT
s.customer_hash_key,
s.load_date,
s.hash_diff,
s.customer_name,
s.email,
s.city,
s.record_source
FROM staged_data s
LEFT JOIN latest_satellite ls
ON s.customer_hash_key = ls.customer_hash_key
WHERE ls.hash_diff IS NULL
OR s.hash_diff != ls.hash_diff;
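The delta detection in step 3 can be sketched in Python: load a staged row only when its hash_diff differs from the key's latest satellite row, or when the key has no satellite row yet. Names and diff values below are hypothetical:

```python
def satellite_delta(latest_by_key: dict, staged: list) -> list:
    """Keep staged rows whose hash_diff differs from the latest satellite
    row for that key (a missing key counts as changed)."""
    return [row for row in staged
            if latest_by_key.get(row["hash_key"]) != row["hash_diff"]]

latest = {"h1": "diff_a"}                               # latest hash_diff per hub key
staged = [{"hash_key": "h1", "hash_diff": "diff_a"},    # unchanged -> skipped
          {"hash_key": "h1", "hash_diff": "diff_b"},    # changed   -> loaded
          {"hash_key": "h2", "hash_diff": "diff_c"}]    # new key   -> loaded
print(satellite_delta(latest, staged))
```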
Business Vault & Information Marts
Retail Use Case: From Raw Vault to Data Mart
Business question: "What are the daily sales per product category per region?"
Business Vault Layer (Transformations)
CREATE TABLE bv_daily_sales AS
SELECT
DATE(s.load_date) AS sales_date,
p.product_hash_key,
c.customer_hash_key,
s.sales_amount,
s.quantity,
s.discount_amount
FROM sat_order_details s
JOIN link_order_product lop ON s.order_hash_key = lop.order_hash_key
JOIN hub_product p ON lop.product_hash_key = p.product_hash_key
JOIN link_customer_order lco ON s.order_hash_key = lco.order_hash_key
JOIN hub_customer c ON lco.customer_hash_key = c.customer_hash_key
WHERE s.load_date >= DATEADD('day', -30, CURRENT_DATE);
Information Mart Layer (Star Schema)
-- Dimension: Date
CREATE TABLE dim_date AS
SELECT DISTINCT
sales_date AS date_key,
EXTRACT(YEAR FROM sales_date) AS year,
EXTRACT(QUARTER FROM sales_date) AS quarter,
EXTRACT(MONTH FROM sales_date) AS month,
EXTRACT(DAY FROM sales_date) AS day,
DAYNAME(sales_date) AS day_name
FROM bv_daily_sales;
-- Dimension: Product (conformed)
CREATE TABLE dim_product AS
SELECT
p.product_hash_key AS product_key,
satp.product_name,
satp.category,
satp.brand,
satp.unit_price
FROM hub_product p
JOIN sat_product_details satp ON p.product_hash_key = satp.product_hash_key
WHERE satp.load_date = (
SELECT MAX(load_date)
FROM sat_product_details sp
WHERE sp.product_hash_key = p.product_hash_key
);
-- Fact: Daily Sales
CREATE TABLE fact_daily_sales AS
SELECT
d.date_key,
p.product_key,
c.customer_key,
SUM(s.sales_amount) AS total_sales,
SUM(s.quantity) AS total_quantity,
COUNT(*) AS transaction_count
FROM bv_daily_sales s
JOIN dim_date d ON s.sales_date = d.date_key
JOIN dim_product p ON s.product_hash_key = p.product_key
JOIN dim_customer c ON s.customer_hash_key = c.customer_key
GROUP BY d.date_key, p.product_key, c.customer_key;
Scalability & Performance Optimizations
Horizontal Scalability
- Hash-based distribution: even data spread across nodes
- Parallel loads: multiple sources at the same time
- Data partitioning: on load_date or hash key
- Sharding strategy: per business unit or region
- Elastic compute: auto-scaling resources
Query Performance
- PIT tables: fast historical queries
- Bridge tables: pre-joined relationships
- Materialized views: for common aggregations
- Columnstore indexes: for analytical queries
- Query federation: query the raw vault directly
Operational Efficiency
- Automated metadata: schema discovery
- Template-based ETL: generic load patterns
- Data lineage: automated documentation
- Error handling: robust failure recovery
- Monitoring: real-time performance metrics
Performance Metrics Benchmark
| Scenario | Data Volume | Traditional EDW | Data Vault 2.0 | Improvement |
|---|---|---|---|---|
| Schema Change | Add new attribute | 2-3 days (impact analysis) | 2-3 hours (add satellite) | 90% faster |
| New Source Integration | 50+ tables | 3-4 months | 4-6 weeks | 60% faster |
| Historical Query | 5 years of history | 45 seconds | 8 seconds (via PIT) | 82% faster |
| Data Lineage Trace | End-to-end | Manual (days) | Automated (minutes) | 99% faster |
Data Vault 2.0 Implementation Roadmap
Phase 1: Foundation & Pilot (Months 1-3)
Team Setup & Training
- Train the team in Data Vault principles
- Select a pilot use case
- Define tools and standards
- Establish a governance framework
Pilot Implementation
- Implement the staging area
- Build the first hubs, links, and satellites
- Create a simple Business Vault
- Test with a small dataset
Lessons Learned
- Document challenges
- Refine templates and patterns
- Establish performance baselines
- Plan the scaling strategy
Phase 2: Scaling & Automation (Months 4-9)
Enterprise Rollout
- Add the key data sources
- Implement conformed dimensions
- Build the enterprise Business Vault
- Integrate a data quality framework
Automation & DevOps
- Implement a CI/CD pipeline
- Automate metadata management
- Set up monitoring and alerting
- Implement data lineage
Performance Optimization
- Implement PIT and bridge tables
- Optimize query performance
- Implement a caching strategy
- Scale the infrastructure
Phase 3: Maturity & Innovation (Month 10+)
Advanced Capabilities
- Implement real-time streaming
- Add machine learning features
- Implement a data marketplace
- Enable self-service analytics
Continuous Improvement
- Monitor and optimize costs
- Implement FinOps practices
- Hold regular architecture reviews
- Stay current with new features
Data Vault 2.0 Tooling & Platform Support
Modeling Tools
- ER/Studio: native Data Vault support
- SQLDBM: cloud-based modeling
- Visual Paradigm: Enterprise edition
- SAP PowerDesigner: with add-ons
- Custom templates: Visio, draw.io
Platform Support
- Snowflake: excellent (via streams)
- Databricks: good (Delta Lake)
- Azure Synapse: good (PolyBase)
- Amazon Redshift: good (Spectrum)
- Google BigQuery: limited (schema evolution)
Automation Frameworks
- dbtvault: open source, for dbt
- WhereScape: commercial automation
- Matillion: cloud ETL with templates
- Custom scripts: Python, SQL templates
- Data Vault Automation: C#/.NET framework
Conclusion: Is Data Vault 2.0 Right for You?
When to Use Data Vault 2.0 (and When Not)
✅ Use Data Vault 2.0 when...
- Enterprise scale (> 100 TB of data)
- Many sources (> 10 systems)
- Frequent schema changes
- Audit and compliance requirements
- Agile development is needed
- Historical tracking is important
❌ Avoid Data Vault 2.0 when...
- Small dataset (< 1 TB)
- Few source systems
- Stable requirements
- Direct query performance is the priority
- Limited resources are available
- No historical tracking is needed
⚡ Hybrid Approach
- Data Vault for the integration layer
- Kimball for the presentation layer
- Data lake for raw storage
- Modern tools for automation
- Cloud-native architecture
Practical Recommendations
Start with a Pilot
- Choose a manageable use case
- Focus on learning, not perfection
- Invest in team training
- Document lessons learned
Invest in Automation
- Generic load patterns
- Metadata-driven approach
- CI/CD for data pipelines
- Self-service data discovery
Plan for Performance
- PIT tables from the start
- Regular performance tuning
- Monitor and optimize costs
- Architect for scale
Final thought: Data Vault 2.0 is no silver bullet, but it is a powerful methodology for complex enterprise data warehouses. It requires investment in training, tooling, and automation, but pays off in agility, scalability, and maintainability over the long term.