SCD Type 2 in Databricks: A Complete Guide

Last updated: 13 November 2025

Reading time: 15 minutes

SCD Type 2, Databricks, Delta Lake, Data Warehouse, History

Learn how to implement SCD Type 2 in Databricks to track full history in your data warehouse, from the basic concepts to an advanced implementation with Delta Lake.

What is SCD Type 2?

Slowly Changing Dimensions (SCD) Type 2 is a data warehouse technique for keeping the full history of dimension records. When an attribute changes, a new record is created instead of updating the existing one, so every historical version is preserved.

Why use SCD Type 2?

SCD Type 2 is essential for accurate historical reporting and analysis:

Complete history: Every version of a dimension record remains available
Point-in-time reporting: Reports can be run as of any historical moment
Compliance: Supports audit and regulatory requirements
Accurate analysis: Historical trends are analyzed against the values that applied at the time
Data integrity: No historical information is lost

Key Insight

SCD Type 2 matters most for dimensions that represent core business entities such as customers, products, or employees, where changes are meaningful for business intelligence and reporting.

SCD Types Overview

Type 0 - Static

Dimension attributes never change. The original values are always retained.

Use for: Date of birth, place of birth
No history is kept

Type 1 - Overwrite

Old values are overwritten. No history is kept.

Use for: Spelling corrections
No history is retained

Type 2 - Full History

Every change creates a new record with a start and end date. The full history is retained.

Use for: Customer address, product category
Complete history tracking
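
To make the difference between the types concrete, the sketch below applies the same city change under each of them. It is plain Python on dictionaries rather than Databricks code, and the function names, the row layout, and the half-open date convention (the old version ends on the day the new one starts) are assumptions made for illustration.

```python
from copy import deepcopy

OPEN_END = "9999-12-31"  # sentinel end date for the current version


def apply_type0(rows, key, new_city, change_date):
    """Type 0: ignore the change, the original value stays."""
    return deepcopy(rows)


def apply_type1(rows, key, new_city, change_date):
    """Type 1: overwrite in place, the old value is lost."""
    result = deepcopy(rows)
    for row in result:
        if row["customer_id"] == key:
            row["city"] = new_city
    return result


def apply_type2(rows, key, new_city, change_date):
    """Type 2: close the current version and append a new one."""
    result = deepcopy(rows)
    for row in result:
        if row["customer_id"] == key and row["is_current"]:
            # Expire the current version
            row["is_current"] = False
            row["end_date"] = change_date
            # Append the new open-ended version and stop iterating
            result.append({**row, "city": new_city, "start_date": change_date,
                           "end_date": OPEN_END, "is_current": True})
            break
    return result
```

Only `apply_type2` grows the row count: after one change the customer has two rows, one closed and one current.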

SCD Type 2 in Action: A Practical Example

Customer Dimension - History Tracking

A customer moves from Amsterdam to Rotterdam. Instead of updating the address, a new record is created:

CustomerSK	CustomerID	Name	City	StartDate	EndDate	IsCurrent
1001	CUST001	Jan Jansen	Amsterdam	2023-01-01	2023-06-14	false
1002	CUST001	Jan Jansen	Rotterdam	2023-06-15	9999-12-31	true

Result: We can now report on both the current situation (Rotterdam) and the historical one (Amsterdam).
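
Reporting on a historical situation comes down to picking the version whose validity range contains the reporting date. The sketch below does that in plain Python for the two rows of the table above; the `version_as_of` helper is a hypothetical name, and it assumes both StartDate and EndDate are inclusive, as the dates in the example suggest.

```python
from datetime import date

# The two versions of CUST001 from the example table
dim_customer = [
    {"customer_sk": 1001, "customer_id": "CUST001", "city": "Amsterdam",
     "start_date": date(2023, 1, 1), "end_date": date(2023, 6, 14)},
    {"customer_sk": 1002, "customer_id": "CUST001", "city": "Rotterdam",
     "start_date": date(2023, 6, 15), "end_date": date(9999, 12, 31)},
]


def version_as_of(rows, customer_id, as_of):
    """Return the version valid on `as_of` (both bounds inclusive), or None."""
    for row in rows:
        if (row["customer_id"] == customer_id
                and row["start_date"] <= as_of <= row["end_date"]):
            return row
    return None


print(version_as_of(dim_customer, "CUST001", date(2023, 3, 1))["city"])  # Amsterdam
print(version_as_of(dim_customer, "CUST001", date(2024, 1, 1))["city"])  # Rotterdam
```

In SQL the same lookup is a filter on the business key plus a range condition on the two date columns.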

SCD Type 2 Implementation in Databricks

Step 1: Dimension Table Structure

Create a dimension table with the required history columns:

Surrogate Key: Unique identifier for each record, that is, for each version
Business Key: Identifier from the source system
StartDate: Start of validity (effective_from in the code below)
EndDate: End of validity, 9999-12-31 for active records (effective_to in the code below)
IsCurrent: Boolean flag marking the active record
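
As a rough illustration of this column set, the sketch below models one dimension row as a Python dataclass. The class name `CustomerVersion`, the helper `first_version`, and the choice of a UUID string as surrogate key are assumptions for this example; the field names follow the customer columns used in the merge code.

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import uuid4

OPEN_END = datetime(9999, 12, 31)  # end of validity for the active record


@dataclass
class CustomerVersion:
    customer_sk: str          # surrogate key, unique per version
    customer_id: str          # business key from the source system
    customer_name: str
    city: str
    email: str
    effective_from: datetime  # start of validity
    effective_to: datetime    # end of validity, OPEN_END while current
    is_current: bool          # True only for the active record


def first_version(customer_id, customer_name, city, email, loaded_at):
    """Build the initial, open-ended version for a new customer."""
    return CustomerVersion(
        customer_sk=str(uuid4()),
        customer_id=customer_id,
        customer_name=customer_name,
        city=city,
        email=email,
        effective_from=loaded_at,
        effective_to=OPEN_END,
        is_current=True,
    )
```

Every later version of the same customer reuses `customer_id` but gets its own `customer_sk`, which is what lets fact tables point at one specific version.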

Step 2: Merge Logic

Implement the SCD Type 2 logic with the Databricks MERGE statement:

Databricks SQL - SCD Type 2 Merge

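-- SCD Type 2 implementation in Databricks
-- A source row in a MERGE is either matched (UPDATE) or not matched (INSERT),
-- never both. Changed customers are therefore staged twice: once under their
-- business key to expire the current version, and once under a NULL merge key
-- so that the new version reaches the INSERT clause.
MERGE INTO dim_customer AS target
USING (
  -- Every staged row, keyed on the business key
  SELECT s.customer_id AS merge_key,
         s.customer_id, s.customer_name, s.city, s.email
  FROM staging_customers s

  UNION ALL

  -- Changed customers once more, with a NULL key that never matches
  SELECT NULL AS merge_key,
         s.customer_id, s.customer_name, s.city, s.email
  FROM staging_customers s
  JOIN dim_customer d
    ON d.customer_id = s.customer_id AND d.is_current = true
  WHERE NOT (d.city <=> s.city AND d.email <=> s.email)
) AS source
ON target.customer_id = source.merge_key AND target.is_current = true

-- Expire the current version when a tracked attribute changed.
-- <=> is null-safe equality, so a change to or from NULL is detected too.
WHEN MATCHED AND NOT (
  target.city <=> source.city AND
  target.email <=> source.email
) THEN
  UPDATE SET
    target.is_current = false,
    target.effective_to = CURRENT_TIMESTAMP()

-- Insert brand-new customers and the new versions of changed customers
WHEN NOT MATCHED THEN
  INSERT (
    customer_sk, customer_id, customer_name, city, email,
    effective_from, effective_to, is_current
  )
  VALUES (
    UUID(), source.customer_id, source.customer_name,
    source.city, source.email, CURRENT_TIMESTAMP(),
    CAST('9999-12-31' AS TIMESTAMP), true
  )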

Step 3: Incremental Processing

Optimize for large datasets with incremental processing:

PySpark - Incremental SCD Type 2

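from pyspark.sql import functions as F
from delta.tables import DeltaTable

def scd_type_2_incremental(dim_table_path, updates_df):
    """
    Incremental SCD Type 2 processing for large datasets.

    updates_df must hold customer_id, customer_name, city and email,
    with at most one row per customer_id.
    """
    # Load dimension table (`spark` is the session a Databricks notebook provides)
    dim_table = DeltaTable.forPath(spark, dim_table_path)
    current_dim = dim_table.toDF().filter(F.col('is_current')).alias('dim')

    # Identify changed records with a null-safe comparison
    changed_records = (
        updates_df.alias('updates')
        .join(current_dim,
              F.col('updates.customer_id') == F.col('dim.customer_id'),
              'inner')
        .filter(
            ~F.col('updates.city').eqNullSafe(F.col('dim.city')) |
            ~F.col('updates.email').eqNullSafe(F.col('dim.email'))
        )
        .select('updates.*')
    )

    # A merge either updates or inserts a source row, never both. Stage every
    # update under its business key (to expire the current row) and every
    # changed record again under a NULL key (to insert the new version).
    key_type = updates_df.schema['customer_id'].dataType
    staged_updates = (
        updates_df.withColumn('merge_key', F.col('customer_id'))
        .unionByName(
            changed_records.withColumn('merge_key', F.lit(None).cast(key_type))
        )
    )

    # Perform merge operation
    dim_table.alias('target').merge(
        staged_updates.alias('source'),
        'target.customer_id = source.merge_key AND target.is_current = true'
    ).whenMatchedUpdate(
        condition='NOT (target.city <=> source.city AND target.email <=> source.email)',
        set={
            'is_current': 'false',
            'effective_to': 'current_timestamp()'
        }
    ).whenNotMatchedInsert(
        values={
            'customer_sk': 'uuid()',
            'customer_id': 'source.customer_id',
            'customer_name': 'source.customer_name',
            'city': 'source.city',
            'email': 'source.email',
            'effective_from': 'current_timestamp()',
            'effective_to': "CAST('9999-12-31' AS TIMESTAMP)",
            'is_current': 'true'
        }
    ).execute()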
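
One common way to narrow a batch down to the rows that actually changed, which the article itself does not show, is to compare a hash of the tracked attributes against a hash stored for the current version. The sketch below illustrates that idea in plain Python; `row_hash` and `changed_or_new` are hypothetical helper names, and in Spark the same comparison would be done with column expressions rather than a Python loop.

```python
import hashlib


def row_hash(row, columns):
    """Stable hash over the tracked columns; None and '' hash differently."""
    parts = ["\x00" if row.get(c) is None else f"v:{row[c]}" for c in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def changed_or_new(updates, current_hashes, columns):
    """Keep only updates whose hash differs from the stored hash for that key.

    current_hashes maps customer_id to the hash of its current version;
    customers that are missing from the mapping count as new.
    """
    return [u for u in updates
            if current_hashes.get(u["customer_id"]) != row_hash(u, columns)]
```

Only the rows returned by `changed_or_new` need to go through the merge, which keeps the expensive step proportional to the number of real changes.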

Best Practices for SCD Type 2

Recommended Practices
Consistent surrogate keys: Use UUIDs or identity columns
Effective dating: Use TIMESTAMP instead of DATE for precise tracking
Data layout: Delta Lake has no traditional indexes; speed up lookups on business key + is_current with Z-ordering or liquid clustering instead
Incremental processing: Process only changed records
Data quality checks: Validate that date ranges do not overlap
Partitioning: Partitioning on is_current can help queries that read only current records, but a partition column cannot also be a Z-order column, so choose one of the two for is_current

Common Challenges

Challenge	Solution	Impact
Performance at large volumes	Incremental processing + partitioning	Performance
Complex merge logic	Standardized templates	Maintainability
Data quality issues	Validation checks in the pipeline	Data integrity
Storage growth	Data retention policies	Cost control
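
A validation check for the data quality row could look like the sketch below. It is plain Python over a list of rows, with hypothetical function names, and it assumes half-open validity ranges in which one version's effective_to equals the next version's effective_from; in a pipeline the same two checks would typically run as queries against the dimension table.

```python
from collections import defaultdict


def find_overlaps(rows):
    """Return (customer_id, sk_a, sk_b) for consecutive versions that overlap.

    Versions are sorted by effective_from per customer, so any overlap
    shows up between at least one pair of neighbours.
    """
    by_key = defaultdict(list)
    for row in rows:
        by_key[row["customer_id"]].append(row)
    problems = []
    for key, versions in by_key.items():
        versions.sort(key=lambda r: r["effective_from"])
        for prev, nxt in zip(versions, versions[1:]):
            if nxt["effective_from"] < prev["effective_to"]:
                problems.append((key, prev["customer_sk"], nxt["customer_sk"]))
    return problems


def keys_without_single_current(rows):
    """Return customer_ids that do not have exactly one current row."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["customer_id"]] += int(row["is_current"])
    return sorted(k for k, n in counts.items() if n != 1)
```

Both functions return an empty list for a healthy dimension, so a pipeline step can fail the run whenever either result is non-empty.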

Advanced SCD Type 2 Patterns

1. Hybrid SCD Approach

Combine SCD Type 1 and Type 2 so that only important attributes create new versions, which limits history growth and keeps merges fast:

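-- Hybrid approach: Type 2 for important attributes, Type 1 for others
-- Only the first WHEN MATCHED clause whose condition holds is applied, so a
-- row with both kinds of change is handled as Type 2.
MERGE INTO dim_product target
USING source_data source
ON target.product_id = source.product_id AND target.is_current = true

WHEN MATCHED AND (
  -- Type 2 changes (important attributes)
  target.category <> source.category OR
  target.price_tier <> source.price_tier
) THEN
  -- This only expires the current version. A matched row cannot also be
  -- inserted, so the new version has to reach the INSERT clause separately,
  -- for example by staging changed rows in source_data a second time with a
  -- NULL value in the column used in the ON clause.
  UPDATE SET is_current = false, end_date = CURRENT_DATE()

WHEN MATCHED AND (
  -- Type 1 changes (minor attributes)
  target.description <> source.description
) THEN
  UPDATE SET description = source.description

WHEN NOT MATCHED THEN
  INSERT (product_sk, product_id, category, price_tier, description, ...)
  VALUES (UUID(), source.product_id, source.category, source.price_tier, ...)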
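
The routing decision behind the hybrid merge can be sketched as a small classifier. This is plain Python with hypothetical names (`classify_change`, `TYPE2_COLUMNS`, `TYPE1_COLUMNS`); it mirrors the clause order of the MERGE statement, in which a Type 2 change takes precedence over a Type 1 change.

```python
TYPE2_COLUMNS = ("category", "price_tier")  # changes create a new version
TYPE1_COLUMNS = ("description",)            # changes overwrite in place


def classify_change(current, incoming):
    """Decide how an incoming product row should be applied.

    `current` is the active dimension row for the product, or None when
    the product is not in the dimension yet.
    """
    if current is None:
        return "insert"
    if any(current[c] != incoming[c] for c in TYPE2_COLUMNS):
        return "type2"  # expire the current row and add a new version
    if any(current[c] != incoming[c] for c in TYPE1_COLUMNS):
        return "type1"  # overwrite the attribute on the current row
    return "unchanged"
```

Keeping the two column lists in one place makes it easy to review which attributes are allowed to generate history.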

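2. SCD Type 2 with Delta Lake Time Travel

Use Delta Lake features in Databricks to complement the SCD history. Time travel shows earlier states of the table itself and only reaches back as far as the table's retention settings allow, so it supplements SCD Type 2 rather than replacing it: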

-- Combine SCD Type 2 with Delta Time Travel
-- Query historical version of dimension
SELECT * FROM dim_customer VERSION AS OF 10;

-- Track changes between versions
DESCRIBE HISTORY dim_customer;

-- Restore previous version if needed
RESTORE TABLE dim_customer TO VERSION AS OF 5;

-- Optimize table for better performance
OPTIMIZE dim_customer ZORDER BY (customer_id, is_current);
