DataPartner365

Your partner for data-driven growth and insights

Delta Lake: The Foundation for Modern Data Lakehouses

Published: 20 December 2024
Reading time: 11 minutes
Delta Lake, Data Lakehouse, ACID Transactions, Apache Spark, Databricks, Data Reliability, Time Travel
Expertise level: Intermediate to Advanced

A complete guide to Delta Lake: the open-source storage framework that transforms data lakes with ACID transactions, schema enforcement, time travel and data reliability. Learn how to build enterprise-grade data lakehouses.

From Data Lake to Data Lakehouse

Delta Lake is an open-source storage layer that adds reliability to data lakes. It combines the scalability and cost efficiency of data lakes with the ACID guarantees and data management features of data warehouses.

The Problem with Traditional Data Lakes

Traditional data lakes (built on Parquet, ORC, Avro) suffer from several problems:

  • No ACID transactions: concurrent writes can corrupt data
  • No schema enforcement: dirty data ends up in the lake
  • No time travel: there is no way to go back to previous versions
  • Poor performance: no indexing, caching, or data skipping
  • Complex maintenance: manual vacuuming and compaction required

Delta Lake solves all of these problems with a transaction log and metadata management.

The 5 Core Features of Delta Lake

ACID Transactions

Atomicity: all-or-nothing commits

Consistency: schema enforcement

Isolation: serializable isolation level

Durability: guaranteed by the transaction log

Time Travel

Data versioning: full history

Audit trail: who did what, and when

Rollback: revert to earlier versions

Reproducibility: the exact data as of moment X

Schema Evolution

Schema enforcement: prevents dirty data

Schema evolution: add columns safely

Schema inference: derived automatically from the data

Data quality: validation constraints

Performance Optimizations

Data skipping: min/max statistics

Z-ordering: multi-dimensional clustering

Caching: Delta cache for fast reads

Compaction: automatic handling of small files

Streaming + Batch Unification

Single source: batch and streaming

Incremental processing: only new data

Change Data Feed: track changes

Unified pipeline: ETL and streaming

Delta Lake Architecture: How It Works

The Delta Lake Stack

Layer             | Component            | Function                  | Example
Storage           | Cloud object storage | Data persistence          | Azure Data Lake, Amazon S3
Transaction layer | Delta Log (JSON)     | ACID guarantees, metadata | _delta_log/ folder
Data format       | Parquet files        | Columnar storage          | .parquet files
Processing engine | Apache Spark         | Distributed processing    | Spark SQL, DataFrames
APIs              | Delta Lake APIs      | CRUD operations           | Python, Scala, SQL
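
Outside Databricks, Delta Lake is typically added to Spark via the delta-spark pip package. A minimal local-setup sketch (the package combination and the /tmp path are illustrative assumptions, not pinned requirements):


# pip install pyspark delta-spark  (the two versions must be compatible)
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("DeltaLocalSketch") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip attaches the matching Delta Lake jars
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Smoke test: write and read back a tiny Delta table
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/smoke_test")
spark.read.format("delta").load("/tmp/delta/smoke_test").show()

On Databricks this wiring is already built in, so the extra configuration only matters for self-managed Spark.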

The Delta Log: The Heart of Delta Lake

The Delta Log is a transaction log that records every change in JSON files:


/delta/customers/
├── _delta_log/
│   ├── 00000000000000000000.json  # Create table
│   ├── 00000000000000000001.json  # Insert 1000 rows
│   ├── 00000000000000000002.json  # Update 50 rows
│   └── 00000000000000000003.json  # Delete 10 rows
├── part-00000-*.parquet           # Data file v0
├── part-00001-*.parquet           # Data file v1
└── part-00002-*.parquet           # Data file v3

# Each JSON log entry contains actions such as:
{
  "add": {
    "path": "part-00000-xxx.parquet",
    "size": 123456,
    "modificationTime": 1609459200000,
    "dataChange": true,
    "stats": "{\"numRecords\":1000,\"minValues\":...}"
  }
}
          

Key insight: the Delta Log enables time travel by keeping every version on record. Maintenance operations such as VACUUM and OPTIMIZE also work through this log.
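
To inspect this log you can query the table history directly; a minimal sketch using the Python DeltaTable API against the example table above:


from delta.tables import DeltaTable

# Every committed transaction becomes one history row:
# version, timestamp, operation, operationParameters, operationMetrics, ...
delta_table = DeltaTable.forPath(spark, "/delta/customers")
delta_table.history() \
    .select("version", "timestamp", "operation", "operationParameters") \
    .show(truncate=False)

# The same information is available in SQL:
spark.sql("DESCRIBE HISTORY delta.`/delta/customers`").show(truncate=False)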

Practical Delta Lake Implementation

Step 1: Create a Delta Table

1. Python (PySpark)


from pyspark.sql import SparkSession
from delta import *

# Spark session with Delta configuration
spark = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions", 
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", 
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table
df = spark.createDataFrame([
    (1, "Alice", "Amsterdam", 50000),
    (2, "Bob", "Rotterdam", 60000),
    (3, "Charlie", "Utrecht", 55000)
], ["id", "name", "city", "salary"])

df.write.format("delta") \
    .mode("overwrite") \
    .save("/delta/customers")
                  
2. SQL (Spark SQL)


-- Create a Delta table with SQL
CREATE TABLE customers (
    id INT,
    name STRING,
    city STRING,
    salary DECIMAL(10,2)
) USING DELTA
LOCATION '/delta/customers';

-- Or convert an existing Parquet table
CONVERT TO DELTA parquet.`/data/customers`;

-- Check table details
DESCRIBE HISTORY customers;
DESCRIBE DETAIL customers;
                  

Step 2: CRUD Operations

1. Updates & Deletes (ACID)


-- Update (traditional data lakes cannot do this)
UPDATE customers 
SET salary = salary * 1.1 
WHERE city = 'Amsterdam';

-- Delete with a condition
DELETE FROM customers 
WHERE salary < 30000;

-- Merge/Upsert operation
MERGE INTO customers target
USING updates source
ON target.id = source.id
WHEN MATCHED THEN 
    UPDATE SET target.salary = source.salary
WHEN NOT MATCHED THEN 
    INSERT (id, name, city, salary) 
    VALUES (source.id, source.name, 
            source.city, source.salary);
                  
2. Time Travel Queries


-- Read a specific version
SELECT * FROM customers 
VERSION AS OF 2;

-- Read as of a timestamp
SELECT * FROM customers 
TIMESTAMP AS OF '2024-12-20 10:00:00';

-- Compare versions
SELECT * FROM 
(SELECT *, 'current' as version FROM customers) current
FULL OUTER JOIN
(SELECT *, 'previous' as version 
 FROM customers VERSION AS OF 2) previous
ON current.id = previous.id
WHERE current.salary != previous.salary 
   OR current.salary IS NULL 
   OR previous.salary IS NULL;

-- Restore to a previous version
RESTORE TABLE customers TO VERSION AS OF 2;
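
The same time-travel reads are available from the DataFrame API; a short PySpark sketch using the versionAsOf and timestampAsOf reader options (the version number and timestamp are simply the values from the SQL examples above):


# Read version 2 of the table
df_v2 = spark.read.format("delta") \
    .option("versionAsOf", 2) \
    .load("/delta/customers")

# Or read the table as of a timestamp
df_past = spark.read.format("delta") \
    .option("timestampAsOf", "2024-12-20 10:00:00") \
    .load("/delta/customers")

df_v2.show()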
                  

Step 3: Performance Optimizations

1. Z-Ordering & Compaction


-- OPTIMIZE: compact small files
-- (the WHERE clause may only reference partition columns,
--  e.g. if the table is partitioned by city)
OPTIMIZE customers
WHERE city = 'Amsterdam';

-- Z-ORDER: Co-locate related data
OPTIMIZE customers
ZORDER BY (city, id);

-- Auto compaction configuration (session-level settings)
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;

-- Check table metrics
DESCRIBE DETAIL customers;
-- numFiles, sizeInBytes, etc.
                  
2. Data Skipping & Indexing


-- Delta Lake collects min/max statistics automatically
-- No separate index is required

-- Check the query plan for data skipping
EXPLAIN SELECT * FROM customers 
WHERE city = 'Amsterdam' AND salary > 50000;

-- Bloom filter index (Databricks-specific syntax)
CREATE BLOOMFILTER INDEX ON TABLE customers
FOR COLUMNS(city OPTIONS (fpp = 0.1), id OPTIONS (fpp = 0.1));

-- Generated columns for performance
-- (generated columns must be defined when the table is created,
--  not added via ALTER TABLE; illustrative table)
CREATE TABLE customers_gen (
    id INT,
    name STRING,
    city STRING,
    salary DECIMAL(10,2),
    city_lower STRING GENERATED ALWAYS AS (LOWER(city))
) USING DELTA;
                  

Schema Management & Evolution

Schema Enforcement

# Strict schema enforcement (default)
spark.conf.set(
    "spark.databricks.delta.schema.autoMerge.enabled",
    "false"
)

# Try to write bad data
df_bad = spark.createDataFrame([
    (4, "Diana", "Den Haag", "high")  # salary as a string!
], ["id", "name", "city", "salary"])

# This fails with:
# AnalysisException: A schema mismatch detected...
df_bad.write.format("delta") \
    .mode("append") \
    .save("/delta/customers")
            
Schema Evolution

# Allow schema evolution globally
spark.conf.set(
    "spark.databricks.delta.schema.autoMerge.enabled",
    "true"
)

# Or per write
df_new = spark.createDataFrame([
    (5, "Erik", "Eindhoven", 65000, "erik@email.nl")
], ["id", "name", "city", "salary", "email"])

df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/customers")

# Manual schema changes (SQL)
spark.sql("ALTER TABLE customers ADD COLUMN department STRING AFTER city")

# Widening a column type requires the type widening table feature
# (or rewriting the table)
spark.sql("ALTER TABLE customers CHANGE COLUMN salary salary DECIMAL(12,2)")
            
Data Quality Constraints

-- Add constraints
ALTER TABLE customers 
ADD CONSTRAINT salary_positive 
CHECK (salary > 0);

ALTER TABLE customers 
ADD CONSTRAINT valid_city 
CHECK (city IN ('Amsterdam', 'Rotterdam', 
                'Utrecht', 'Den Haag', 
                'Eindhoven'));

-- A constraint violation makes the write fail
UPDATE customers 
SET salary = -100 
WHERE id = 1;
-- DeltaInvariantViolationException

-- Check constraints (exposed as delta.constraints.* table properties)
SHOW TBLPROPERTIES customers;

-- Drop constraint
ALTER TABLE customers 
DROP CONSTRAINT salary_positive;
            

Streaming & Change Data Capture

Real-time Data Pipelines with Delta Lake

1. Streaming Reads & Writes

# Read stream from Kafka
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "customers") \
    .load()

# Parse JSON
from pyspark.sql.functions import from_json, col
schema = "id INT, name STRING, city STRING, salary DECIMAL(10,2)"
parsed_df = stream_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Write to Delta Lake (streaming)
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/checkpoints/customers") \
    .start("/delta/customers_stream")

# Batch queries on the streaming table
spark.read.format("delta") \
    .load("/delta/customers_stream") \
    .createOrReplaceTempView("customers_stream")

spark.sql("SELECT COUNT(*) AS live_count FROM customers_stream").show()
                  
2. Change Data Feed (CDF)

-- Enable Change Data Feed
ALTER TABLE customers 
SET TBLPROPERTIES (
    delta.enableChangeDataFeed = true
);

-- Read changes (SQL)
SELECT * FROM table_changes('customers', 2, 4);

# Or via Python
spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 2) \
    .option("endingVersion", 4) \
    .load("/delta/customers") \
    .show()

# CDC for incremental processing
changes_df = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingTimestamp", "2024-12-20") \
    .load("/delta/customers")

# Keep only new and changed rows
changes_df.filter(
    (col("_change_type") == "insert") | 
    (col("_change_type") == "update_postimage")
).write.format("delta") \
    .mode("append") \
    .save("/delta/customers_silver")
                  

Delta Lake Medallion Architecture

Bronze → Silver → Gold

Bronze Layer (Raw)
  • Function: landing zone
  • Data: raw, immutable
  • Schema: as-is from the source
  • Storage: Delta tables
  • Example: JSON logs, CDC streams

-- Create bronze table
CREATE TABLE bronze_customers
USING DELTA
LOCATION '/delta/bronze/customers'
AS SELECT * FROM customers_raw;
              
Silver Layer (Cleaned)
  • Function: enriched data
  • Data: cleaned, validated
  • Schema: enterprise model
  • Storage: Delta tables
  • Example: merged, deduplicated

-- Clean and enrich
CREATE TABLE silver_customers
USING DELTA
LOCATION '/delta/silver/customers'
AS SELECT 
    id,
    TRIM(name) as name,
    UPPER(city) as city,
    CAST(salary AS DECIMAL(10,2)) as salary
FROM bronze_customers
WHERE id IS NOT NULL;
              
Gold Layer (Business)
  • Function: business-ready
  • Data: aggregated, modeled
  • Schema: star schema
  • Storage: Delta tables
  • Example: data marts, aggregates

-- Business aggregates
CREATE TABLE gold_customer_summary
USING DELTA
LOCATION '/delta/gold/customer_summary'
AS SELECT 
    city,
    COUNT(*) as customer_count,
    AVG(salary) as avg_salary,
    SUM(salary) as total_salary
FROM silver_customers
GROUP BY city;
              

Platform Integration & Ecosystem

Azure Integration
  • Azure Databricks: Native Delta support
  • Azure Synapse: Serverless SQL pools
  • Azure Data Factory: Copy activity
  • Power BI: DirectQuery on Delta
  • Azure Purview: Data lineage
AWS Integration
  • Databricks on AWS: S3 integration
  • AWS Glue: ETL jobs with Delta
  • Amazon Athena: Federated queries
  • EMR: Spark clusters with Delta
  • Redshift Spectrum: External tables
Tools & Frameworks
  • dbt: Delta adapter
  • Apache Airflow: Orchestration
  • Great Expectations: Data quality
  • MLflow: Experiment tracking
  • Delta Live Tables: Pipeline orchestration

Performance Comparison: Parquet vs. Delta Lake

Operation          | Parquet             | Delta Lake         | Improvement     | Reason
UPDATE query       | Rewrites full table | Only changed files | 10-100x faster  | Incremental processing
SELECT with filter | Full scan           | Data skipping      | 5-20x faster    | Min/max statistics
Concurrent writes  | Failures/corruption | ACID guaranteed    | 100% reliable   | Transaction log
Time travel query  | Not possible        | Instant            | N/A             | Delta Log tracking
Small file problem | Manual compaction   | Auto OPTIMIZE      | 90% less effort | Automatic management

Delta Lake Best Practices

Do's
  • Run OPTIMIZE regularly
  • Apply Z-ORDER to frequently filtered columns
  • Use a medallion architecture
  • Configure VACUUM retention carefully
  • Monitor table statistics
  • Use constraints for data quality
  • Enable Change Data Feed for CDC
Don'ts
  • Don't write directly to the underlying Parquet files
  • Don't let small files accumulate
  • Don't Z-ORDER on too many columns
  • Don't forget the checkpoint location when streaming
  • Don't disable schema enforcement
  • Don't set the VACUUM retention too short
  • Don't mix Delta and non-Delta tables
Monitoring & Maintenance
  • Monitor numFiles and sizeInBytes
  • Check query performance metrics
  • Review Delta Log growth
  • Set up alerts for failures
  • VACUUM old versions regularly (see the sketch below)
  • Document schema changes
  • Test time travel regularly
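
A minimal maintenance sketch along these lines (the 168-hour retention and the table path are illustrative assumptions):


from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/delta/customers")

# Compact small files (equivalent to SQL OPTIMIZE; available in recent Delta releases)
delta_table.optimize().executeCompaction()

# Remove data files no longer referenced by recent versions.
# 168 hours (7 days) is the default retention; shorter values limit time travel.
delta_table.vacuum(168)

# Monitor file count and total size
spark.sql("DESCRIBE DETAIL delta.`/delta/customers`") \
    .select("numFiles", "sizeInBytes") \
    .show()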

Conclusion & the Future of Data Lakehouses

When Should You Use Delta Lake?

✅ Perfect for...
  • ETL pipelines with updates/deletes
  • Real-time + batch unification
  • Data lakes that need ACID guarantees
  • Time travel & audit requirements
  • Schema evolution requirements
  • Lakehouse architecture
⚠️ Think twice for...
  • Append-only workloads
  • Very small datasets (< 1 GB)
  • Read-only access patterns
  • No Spark ecosystem
  • Simple file exports
🚀 Future Trends
  • Delta Lake 3.0: UniForm format
  • Delta Sharing: Open sharing
  • Delta Live Tables: Simplification
  • GPU acceleration
  • Serverless query engines

Implementation Checklist

1. Foundation Setup

  • Choose cloud storage (S3, ADLS, GCS)
  • Set up a Spark cluster with Delta
  • Configure medallion directories
  • Implement data governance
2. Pipeline Development

  • Build bronze → silver → gold pipelines
  • Implement schema management
  • Configure streaming workloads
  • Set up monitoring and alerting
3. Optimization & Scale

  • Implement OPTIMIZE schedules
  • Configure a Z-ORDER strategy
  • Set up auto-compaction
  • Monitor and tune performance

Final recommendation: start with a pilot project to test Delta Lake features. Focus first on ACID transactions and time travel, then on performance optimizations. Delta Lake transforms data lakes from "data swamps" into enterprise-grade data platforms.

Need Delta Lake Experts?

Find experienced Data Engineers and Spark Developers for your data lakehouse implementation


About the author

DataPartner365 Engineering Team - Our data engineers have extensive experience with Delta Lake implementations for enterprise clients. We help organizations build scalable, reliable data lakehouses on Delta Lake, Apache Spark and cloud platforms.