DataPartner365

Your partner for data-driven growth and insights

Delta Lake: The Foundation for Modern Data Lakehouses

Published: 20 December 2024
Reading time: 11 minutes
Delta Lake, Data Lakehouse, ACID Transactions, Apache Spark, Databricks, Data Reliability, Time Travel
Expertise level: Intermediate to Advanced

A complete guide to Delta Lake: the open-source storage framework that transforms data lakes with ACID transactions, schema enforcement, time travel and data reliability. Learn how to build enterprise-grade data lakehouses.

From Data Lake to Data Lakehouse

Delta Lake is an open-source storage layer that adds reliability to data lakes. It combines the scalability and cost efficiency of data lakes with the ACID guarantees and data management features of data warehouses.

The Problem with Traditional Data Lakes

Traditional data lakes (built on Parquet, ORC, Avro) suffer from several problems:

  • No ACID transactions: concurrent writes can corrupt data
  • No schema enforcement: dirty data ends up in the lake
  • No time travel: there is no way to go back to previous versions
  • Poor performance: no indexing, caching, or data skipping
  • Complex maintenance: manual vacuuming and compaction required

Delta Lake solves all of these problems with a transaction log and metadata management.

The 5 Core Features of Delta Lake

ACID Transactions

Atomicity: all-or-nothing commits

Consistency: schema enforcement

Isolation: serializable isolation level

Durability: guaranteed by the transaction log

Time Travel

Data versioning: full history

Audit trail: who did what, and when

Rollback: revert to earlier versions

Reproducibility: the exact data as of moment X

Schema Evolution

Schema enforcement: prevents dirty data

Schema evolution: add columns safely

Schema inference: derived automatically from the data

Data quality: validation constraints

Performance Optimizations

Data skipping: min/max statistics

Z-ordering: multi-dimensional clustering

Caching: Delta cache for fast reads

Compaction: automatic handling of small files

Streaming + Batch Unification

Single source: batch and streaming

Incremental processing: only new data

Change Data Feed: track changes

Unified pipeline: ETL and streaming

Delta Lake Architecture: How It Works

The Delta Lake Stack

Layer             | Component            | Function                  | Example
Storage           | Cloud object storage | Data persistence          | Azure Data Lake, Amazon S3
Transaction layer | Delta Log (JSON)     | ACID guarantees, metadata | _delta_log/ folder
Data format       | Parquet files        | Columnar storage          | .parquet files
Processing engine | Apache Spark         | Distributed processing    | Spark SQL, DataFrames
APIs              | Delta Lake APIs      | CRUD operations           | Python, Scala, SQL
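
Outside Databricks, Delta Lake is typically added to Spark via the delta-spark pip package. A minimal local-setup sketch (the package combination and the /tmp path are illustrative assumptions, not pinned requirements):


# pip install pyspark delta-spark  (the two versions must be compatible)
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("DeltaLocalSketch") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip attaches the matching Delta Lake jars
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Smoke test: write and read back a tiny Delta table
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/smoke_test")
spark.read.format("delta").load("/tmp/delta/smoke_test").show()

On Databricks this wiring is already built in, so the extra configuration only matters for self-managed Spark.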

The Delta Log: The Heart of Delta Lake

The Delta Log is a transaction log that records every change in JSON files:


/delta/customers/
├── _delta_log/
│   ├── 00000000000000000000.json  # Create table
│   ├── 00000000000000000001.json  # Insert 1000 rows
│   ├── 00000000000000000002.json  # Update 50 rows
│   └── 00000000000000000003.json  # Delete 10 rows
├── part-00000-*.parquet           # Data file v0
├── part-00001-*.parquet           # Data file v1
└── part-00002-*.parquet           # Data file v3

# Each JSON log entry contains actions such as:
{
  "add": {
    "path": "part-00000-xxx.parquet",
    "size": 123456,
    "modificationTime": 1609459200000,
    "dataChange": true,
    "stats": "{\"numRecords\":1000,\"minValues\":...}"
  }
}
          

Key insight: the Delta Log enables time travel by keeping every version on record. Maintenance operations such as VACUUM and OPTIMIZE also work through this log.
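
To inspect this log you can query the table history directly; a minimal sketch using the Python DeltaTable API against the example table above:


from delta.tables import DeltaTable

# Every committed transaction becomes one history row:
# version, timestamp, operation, operationParameters, operationMetrics, ...
delta_table = DeltaTable.forPath(spark, "/delta/customers")
delta_table.history() \
    .select("version", "timestamp", "operation", "operationParameters") \
    .show(truncate=False)

# The same information is available in SQL:
spark.sql("DESCRIBE HISTORY delta.`/delta/customers`").show(truncate=False)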

Practical Delta Lake Implementation

Step 1: Create a Delta Table

1. Python (PySpark)


from pyspark.sql import SparkSession
from delta import *

# Spark session with Delta configuration
spark = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions", 
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", 
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table
df = spark.createDataFrame([
    (1, "Alice", "Amsterdam", 50000),
    (2, "Bob", "Rotterdam", 60000),
    (3, "Charlie", "Utrecht", 55000)
], ["id", "name", "city", "salary"])

df.write.format("delta") \
    .mode("overwrite") \
    .save("/delta/customers")
                  
2. SQL (Spark SQL)


-- Create a Delta table with SQL
CREATE TABLE customers (
    id INT,
    name STRING,
    city STRING,
    salary DECIMAL(10,2)
) USING DELTA
LOCATION '/delta/customers';

-- Or convert an existing Parquet table
CONVERT TO DELTA parquet.`/data/customers`;

-- Check table details
DESCRIBE HISTORY customers;
DESCRIBE DETAIL customers;
                  

Step 2: CRUD Operations

1. Updates & Deletes (ACID)


-- Update (traditional data lakes cannot do this)
UPDATE customers 
SET salary = salary * 1.1 
WHERE city = 'Amsterdam';

-- Delete with a condition
DELETE FROM customers 
WHERE salary < 30000;

-- Merge/Upsert operation
MERGE INTO customers target
USING updates source
ON target.id = source.id
WHEN MATCHED THEN 
    UPDATE SET target.salary = source.salary
WHEN NOT MATCHED THEN 
    INSERT (id, name, city, salary) 
    VALUES (source.id, source.name, 
            source.city, source.salary);
                  
2. Time Travel Queries


-- Read a specific version
SELECT * FROM customers 
VERSION AS OF 2;

-- Read as of a timestamp
SELECT * FROM customers 
TIMESTAMP AS OF '2024-12-20 10:00:00';

-- Compare versions
SELECT * FROM 
(SELECT *, 'current' as version FROM customers) current
FULL OUTER JOIN
(SELECT *, 'previous' as version 
 FROM customers VERSION AS OF 2) previous
ON current.id = previous.id
WHERE current.salary != previous.salary 
   OR current.salary IS NULL 
   OR previous.salary IS NULL;

-- Restore to a previous version
RESTORE TABLE customers TO VERSION AS OF 2;
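
The same time-travel reads are available from the DataFrame API; a short PySpark sketch using the versionAsOf and timestampAsOf reader options (the version number and timestamp are simply the values from the SQL examples above):


# Read version 2 of the table
df_v2 = spark.read.format("delta") \
    .option("versionAsOf", 2) \
    .load("/delta/customers")

# Or read the table as of a timestamp
df_past = spark.read.format("delta") \
    .option("timestampAsOf", "2024-12-20 10:00:00") \
    .load("/delta/customers")

df_v2.show()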
                  

Step 3: Performance Optimizations

1. Z-Ordering & Compaction


-- OPTIMIZE: compact small files
-- (the WHERE clause may only reference partition columns,
--  e.g. if the table is partitioned by city)
OPTIMIZE customers
WHERE city = 'Amsterdam';

-- Z-ORDER: Co-locate related data
OPTIMIZE customers
ZORDER BY (city, id);

-- Auto compaction configuration (session-level settings)
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;

-- Check table metrics
DESCRIBE DETAIL customers;
-- numFiles, sizeInBytes, etc.
                  
2. Data Skipping & Indexing


-- Delta Lake collects min/max statistics automatically
-- No separate index is required

-- Check the query plan for data skipping
EXPLAIN SELECT * FROM customers 
WHERE city = 'Amsterdam' AND salary > 50000;

-- Bloom filter index (Databricks-specific syntax)
CREATE BLOOMFILTER INDEX ON TABLE customers
FOR COLUMNS(city OPTIONS (fpp = 0.1), id OPTIONS (fpp = 0.1));

-- Generated columns for performance
-- (generated columns must be defined when the table is created,
--  not added via ALTER TABLE; illustrative table)
CREATE TABLE customers_gen (
    id INT,
    name STRING,
    city STRING,
    salary DECIMAL(10,2),
    city_lower STRING GENERATED ALWAYS AS (LOWER(city))
) USING DELTA;
                  

Schema Management & Evolution

Schema Enforcement

# Strict schema enforcement (default)
spark.conf.set(
    "spark.databricks.delta.schema.autoMerge.enabled",
    "false"
)

# Try to write bad data
df_bad = spark.createDataFrame([
    (4, "Diana", "Den Haag", "high")  # salary as a string!
], ["id", "name", "city", "salary"])

# This fails with:
# AnalysisException: A schema mismatch detected...
df_bad.write.format("delta") \
    .mode("append") \
    .save("/delta/customers")
            
Schema Evolution

# Allow schema evolution globally
spark.conf.set(
    "spark.databricks.delta.schema.autoMerge.enabled",
    "true"
)

# Or per write
df_new = spark.createDataFrame([
    (5, "Erik", "Eindhoven", 65000, "erik@email.nl")
], ["id", "name", "city", "salary", "email"])

df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/customers")

# Manual schema changes (SQL)
spark.sql("ALTER TABLE customers ADD COLUMN department STRING AFTER city")

# Widening a column type requires the type widening table feature
# (or rewriting the table)
spark.sql("ALTER TABLE customers CHANGE COLUMN salary salary DECIMAL(12,2)")
            
Data Quality Constraints

-- Add constraints
ALTER TABLE customers 
ADD CONSTRAINT salary_positive 
CHECK (salary > 0);

ALTER TABLE customers 
ADD CONSTRAINT valid_city 
CHECK (city IN ('Amsterdam', 'Rotterdam', 
                'Utrecht', 'Den Haag', 
                'Eindhoven'));

-- A constraint violation makes the write fail
UPDATE customers 
SET salary = -100 
WHERE id = 1;
-- DeltaInvariantViolationException

-- Check constraints (exposed as delta.constraints.* table properties)
SHOW TBLPROPERTIES customers;

-- Drop constraint
ALTER TABLE customers 
DROP CONSTRAINT salary_positive;
            

Streaming & Change Data Capture

Real-time Data Pipelines with Delta Lake

1. Streaming Reads & Writes

# Read stream from Kafka
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "customers") \
    .load()

# Parse JSON
from pyspark.sql.functions import from_json, col
schema = "id INT, name STRING, city STRING, salary DECIMAL(10,2)"
parsed_df = stream_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Write to Delta Lake (streaming)
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/checkpoints/customers") \
    .start("/delta/customers_stream")

# Batch queries on the streaming table
spark.read.format("delta") \
    .load("/delta/customers_stream") \
    .createOrReplaceTempView("customers_stream")

spark.sql("SELECT COUNT(*) AS live_count FROM customers_stream").show()
                  
2. Change Data Feed (CDF)

-- Enable Change Data Feed
ALTER TABLE customers 
SET TBLPROPERTIES (
    delta.enableChangeDataFeed = true
);

-- Read changes (SQL)
SELECT * FROM table_changes('customers', 2, 4);

# Or via Python
spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 2) \
    .option("endingVersion", 4) \
    .load("/delta/customers") \
    .show()

# CDC for incremental processing
changes_df = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingTimestamp", "2024-12-20") \
    .load("/delta/customers")

# Keep only new and changed rows
changes_df.filter(
    (col("_change_type") == "insert") | 
    (col("_change_type") == "update_postimage")
).write.format("delta") \
    .mode("append") \
    .save("/delta/customers_silver")
                  

Delta Lake Medallion Architecture

Bronze → Silver → Gold

Bronze Layer (Raw)
  • Function: landing zone
  • Data: raw, immutable
  • Schema: as-is from the source
  • Storage: Delta tables
  • Example: JSON logs, CDC streams

-- Create bronze table
CREATE TABLE bronze_customers
USING DELTA
LOCATION '/delta/bronze/customers'
AS SELECT * FROM customers_raw;
              
Silver Layer (Cleaned)
  • Function: enriched data
  • Data: cleaned, validated
  • Schema: enterprise model
  • Storage: Delta tables
  • Example: merged, deduplicated

-- Clean and enrich
CREATE TABLE silver_customers
USING DELTA
LOCATION '/delta/silver/customers'
AS SELECT 
    id,
    TRIM(name) as name,
    UPPER(city) as city,
    CAST(salary AS DECIMAL(10,2)) as salary
FROM bronze_customers
WHERE id IS NOT NULL;
              
Gold Layer (Business)
  • Function: business-ready
  • Data: aggregated, modeled
  • Schema: star schema
  • Storage: Delta tables
  • Example: data marts, aggregates

-- Business aggregates
CREATE TABLE gold_customer_summary
USING DELTA
LOCATION '/delta/gold/customer_summary'
AS SELECT 
    city,
    COUNT(*) as customer_count,
    AVG(salary) as avg_salary,
    SUM(salary) as total_salary
FROM silver_customers
GROUP BY city;
              

Platform Integration & Ecosystem

Azure Integration
  • Azure Databricks: Native Delta support
  • Azure Synapse: Serverless SQL pools
  • Azure Data Factory: Copy activity
  • Power BI: DirectQuery on Delta
  • Azure Purview: Data lineage
AWS Integration
  • Databricks on AWS: S3 integration
  • AWS Glue: ETL jobs with Delta
  • Amazon Athena: Federated queries
  • EMR: Spark clusters with Delta
  • Redshift Spectrum: External tables
Tools & Frameworks
  • dbt: Delta adapter
  • Apache Airflow: Orchestration
  • Great Expectations: Data quality
  • MLflow: Experiment tracking
  • Delta Live Tables: Pipeline orchestration

Performance Comparison: Parquet vs. Delta Lake

Operation          | Parquet             | Delta Lake         | Improvement     | Reason
UPDATE query       | Rewrites full table | Only changed files | 10-100x faster  | Incremental processing
SELECT with filter | Full scan           | Data skipping      | 5-20x faster    | Min/max statistics
Concurrent writes  | Failures/corruption | ACID guaranteed    | 100% reliable   | Transaction log
Time travel query  | Not possible        | Instant            | N/A             | Delta Log tracking
Small file problem | Manual compaction   | Auto OPTIMIZE      | 90% less effort | Automatic management

Delta Lake Best Practices

Do's
  • Run OPTIMIZE regularly
  • Apply Z-ORDER to frequently filtered columns
  • Use a medallion architecture
  • Configure VACUUM retention carefully
  • Monitor table statistics
  • Use constraints for data quality
  • Enable Change Data Feed for CDC
Don'ts
  • Don't write directly to the underlying Parquet files
  • Don't let small files accumulate
  • Don't Z-ORDER on too many columns
  • Don't forget the checkpoint location when streaming
  • Don't disable schema enforcement
  • Don't set the VACUUM retention too short
  • Don't mix Delta and non-Delta tables
Monitoring & Maintenance
  • Monitor numFiles and sizeInBytes
  • Check query performance metrics
  • Review Delta Log growth
  • Set up alerts for failures
  • VACUUM old versions regularly (see the sketch below)
  • Document schema changes
  • Test time travel regularly
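
A minimal maintenance sketch along these lines (the 168-hour retention and the table path are illustrative assumptions):


from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/delta/customers")

# Compact small files (equivalent to SQL OPTIMIZE; available in recent Delta releases)
delta_table.optimize().executeCompaction()

# Remove data files no longer referenced by recent versions.
# 168 hours (7 days) is the default retention; shorter values limit time travel.
delta_table.vacuum(168)

# Monitor file count and total size
spark.sql("DESCRIBE DETAIL delta.`/delta/customers`") \
    .select("numFiles", "sizeInBytes") \
    .show()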

Conclusion & the Future of Data Lakehouses

When Should You Use Delta Lake?

✅ Perfect for...
  • ETL pipelines with updates/deletes
  • Real-time + batch unification
  • Data lakes that need ACID guarantees
  • Time travel & audit requirements
  • Schema evolution requirements
  • Lakehouse architecture
⚠️ Think twice for...
  • Append-only workloads
  • Very small datasets (< 1 GB)
  • Read-only access patterns
  • No Spark ecosystem
  • Simple file exports
🚀 Future Trends
  • Delta Lake 3.0: UniForm format
  • Delta Sharing: Open sharing
  • Delta Live Tables: Simplification
  • GPU acceleration
  • Serverless query engines

Implementation Checklist

1. Foundation Setup

  • Choose cloud storage (S3, ADLS, GCS)
  • Set up a Spark cluster with Delta
  • Configure medallion directories
  • Implement data governance
2. Pipeline Development

  • Build bronze → silver → gold pipelines
  • Implement schema management
  • Configure streaming workloads
  • Set up monitoring and alerting
3. Optimization & Scale

  • Implement OPTIMIZE schedules
  • Configure a Z-ORDER strategy
  • Set up auto-compaction
  • Monitor and tune performance

Final recommendation: start with a pilot project to test Delta Lake features. Focus first on ACID transactions and time travel, then on performance optimizations. Delta Lake transforms data lakes from "data swamps" into enterprise-grade data platforms.

Need Delta Lake Experts?

Find experienced Data Engineers and Spark Developers for your data lakehouse implementation


About the author

DataPartner365 Engineering Team - Our data engineers have extensive experience with Delta Lake implementations for enterprise clients. We help organizations build scalable, reliable data lakehouses on Delta Lake, Apache Spark and cloud platforms.