What is the difference between Apache Iceberg and Delta Lake?

Both are open table formats, but Iceberg is engine-agnostic (works with Spark, Flink, Trino, Snowflake and BigQuery simultaneously). Delta Lake was created by Databricks and has deeper Spark integration. Iceberg offers better partition evolution and is more vendor-neutral.

Which platforms support Apache Iceberg?

Apache Iceberg is supported by AWS (Athena, Glue, EMR), Google BigQuery, Snowflake, Databricks, Apache Spark, Apache Flink, Trino, Dremio and virtually all major cloud data platforms in 2026.

Apache Iceberg: The Open Table Standard for Data Lakes

Q: What is Apache Iceberg?

Apache Iceberg is an open table format for large analytical datasets. It adds ACID transactions, schema evolution, time travel and partition evolution on top of existing cloud storage such as S3, ADLS and GCS. It works with multiple query engines at the same time.

What is Apache Iceberg?

Apache Iceberg is an open-source table format for large analytical datasets in the cloud. It was originally developed at Netflix, with Apple among its early contributors, and is now an Apache top-level project supported by the major cloud vendors.

The core promise of Iceberg

Iceberg solves the fundamental problem of data lakes: how do you manage millions of files on S3 or ADLS as one coherent, reliable table — without locking yourself into a single query engine?

ACID transactions: Safe concurrent reads and writes
Time travel: Query data at any historical point
Schema evolution: Add columns without rewriting data
Partition evolution: Change partitioning without migration
Multi-engine: Use Spark, Flink, Trino and Snowflake on the same data at the same time

In 2026, Iceberg is no longer "new" — it has become the industry standard. AWS Athena, Google BigQuery, Snowflake and Databricks all support it natively. If you are building a new data platform today, Iceberg is the logical choice.

How Does Apache Iceberg Work Under the Hood?

Iceberg stores data as ordinary Parquet (or ORC/Avro) files, but adds a metadata layer on top that keeps track of everything.

Data Files

The actual data: Parquet, ORC or Avro files on S3, ADLS or GCS. These are standard files in ordinary object storage; Iceberg does not introduce a proprietary data file format.

Manifest Files

Lists of data files with statistics (min/max per column). Query engines use this for data pruning — skipping files that are not relevant.
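
The pruning step can be illustrated with a minimal sketch. It assumes a single integer column and a hypothetical DataFileEntry record holding that column's min/max bounds; real manifests store bounds for many columns in Avro files, so treat this as an illustration of the idea rather than Iceberg's actual reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical manifest entry: a data file plus min/max bounds for one column."""
    path: str
    lower_bound: int  # smallest order_id stored in the file
    upper_bound: int  # largest order_id stored in the file


def prune_files(entries, lo, hi):
    """Keep only files whose [lower_bound, upper_bound] range overlaps [lo, hi]."""
    return [e for e in entries if e.upper_bound >= lo and e.lower_bound <= hi]


manifest = [
    DataFileEntry("data/file-a.parquet", 1, 1000),
    DataFileEntry("data/file-b.parquet", 1001, 2000),
    DataFileEntry("data/file-c.parquet", 2001, 3000),
]

# Equivalent of: WHERE order_id BETWEEN 1500 AND 1600
to_scan = prune_files(manifest, 1500, 1600)
print([e.path for e in to_scan])
```

Only the file whose bounds overlap the filter range has to be opened; the other two are skipped using metadata alone.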

Manifest List

A snapshot of the table at a given moment: which manifest files belong to this version? This is what makes time travel possible.

Metadata File

The master file: schema, partitioning, snapshot history, statistics. This is the starting point for every query engine.

Snapshot-based Architecture

Every write (INSERT, UPDATE, DELETE) creates a new snapshot. Old snapshots are retained until you explicitly remove them with expire_snapshots(). This gives you:

Atomic commits: A write either succeeds fully or not at all
Concurrent reads: Readers always see a consistent snapshot
Time travel: Query any historical snapshot
Rollback: Revert a bad write in a single command
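
A toy model makes these four properties easier to see. The Table and Snapshot classes below are hypothetical and keep everything in memory; the point is only that each commit appends an immutable snapshot, and that time travel and rollback are lookups and pointer moves rather than data rewrites.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: frozenset  # data files visible in this snapshot


@dataclass
class Table:
    snapshots: list = field(default_factory=list)
    current_id: Optional[int] = None

    def commit(self, files, timestamp_ms):
        """Append a new immutable snapshot and make it the current one."""
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, frozenset(files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap

    def current(self):
        return next(s for s in self.snapshots if s.snapshot_id == self.current_id)

    def as_of(self, timestamp_ms):
        """Time travel: the latest snapshot committed at or before the timestamp."""
        older = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not older:
            raise ValueError("no snapshot exists at or before that time")
        return max(older, key=lambda s: s.timestamp_ms)

    def rollback_to(self, snapshot_id):
        """Rollback only moves the current pointer; no data files are touched."""
        if all(s.snapshot_id != snapshot_id for s in self.snapshots):
            raise ValueError("unknown snapshot id")
        self.current_id = snapshot_id


table = Table()
table.commit({"a.parquet"}, timestamp_ms=1000)
table.commit({"a.parquet", "b.parquet"}, timestamp_ms=2000)
table.commit({"b.parquet"}, timestamp_ms=3000)  # a bad write dropped a.parquet

print(sorted(table.as_of(2500).files))  # state as of t=2500
table.rollback_to(2)                    # undo the bad write
print(sorted(table.current().files))
```

Readers that resolved snapshot 2 before the bad write keep seeing the same file set, which is why concurrent reads stay consistent.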

Apache Iceberg vs Delta Lake vs Apache Hudi

There are three major open table formats. Here is the honest comparison:

Feature	Apache Iceberg	Delta Lake	Apache Hudi
Origin	Netflix / Apple	Databricks	Uber
ACID transactions	✅ Yes	✅ Yes	✅ Yes
Time travel	✅ Yes (snapshots)	✅ Yes (versions)	⚠️ Limited
Multi-engine support	✅ Best (Spark, Flink, Trino, Snowflake, BigQuery)	⚠️ Growing (via UniForm)	⚠️ Spark-focused
Partition evolution	✅ Yes (without migration)	❌ No	❌ No
Schema evolution	✅ Full	✅ Full	⚠️ Limited
Vendor neutrality	✅ Maximum	⚠️ Databricks-leaning	✅ Good
AWS native support	✅ Athena, Glue, EMR	⚠️ Via Databricks/EMR	✅ EMR
Snowflake support	✅ Native	⚠️ Via UniForm (limited)	❌ None

The Winner in 2026?

For multi-cloud or multi-engine environments, Iceberg is the clear winner thanks to its broad adoption. If you are fully on Databricks, Delta Lake is still perfectly fine. Hudi is strong for high-frequency upserts (CDC use cases).

The Most Powerful Iceberg Features Explained

1. Partition Evolution: Unique to Iceberg

This is the feature that sets Iceberg apart. With Delta Lake and Hudi you have to rewrite the entire table when you change partitioning. With Iceberg you can change the partitioning strategy without moving a single byte of data.

-- Start with partitioning by day
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  order_date TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_date));

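-- Later switch to hourly partitioning (without rewriting the table!)
-- Requires the Iceberg Spark SQL extensions; order_date_day is the
-- default name of the days(order_date) partition field
ALTER TABLE orders
REPLACE PARTITION FIELD order_date_day WITH hours(order_date);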

-- Iceberg writes new data in the new partition structure,
-- old data stays unchanged. Query engines understand both.
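
To see why old and new files can coexist, it helps to look at what a partition transform computes. The sketch below is an assumption-laden illustration: days and hours follow Iceberg's definition (whole days or hours since the Unix epoch), while the SPECS dictionary and partition_value helper are hypothetical stand-ins for the partition specs that table metadata tracks by spec id.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts):
    """Day transform: whole days since 1970-01-01 (UTC)."""
    return (ts - EPOCH).days


def hours(ts):
    """Hour transform: whole hours since 1970-01-01 00:00 (UTC)."""
    return int((ts - EPOCH).total_seconds() // 3600)


# Hypothetical spec registry: each data file remembers the spec id it was written with.
SPECS = {
    0: ("order_date_day", days),    # original spec
    1: ("order_date_hour", hours),  # spec after partition evolution
}


def partition_value(spec_id, ts):
    """Compute the partition value a row gets under a given spec."""
    name, transform = SPECS[spec_id]
    return {name: transform(ts)}


old_row = datetime(2026, 3, 20, 9, 30, tzinfo=timezone.utc)   # written before the change
new_row = datetime(2026, 3, 21, 14, 5, tzinfo=timezone.utc)   # written after the change

print(partition_value(0, old_row))
print(partition_value(1, new_row))
```

Because both values are derived from the same source column, an engine can translate a filter on order_date into either spec and plan each group of files separately.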

2. Time Travel & Rollback

Query historical data or recover accidentally deleted records:

-- Query yesterday's table (Spark SQL)
SELECT * FROM orders
TIMESTAMP AS OF '2026-03-20 09:00:00';

-- Query a specific snapshot
SELECT * FROM orders VERSION AS OF 4521;

-- Roll back to a previous snapshot
CALL catalog.system.rollback_to_snapshot('db.orders', 4521);

-- View snapshot history (metadata tables use the fully qualified table name)
SELECT * FROM db.orders.snapshots;

3. Schema Evolution Without Downtime

Add, rename or drop columns — safely and without rewriting data:

-- Add a column (takes effect immediately, no rewrite)
ALTER TABLE orders ADD COLUMN discount DECIMAL(5,2);

-- Rename a column
ALTER TABLE orders RENAME COLUMN amount TO order_amount;

-- Reorder a column
ALTER TABLE orders ALTER COLUMN discount AFTER order_amount;

-- Drop a column (data remains, but is no longer read)
ALTER TABLE orders DROP COLUMN legacy_flag;
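
These changes avoid rewrites because Iceberg identifies columns by a numeric field ID rather than by name or position. The following sketch models that idea with a hypothetical Schema class and rows stored as dictionaries keyed by field ID; actual data files are Parquet, ORC or Avro, so this only illustrates the lookup logic.

```python
class Schema:
    """Hypothetical schema: maps stable field ids to current column names."""

    def __init__(self, names):
        self.fields = {i: name for i, name in enumerate(names, start=1)}
        self.last_id = len(self.fields)  # ids are never reused, even after a drop

    def _id(self, name):
        return next(fid for fid, n in self.fields.items() if n == name)

    def add(self, name):
        self.last_id += 1
        self.fields[self.last_id] = name

    def rename(self, old, new):
        self.fields[self._id(old)] = new

    def drop(self, name):
        del self.fields[self._id(name)]


def read_rows(stored_rows, schema):
    """Project stored values onto the current schema; ids missing from a file read as None."""
    return [{name: row.get(fid) for fid, name in schema.fields.items()}
            for row in stored_rows]


schema = Schema(["order_id", "customer_id", "amount", "legacy_flag"])

# A data file written under the original schema, keyed by field id. It is never modified.
stored_rows = [{1: 1, 2: 101, 3: 99.95, 4: False}]

schema.add("discount")                   # new id 5, absent from old files
schema.rename("amount", "order_amount")  # id 3 keeps pointing at the same stored values
schema.drop("legacy_flag")               # id 4 stays in the file but is no longer read

result = read_rows(stored_rows, schema)
print(result)
```

The stored row never changes; only the metadata mapping does, which is why each ALTER TABLE statement above is a metadata-only operation.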

4. Row-level Deletes & Updates (Copy-on-Write vs Merge-on-Read)

Iceberg supports two strategies for updates and deletes:

-- Copy-on-Write (CoW): rewrites files on every update
-- → Faster reads, slower writes
-- Good for: batch updates, reporting tables

-- Merge-on-Read (MoR): stores delete/update markers
-- → Faster writes, reads require a merge
-- Good for: frequent CDC / streaming updates

-- MERGE INTO (upsert): supported in Spark and Trino
-- (assumes orders and updates both have a status column)
MERGE INTO orders t
USING updates s ON t.order_id = s.order_id
WHEN MATCHED AND s.status = 'cancelled' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.status = s.status
WHEN NOT MATCHED THEN INSERT *;
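
The trade-off between the two strategies can be shown with a small in-memory model. All names here are hypothetical, and files are plain lists of rows. Copy-on-write produces rewritten files at delete time, while merge-on-read records (file, row position) pairs, in the spirit of positional delete files, and filters them out when reading.

```python
def delete_copy_on_write(files, predicate):
    """CoW: rewrite each file that contains a matching row, leaving out those rows."""
    new_files = {}
    for path, rows in files.items():
        if any(predicate(r) for r in rows):
            new_files[path + ".rewritten"] = [r for r in rows if not predicate(r)]
        else:
            new_files[path] = rows  # untouched files are carried over as-is
    return new_files


def delete_merge_on_read(files, predicate):
    """MoR: leave data files alone and record (file, row position) pairs to skip."""
    return {(path, pos)
            for path, rows in files.items()
            for pos, r in enumerate(rows)
            if predicate(r)}


def read_with_deletes(files, position_deletes):
    """The merge step on read: drop rows listed in the positional deletes."""
    return [r
            for path, rows in files.items()
            for pos, r in enumerate(rows)
            if (path, pos) not in position_deletes]


files = {
    "f1.parquet": [{"order_id": 1, "status": "open"},
                   {"order_id": 2, "status": "cancelled"}],
    "f2.parquet": [{"order_id": 3, "status": "open"}],
}


def is_cancelled(row):
    return row["status"] == "cancelled"


cow_files = delete_copy_on_write(files, is_cancelled)
deletes = delete_merge_on_read(files, is_cancelled)

cow_rows = [r for rows in cow_files.values() for r in rows]
mor_rows = read_with_deletes(files, deletes)
print(cow_rows == mor_rows, sorted(deletes))
```

Both paths return the same rows. The difference is where the work happens: copy-on-write pays at write time by rewriting f1.parquet, merge-on-read pays on every read until compaction folds the deletes into the data files.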

Multi-engine: Iceberg's Biggest Trump Card

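The biggest advantage of Iceberg over the alternatives is that multiple engines can read and write the same table at the same time. They need no shared cluster or lock service between them: every writer commits through the catalog, which accepts or rejects each new snapshot atomically. The main integrations per platform: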

AWS Stack

AWS Glue: ETL jobs on Iceberg tables
Amazon Athena: SQL queries without a cluster
Amazon EMR: Spark/Hive/Flink on Iceberg
AWS Lake Formation: Governance + Iceberg
Amazon Redshift: Spectrum on Iceberg tables

Azure / Microsoft

Azure Databricks: Native Iceberg support
Microsoft Fabric: OneLake + Iceberg
Azure HDInsight: Spark on Iceberg
Azure Synapse: Via Spark pools

Google Cloud

BigQuery: Native Iceberg tables (BigLake)
Dataproc: Spark + Iceberg
Dataflow: Beam pipelines on Iceberg

Open Source Engines

Apache Spark: Best integration
Apache Flink: Streaming into Iceberg
Trino / Presto: Ad-hoc SQL
DuckDB: Local queries on Iceberg
Dremio: BI acceleration on Iceberg

Real-world Scenario: Netflix's Iceberg Setup

Netflix, the inventor of Iceberg, uses it like this:

Flink: Real-time streaming writes events into Iceberg (millions per second)
Spark: Batch ETL transformations and compaction jobs
Trino: Data analysts query the data ad-hoc via SQL
The same S3 tables — no data copying or syncing

Result: one platform for streaming and batch, with full SQL support and no vendor lock-in.

Getting Started with Apache Iceberg

1. Test locally with Spark + Iceberg

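# pip install "pyspark==3.5.*"  (must match the Spark version of the runtime artifact below)
# The Iceberg Spark runtime JAR is downloaded automatically via spark.jars.packages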

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
    .getOrCreate()

# Create a table
spark.sql("""
  CREATE TABLE local.db.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
  )
  USING iceberg
  PARTITIONED BY (months(order_date))
""")

# Insert data
spark.sql("""
  INSERT INTO local.db.orders VALUES
    (1, 101, 99.95, DATE '2026-03-01'),
    (2, 102, 249.00, DATE '2026-03-15'),
    (3, 101, 59.99, DATE '2026-03-20')
""")

# Inspect the snapshot history, then read the current table state
spark.sql("SELECT * FROM local.db.orders.snapshots").show()
spark.sql("SELECT * FROM local.db.orders").show()

2. Iceberg on AWS: Glue + Athena

-- In AWS Glue Data Catalog / Athena:

-- Create a table as Iceberg
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE,
  order_date DATE
)
LOCATION 's3://my-datalake/orders/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format' = 'parquet',
  'write_compression' = 'snappy'
);

-- Insert data (no S3 copy needed!)
INSERT INTO orders VALUES (1, 101, 99.95, DATE '2026-03-21');

-- UPDATE (Athena engine v3 + Iceberg)
UPDATE orders
SET amount = 89.95
WHERE order_id = 1;

-- DELETE
DELETE FROM orders WHERE order_id = 1;

-- Time travel (Athena engine v3 syntax; the timestamp must not predate the first snapshot)
SELECT * FROM orders
FOR TIMESTAMP AS OF TIMESTAMP '2026-03-20 12:00:00 UTC';

3. Migrating from Parquet to Iceberg (in-place)

# Migrate an existing Parquet table to Iceberg in Spark
# WITHOUT copying data (only Iceberg metadata is created).
# Assumes `spark` is a SparkSession configured with an Iceberg catalog;
# `catalog`, `parquet` and `iceberg` are placeholder catalog names.

# Option A: migrate in place (the source table is replaced by an Iceberg table)
spark.sql("""
  CALL catalog.system.migrate(
    'db.existing_parquet_table'
  )
""")

# Option B: snapshot (safer: the original stays intact)
spark.sql("""
  CALL catalog.system.snapshot(
    source_table => 'parquet.db.orders',
    table => 'iceberg.db.orders'
  )
""")

# Validate the migration
spark.sql("SELECT COUNT(*) FROM iceberg.db.orders").show()
spark.sql("SELECT * FROM iceberg.db.orders.snapshots").show()

4. Table Maintenance: Compaction & Cleanup

-- Compact small files together (compaction)
CALL catalog.system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'order_date ASC'
);

-- Remove expired snapshots (data cleanup)
CALL catalog.system.expire_snapshots(
  table => 'db.orders',
  older_than => TIMESTAMP '2026-03-01 00:00:00',
  retain_last => 5
);

-- Remove unused metadata and orphan files
CALL catalog.system.remove_orphan_files(
  table => 'db.orders',
  older_than => TIMESTAMP '2026-03-01 00:00:00'
);

-- View table statistics
SELECT * FROM db.orders.files LIMIT 20;
SELECT * FROM db.orders.partitions;

Iceberg Catalogs: The Central Point

A catalog is how Iceberg keeps track of which tables exist and where their metadata lives. The choice of catalog is crucial to your architecture:

Catalog Type	When to use	Examples	Governance
AWS Glue Catalog	AWS environments	Athena, EMR, Glue Jobs	Lake Formation
Hive Metastore	On-premise or migration	Spark, Hive, Trino	Ranger / Kerberos
Nessie	Git-like data versioning	Dremio, Spark	Branch-based
REST Catalog	Multi-cloud / vendor-neutral	Tabular, Snowflake Open Catalog	API-based
Unity Catalog	Databricks ecosystem	Databricks, Delta/Iceberg via UniForm	Column-level
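
Whatever catalog you pick, its core job is the same: hold one pointer per table to the current metadata file and swap it atomically. The sketch below models that with a hypothetical InMemoryCatalog and made-up metadata file names; real catalogs implement the swap with a database transaction, a conditional write or a REST call, so treat this as an illustration of the commit protocol only.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer last read it."""


class InMemoryCatalog:
    """Hypothetical catalog: table name -> location of the current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current_metadata(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Compare-and-swap: succeed only if the pointer still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} was modified by another writer")
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.commit("db.orders", expected=None, new="metadata/v1.json")

# Two writers start from the same table state.
base_a = catalog.current_metadata("db.orders")
base_b = catalog.current_metadata("db.orders")

catalog.commit("db.orders", expected=base_a, new="metadata/v2-writer-a.json")

try:
    catalog.commit("db.orders", expected=base_b, new="metadata/v2-writer-b.json")
    conflict = False
except CommitConflict:
    conflict = True
    # Writer B re-reads the table, reapplies its change and retries.
    base_b = catalog.current_metadata("db.orders")
    catalog.commit("db.orders", expected=base_b, new="metadata/v3-writer-b.json")

print(conflict, catalog.current_metadata("db.orders"))
```

This optimistic scheme is what lets different engines write to the same table safely: the catalog is the only place where they have to agree.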

Recommendation: REST Catalog as a Future Strategy

The industry is moving toward an open REST Catalog API (the Apache Iceberg REST spec). This gives you maximum portability: switch query engine or cloud provider without migrating your catalog. Snowflake Open Catalog (based on Polaris) and Tabular are the frontrunners here.

Best Practices for Production

Partitioning Strategy

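Use hidden partitioning: Iceberg derives partition values from column transforms (days, months, hours, bucket, truncate), so queries filter on the source column and never need to reference a separate partition column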
Start conservatively: prefer partitions that are too large over too small
Avoid high-cardinality columns as a partition key
Use bucket(N, id) for evenly-distributed data

File Size Management

Target file size: 128MB – 512MB per Parquet file
Schedule daily compaction jobs outside peak hours
Use rewrite_data_files with sort for better query performance
Monitor db.table.files for file count trends
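
What a compaction job decides can be sketched as a bin-packing problem: group small files so that each group is rewritten into one file close to the target size. The plan_compaction function below is a hypothetical first-fit-decreasing planner with an assumed 256 MB target; it is not the planner behind rewrite_data_files, only an illustration of the idea.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_size=256 * MB):
    """Group files smaller than the target into rewrite groups (first-fit decreasing)."""
    small = sorted((s for s in file_sizes if s < target_size), reverse=True)
    groups = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    # A group with a single file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


sizes = [300 * MB, 100 * MB, 100 * MB, 50 * MB, 20 * MB, 200 * MB]
plan = plan_compaction(sizes)
plan_mb = [[s // MB for s in group] for group in plan]
print(plan_mb)
```

The 300 MB file is left alone because it already exceeds the target, and the five small files collapse into two output files.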

Snapshot Retention

Keep snapshots for at least 7 days to recover from mistakes
Set history.expire.min-snapshots-to-keep = 5
Automate expire_snapshots and remove_orphan_files
Monitor S3 costs: old snapshots count too
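
The two retention rules interact: a snapshot is only expired when it is older than the age limit and is not among the most recent snapshots you chose to keep. The helper below is a hypothetical sketch of that decision, using (snapshot_id, timestamp_ms) tuples; it mirrors the intent of older_than and retain_last, not the exact implementation of expire_snapshots.

```python
DAY_MS = 24 * 60 * 60 * 1000


def snapshots_to_expire(snapshots, now_ms, max_age_ms, min_to_keep):
    """Return ids of snapshots older than the age limit, except the newest `min_to_keep`."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:min_to_keep]}
    cutoff = now_ms - max_age_ms
    return sorted(sid for sid, ts in newest_first
                  if ts < cutoff and sid not in protected)


# (snapshot_id, commit time), with commit times given as day numbers for readability
history = [(1, 1), (2, 5), (3, 10), (4, 20), (5, 24), (6, 26), (7, 28), (8, 29)]
snapshots = [(sid, day * DAY_MS) for sid, day in history]

expired = snapshots_to_expire(
    snapshots,
    now_ms=30 * DAY_MS,
    max_age_ms=7 * DAY_MS,  # keep at least 7 days of history
    min_to_keep=5,          # and never fewer than 5 snapshots
)
print(expired)
```

Snapshot 4 is older than seven days but survives because it is one of the five most recent, which is the safety net the minimum-snapshots setting provides for tables that are written infrequently.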

Query Performance

Use Z-ordering (sort order) on columns you filter on
Enable bloom filter indexes for high-cardinality lookups
Tune per-column statistics with write.metadata.metrics.column.<column> (modes: none, counts, truncate(N), full)
Prefer positional deletes over equality deletes (faster reads)

Conclusion: When Should You Choose Apache Iceberg?

✅ Choose Iceberg if...

You use multiple query engines (Spark + Trino + Athena)
You want no vendor lock-in (e.g. not purely Databricks)
You are on AWS and use Glue/Athena
You want Snowflake and Spark on the same data
You are building a new platform from scratch
You need partition evolution (growing data)

⚠️ Stick with Delta Lake if...

You are fully on Databricks and happy with it
You use Delta Live Tables (DLT)
Your team is already deep in the Delta ecosystem
You need Unity Catalog with fine-grained access control

🚀 The Future

Databricks UniForm: Delta and Iceberg at the same time
Open REST Catalog: universal interoperability
Apache XTable: automatic format bridging
All major vendors are moving toward Iceberg compatibility

Apache Iceberg is no longer a niche choice. In 2026 it is one of the safest long-term investments for your data platform: with broad vendor adoption, you can usually bring in a new engine or tool without migrating your data again.
