Apache Iceberg Lakehouse

Build data lakehouses with Apache Iceberg, an open table format that provides ACID transactions, time travel, schema evolution, and hidden partitioning on S3/GCS. Covers PyIceberg, Spark, Trino, and Polaris/Lakekeeper catalogs.

This skill helps you adopt Iceberg as your lakehouse table format. It writes PyIceberg code for table creation, append, and upsert; configures Spark and Trino as compute engines; and sets up catalogs (Polaris, Lakekeeper, and Nessie over the Iceberg REST protocol, plus AWS Glue). It also implements schema evolution and partition evolution, runs maintenance (compaction, snapshot expiration, orphan file cleanup), and migrates tables from Delta Lake or Hive. It covers time-travel queries, branching with Nessie, and integration with Snowflake's external Iceberg tables.
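As a rough illustration of the maintenance tasks listed above, the sketch below builds the Spark SQL `CALL` statements for Iceberg's `rewrite_data_files`, `expire_snapshots`, and `remove_orphan_files` procedures. The catalog name `lake`, the table `analytics.customers`, the 7-day retention, and the helper name `maintenance_statements` are all hypothetical; it only generates the statements and assumes a Spark session with the Iceberg SQL extensions would run them.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(catalog: str, table: str, retain_days: int, now: datetime) -> list[str]:
    """Build Spark SQL CALL statements for routine Iceberg table maintenance.

    `catalog` and `table` are placeholders for your own Spark catalog and table names.
    """
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compaction: rewrite many small data files into fewer, larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Snapshot expiration: remove snapshots older than the cutoff.
        # Expired snapshots are no longer available for time travel.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Orphan file cleanup: delete files that no table metadata references.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


if __name__ == "__main__":
    for stmt in maintenance_statements("lake", "analytics.customers", 7, datetime.now(timezone.utc)):
        print(stmt)  # in a real job, something like spark.sql(stmt)
```

The retention window is a trade-off: a shorter window reclaims storage sooner but shortens how far back time-travel queries can reach.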

iceberg lakehouse data spark trino

When to use

Use when you are building a lakehouse architecture on S3/GCS, migrating from Hive or Delta Lake to an open format, adding time travel and schema evolution to petabyte-scale data, or running multiple compute engines against the same tables.

Examples

PyIceberg pipeline

Build a PyIceberg ingestion pipeline

Build a PyIceberg pipeline that ingests daily CSV files into an S3-backed Iceberg table with Glue catalog, with upsert semantics on customer_id
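A minimal sketch of what such a pipeline might look like, assuming a recent PyIceberg release that provides `Table.upsert`, an existing table, and CSV columns whose names and types already match the table schema. The table identifier `analytics.customers`, the region, and the helper names are hypothetical placeholders, not part of the skill itself.

```python
def glue_catalog_properties(region: str) -> dict[str, str]:
    """PyIceberg properties for an AWS Glue catalog; the region is a placeholder."""
    return {"type": "glue", "glue.region": region, "s3.region": region}


def upsert_daily_csv(
    csv_path: str,
    table_identifier: str = "analytics.customers",  # hypothetical table name
    region: str = "us-east-1",  # hypothetical region
) -> tuple[int, int]:
    """Upsert one daily CSV file into an Iceberg table, keyed on customer_id.

    Returns (rows_updated, rows_inserted).
    """
    # Imported lazily so this module loads even where pyiceberg/pyarrow are absent.
    import pyarrow.csv as pacsv
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("glue", **glue_catalog_properties(region))
    table = catalog.load_table(table_identifier)

    # Read the CSV into an Arrow table. Inferred types may need an explicit
    # cast if they differ from the Iceberg schema (not handled in this sketch).
    batch = pacsv.read_csv(csv_path)

    # Rows matching an existing customer_id are updated; the rest are inserted.
    result = table.upsert(batch, join_cols=["customer_id"])
    return result.rows_updated, result.rows_inserted
```

Calling `upsert_daily_csv` once per file keeps each day's load in its own Iceberg commit, which makes a bad load easier to locate in the snapshot history.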

Migrate from Delta Lake

Convert Delta tables to Iceberg

Migrate our Delta Lake tables to Iceberg using the snapshot action from Iceberg's iceberg-delta-lake module, set up a Polaris REST catalog, and verify that Spark and Trino both query the same tables
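For the shared-catalog part of this request, the sketch below generates matching settings for Spark and Trino so both engines point at the same Iceberg REST catalog. The catalog name `polaris`, the URI, and the warehouse name are placeholder assumptions, and authentication settings are deliberately left out.

```python
def spark_rest_catalog_conf(name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Spark session settings that register an Iceberg REST catalog under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def trino_rest_catalog_properties(uri: str, warehouse: str) -> str:
    """Contents of a Trino catalog properties file for the same REST catalog."""
    lines = [
        "connector.name=iceberg",
        "iceberg.catalog.type=rest",
        f"iceberg.rest-catalog.uri={uri}",
        f"iceberg.rest-catalog.warehouse={warehouse}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    uri = "https://polaris.example.com/api/catalog"  # placeholder endpoint
    for key, value in spark_rest_catalog_conf("polaris", uri, "analytics").items():
        print(f"--conf {key}={value}")
    print(trino_rest_catalog_properties(uri, "analytics"))
```

Because both engines resolve tables through the same catalog endpoint, a row count or checksum query run from each engine against the migrated table is a simple way to confirm they see the same snapshot.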