Apache Iceberg Lakehouse
Build data lakehouses with Apache Iceberg, an open table format that provides ACID transactions, time travel, schema evolution, and hidden partitioning on object storage such as S3 or GCS. Covers PyIceberg, Spark, Trino, and the Polaris and Lakekeeper catalogs.
This skill helps you adopt Iceberg as your lakehouse table format. It writes PyIceberg code for table creation, appends, and upserts; configures Spark and Trino as compute engines; sets up catalogs (Polaris, Lakekeeper, Nessie, Glue); implements schema and partition evolution; runs maintenance (compaction, snapshot expiration, orphan file cleanup); and migrates tables from Delta Lake or Hive. It also covers time-travel queries, branching with Nessie, and integration with Snowflake's external Iceberg tables.
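As a rough illustration of the time-travel queries mentioned above, the sketch below reads a table as of the snapshot that preceded the current one. It assumes `table` is an already loaded PyIceberg table; the helper names `previous_snapshot_id` and `read_as_of` are hypothetical and exist only for this example.

```python
def previous_snapshot_id(table):
    """Return the parent of the current snapshot, or None if there is none."""
    current = table.current_snapshot()  # None for a table with no commits yet
    if current is None:
        return None
    # The parent ID may refer to a snapshot that has since been expired;
    # this sketch does not check for that.
    return current.parent_snapshot_id


def read_as_of(table, snapshot_id):
    """Scan the table as it looked at the given snapshot and return Arrow data."""
    return table.scan(snapshot_id=snapshot_id).to_arrow()
```

A caller would typically check that `previous_snapshot_id(table)` is not None before passing it to `read_as_of`.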
When to use
Use this skill when you are building a lakehouse architecture on S3/GCS, migrating from Hive or Delta Lake to an open format, working with petabyte-scale data that needs time travel and schema evolution, or sharing the same tables across multiple compute engines.
Examples
PyIceberg pipeline
Build a PyIceberg ingestion pipeline
Build a PyIceberg pipeline that ingests daily CSV files into an S3-backed Iceberg table registered in the Glue catalog, with upsert semantics keyed on customer_id.
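A minimal sketch of what such a pipeline might look like follows. It assumes a recent PyIceberg release that provides `Table.upsert`, an existing table whose schema matches the Arrow types inferred from the CSV, and AWS credentials available in the environment. The table name `analytics.customers` and the helper names are hypothetical.

```python
def load_daily_csv(path):
    """Read one daily CSV file into an Arrow table."""
    import pyarrow.csv as pacsv  # imported lazily; requires pyarrow
    return pacsv.read_csv(path)


def upsert_batch(table, batch, key="customer_id"):
    """Upsert an Arrow batch into an Iceberg table, matching rows on `key`.

    Returns (rows_updated, rows_inserted) as reported by PyIceberg.
    """
    # Rows whose key already exists are updated; the rest are inserted.
    result = table.upsert(batch, join_cols=[key])
    return result.rows_updated, result.rows_inserted


def ingest_day(csv_path, table_name="analytics.customers"):
    """Load one day's CSV and upsert it into a Glue-cataloged table."""
    from pyiceberg.catalog import load_catalog  # requires pyiceberg[glue]

    # "glue" is only a local label; the "type" property selects the Glue catalog.
    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table(table_name)  # hypothetical, pre-existing table
    return upsert_batch(table, load_daily_csv(csv_path))
```

Keeping `upsert_batch` separate from catalog and file access makes the merge step easy to exercise against a stub table before pointing it at real data.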
Migrate from Delta Lake
Convert Delta tables to Iceberg
Migrate our Delta Lake tables to Iceberg using the snapshot procedure, set up a Polaris REST catalog, and verify that Spark and Trino both query the same tables.
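The verification step in this prompt could be sketched as a row-count comparison between the two engines. This is a coarse check under stated assumptions: `spark` is an existing SparkSession already configured for the catalog, `trino_conn` is a DB-API connection (for example one opened with the `trino` Python client), and the table names come from trusted configuration because they are interpolated directly into SQL. All function names are hypothetical.

```python
def spark_row_count(spark, table_name):
    """Count rows in a table as seen by Spark."""
    return spark.table(table_name).count()


def trino_row_count(conn, table_name):
    """Count rows in a table as seen by Trino through a DB-API connection."""
    cur = conn.cursor()
    # table_name is assumed to be trusted input; it is not escaped here.
    cur.execute(f"SELECT count(*) FROM {table_name}")
    return cur.fetchone()[0]


def engines_agree(spark, trino_conn, spark_table, trino_table):
    """Return (match, spark_count, trino_count) for one migrated table."""
    spark_count = spark_row_count(spark, spark_table)
    trino_count = trino_row_count(trino_conn, trino_table)
    return spark_count == trino_count, spark_count, trino_count
```

Matching counts do not prove the data is identical; comparing per-column checksums or a sample of rows would give stronger evidence.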