by datastudy.nl

Field notes for enterprise data engineers and scientists

Should you put your data lake in Snowflake Iceberg tables?

Snowflake Iceberg tables store data in open Apache Iceberg format in your own cloud storage, so Snowflake bills nothing for storage and other engines can read it. The catch: only Snowflake-managed catalog tables get full platform support; external-catalog tables lose clustering, cloning, and replication.

Figure (bar chart, illustrative): Snowflake platform features supported, about 9 for a Snowflake-managed Iceberg catalog versus about 3 for an external catalog. Source: Snowflake Iceberg tables documentation.

The question landing in every enterprise data architecture review right now: do we keep paying Snowflake to store our data, or do we put it in open Apache Iceberg tables and keep one copy that every engine can read? Iceberg is an open table format that adds database features (ACID transactions, schema evolution, hidden partitioning, and time-travel snapshots) on top of plain Parquet files sitting in your own S3, GCS, or Azure storage. Snowflake supports it natively. The headline that makes finance lean in: with an external volume, Iceberg tables incur zero Snowflake storage cost, because your cloud provider bills you for the bytes directly. The headline that should make architects slow down: not all Iceberg tables are equal, and the catalog you choose decides how much of Snowflake you actually get to keep.

This guide is for the data engineer or platform lead weighing a lakehouse move on Snowflake: when Iceberg is the right call, what you give up, and how to avoid the interoperability promise quietly costing you more than it saves.

What problem does Iceberg actually solve?

A standard Snowflake table stores data in Snowflake's proprietary format. It is fast and fully managed, but the data is locked behind Snowflake's engine: Spark, Trino, or Databricks cannot read it without exporting a copy. The moment two teams want two engines on the same data, you are either copying it or paying twice.

Iceberg breaks that lock. The table data lives as Parquet in open cloud storage, and an Iceberg catalog tracks which files make up the current table state. Any Iceberg-aware engine can read it. So the real question is not "Snowflake or open format" but who owns the catalog, because Snowflake supports two very different models and they are not interchangeable.

  • Snowflake as the catalog (Snowflake-managed): Snowflake owns the metadata pointer, handles maintenance like compaction, and gives you read and write with full platform support. The data still sits in your external volume, so storage stays cheap, but the table behaves almost like a native Snowflake table.
  • External catalog (AWS Glue, Databricks Unity Catalog, a remote Iceberg REST catalog): another system owns the metadata. Snowflake connects through a catalog integration and gives you limited platform support. Snowflake does not manage the table lifecycle here.
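To make "owning the metadata pointer" concrete, here is a toy model of what a catalog does. It is a deliberate simplification, not Iceberg's actual metadata layout (which goes through metadata files, manifest lists, and manifests), and every name in it (ToyCatalog, Snapshot, commit, files_at) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: the data files that make it up."""
    snapshot_id: int
    data_files: frozenset


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog: snapshots plus one 'current' pointer."""
    snapshots: dict = field(default_factory=dict)
    current_id: Optional[int] = None

    def files_at(self, snapshot_id: Optional[int] = None) -> frozenset:
        """Files visible at a snapshot (default: the current one)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid].data_files if sid is not None else frozenset()

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> int:
        """Build a new snapshot from the current one, then move the pointer."""
        files = (self.files_at() - frozenset(removed)) | frozenset(added)
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, files)
        self.current_id = new_id  # readers see the new state only after this swap
        return new_id


catalog = ToyCatalog()
first = catalog.commit(added=["a.parquet", "b.parquet"])
second = catalog.commit(added=["c.parquet"], removed=["a.parquet"])

print(sorted(catalog.files_at()))       # current table state
print(sorted(catalog.files_at(first)))  # time travel to the first snapshot
```

Whoever runs the equivalent of commit is the system that owns the table. That is why the two catalog models are not interchangeable: only one system at a time can be the authority that moves the pointer.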

How much of Snowflake do you give up with an external catalog?

This is the trade-off the marketing glosses over, and it is the single most important thing to get right. Going Snowflake-managed keeps nearly the whole platform. Going external-catalog trades platform features for ecosystem neutrality.

Figure (bar chart, illustrative): a Snowflake-managed Iceberg catalog retains most platform features (about 9) while an external catalog supports a much smaller subset (about 3). Source: Snowflake Iceberg tables documentation, considerations and limitations.

Per Snowflake's own documentation, here is what the catalog choice actually costs you:

Capability | Snowflake-managed catalog | External catalog
---------- | ------------------------- | ----------------
Read and write | Full | Full (writes supported via Iceberg REST)
Clustering keys | Supported | Not supported
Replication | Supported | Not supported
Cloning | Supported | Not supported (externally managed)
Standard streams (CDC) | Supported | Insert-only streams only
Lifecycle maintenance (compaction) | Snowflake handles it | You own it in the external engine
Query from other engines | Sync to Snowflake Open Catalog | Native, that is the point

The pattern that follows: if Snowflake is your primary engine and you just want cheap open storage, use Snowflake as the catalog and sync to Open Catalog when another engine needs read access. If a Spark or Databricks platform is the system of record and Snowflake is a guest, use the external catalog and accept that clustering, cloning, and replication are off the table. Snowflake even supports a catalog-linked database that stays in sync with a remote Iceberg REST catalog, including bidirectional access to Databricks Unity Catalog, so the two platforms can share one copy of the data.
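As a rough illustration, the table and the pattern above can be turned into a pre-migration check. The sketch below only transcribes the rows of the table that differ between the two models; the dictionary keys and the function names lost_features and suggest_catalog are hypothetical, and the rule of thumb is this article's, not an official Snowflake recommendation.

```python
# Feature support per catalog model, transcribed from the table above.
FEATURES = {
    "clustering keys": {"snowflake_managed": True, "external": False},
    "replication": {"snowflake_managed": True, "external": False},
    "cloning": {"snowflake_managed": True, "external": False},
    # External catalogs get insert-only streams, not standard streams.
    "standard streams": {"snowflake_managed": True, "external": False},
    # With an external catalog, compaction is done in the external engine.
    "managed compaction": {"snowflake_managed": True, "external": False},
}


def lost_features(required, catalog):
    """Return the required features that the chosen catalog model does not support."""
    if catalog not in ("snowflake_managed", "external"):
        raise ValueError(f"unknown catalog model: {catalog!r}")
    return sorted(name for name in required if not FEATURES[name][catalog])


def suggest_catalog(system_of_record):
    """Rule of thumb from the text: external only if another engine owns the data."""
    if system_of_record.strip().lower() == "snowflake":
        return "snowflake_managed"
    return "external"


needs = ["clustering keys", "replication"]
choice = suggest_catalog("Spark")
print(choice, lost_features(needs, choice))
```

Running the check against the features your largest tables actually depend on is a cheap way to surface the clustering or replication gap before the migration rather than six months after it.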

What does it really cost, and where does the bill hide?

The storage saving is real but narrower than it sounds. Snowflake bills you for virtual warehouse compute and cloud services whenever you query or maintain Iceberg tables, exactly as it does for native tables. What changes is storage.

Figure (bar chart, illustrative): Snowflake storage bill of 100 for a standard table, 0 for Iceberg on an external volume (your cloud provider bills you instead), and 100 for Iceberg on Snowflake-managed storage, where Snowflake charges storage as normal. Source: Snowflake Iceberg tables billing documentation.

The nuance that catches teams out: Snowflake charges nothing for storage only when you use your own external volume. If you pick the convenience option of Snowflake-managed storage (EXTERNAL_VOLUME = SNOWFLAKE_MANAGED), Snowflake charges for storage just like a normal table, and the headline saving evaporates. The most common surprise bill is geography. If your Snowflake account and your external volume sit in different regions, every query triggers cross-region egress that your cloud provider charges you for, and for Snowflake-managed tables Snowflake adds cross-region data-transfer usage on top. Keep compute and storage in the same region, or that interoperability dream turns into an egress line item.
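A back-of-the-envelope model shows how quickly egress can cancel the storage saving. This is a sketch under loud assumptions: every price below is an invented placeholder, not a Snowflake or cloud-provider rate; compute is left out because it is billed the same way in all three setups; and Snowflake's own cross-region transfer charge for managed tables is not modelled. The names Prices and monthly_storage_and_transfer are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder monthly prices per TB. Replace with your contract rates."""
    snowflake_storage_per_tb: float
    cloud_storage_per_tb: float
    cross_region_egress_per_tb: float


def monthly_storage_and_transfer(tb_stored, tb_read_cross_region, mode, prices):
    """Split the monthly storage and egress bill between Snowflake and the cloud provider."""
    if mode == "external_volume":
        # Snowflake bills no storage; the cloud provider bills bytes and egress.
        snowflake = 0.0
        cloud = (tb_stored * prices.cloud_storage_per_tb
                 + tb_read_cross_region * prices.cross_region_egress_per_tb)
    elif mode in ("standard_table", "snowflake_managed_storage"):
        # Snowflake bills storage as for a normal table.
        snowflake = tb_stored * prices.snowflake_storage_per_tb
        cloud = 0.0
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"snowflake": snowflake, "cloud_provider": cloud}


# Invented numbers, chosen only to make the arithmetic easy to follow.
prices = Prices(snowflake_storage_per_tb=25.0,
                cloud_storage_per_tb=20.0,
                cross_region_egress_per_tb=10.0)

same_region = monthly_storage_and_transfer(100, 0, "external_volume", prices)
cross_region = monthly_storage_and_transfer(100, 50, "external_volume", prices)
standard = monthly_storage_and_transfer(100, 0, "standard_table", prices)

for label, bill in [("same region", same_region),
                    ("cross region", cross_region),
                    ("standard table", standard)]:
    print(label, bill, "total:", sum(bill.values()))
```

With these placeholder prices, 100 TB on a same-region external volume totals 2000 against 2500 for a standard table, but reading 50 TB a month across regions brings the external-volume total back to 2500. The break-even point depends entirely on your real rates and query volume, which is the reason to measure before assuming.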

Two more honest caveats before you migrate a critical table:

  • No Fail-safe. Iceberg tables on an external volume get no Snowflake Fail-safe recovery. You own data protection and recovery for that storage. For a standard table, Snowflake's seven-day Fail-safe is a safety net you are quietly giving up.
  • Maintenance is real work. For externally managed tables, you own compaction and cleanup. Excessive position deletes can actually block table creation and refresh, and orphan files from failed writes can make your storage bill drift above what Snowflake reports. This is operational overhead a native table never asked of you.
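The orphan-file drift mentioned above comes down to a set difference: files present in storage that no table metadata references. The sketch below shows only that core idea on in-memory data. The function names and sample paths are hypothetical, and a real cleanup would also have to account for every retained snapshot and for writes still in flight, which this deliberately ignores.

```python
def find_orphans(stored_sizes, referenced_files):
    """Return paths present in storage but not referenced by any table metadata."""
    return sorted(set(stored_sizes) - set(referenced_files))


def orphan_bytes(stored_sizes, referenced_files):
    """Bytes you pay the cloud provider for that the table does not account for."""
    return sum(stored_sizes[path] for path in find_orphans(stored_sizes, referenced_files))


# Hypothetical storage listing: path -> size in bytes.
stored_sizes = {
    "data/part-0001.parquet": 128_000_000,
    "data/part-0002.parquet": 64_000_000,
    "data/part-0003-failed-write.parquet": 32_000_000,
}
# Hypothetical set of files referenced across all retained snapshots.
referenced_files = {"data/part-0001.parquet", "data/part-0002.parquet"}

print(find_orphans(stored_sizes, referenced_files))
print(orphan_bytes(stored_sizes, referenced_files))
```

In practice you would rely on the maintenance tooling of the engine that owns the table rather than hand-rolled deletes, since removing a file that a live snapshot still needs corrupts the table.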

When is Iceberg the right call, and when is it a trap?

Reach for Iceberg when you have a genuine multi-engine reality or an existing data lake you cannot or will not move into Snowflake. The open format pays off precisely when more than one tool needs the same data and the alternative is copies that drift. Snowflake says it plainly: Iceberg tables are ideal for existing data lakes you choose not to store in Snowflake.

Do not reach for Iceberg when Snowflake is your only engine and you just heard "open format" in a keynote. For a single-engine shop, a Snowflake-managed Iceberg table on an external volume buys you cheaper storage and an exit option, which is a reasonable hedge, but a standard table buys you Fail-safe, zero maintenance, and every feature with no asterisks. The worst outcome is adopting an external catalog for openness you never use, then discovering six months later that you cannot cluster your biggest table or replicate it for disaster recovery.

If cost is the driver, weigh the storage saving against the maintenance time and egress risk before you commit. It is the same discipline as in the warehouse sizing guide: measure the real number, and do not assume the cheap-looking option is cheaper once you count everything. The rest of the Snowflake guides cover the compute side that Iceberg does not change.

The decision in one line

Pick the catalog before you pick the format. Snowflake-managed Iceberg gives you open storage with almost the whole platform intact; an external catalog gives you true engine neutrality at the cost of clustering, cloning, replication, and a maintenance burden you now own. Choose external only when a second engine is genuinely the system of record; otherwise you are paying in lost features for an openness you will never spend.

Sources