Comment by dharbin

3 days ago

Why would Snowflake develop and release this? Doesn't this cannibalize their main product?

One thing I admire about Snowflake is a real commitment to self-cannibalization. They were super out front with Iceberg even though it could disrupt them, because that's what customers were asking for, and they're willing to bet they'll figure out how to make money in that new world.

Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195

  • Have you interacted with Snowflake teams much? We are using external Iceberg tables with Snowflake. Every interaction pretty much boils down to: you really should not be using Iceberg, you should be using Snowflake for storage. It's also pretty obvious some things are strategically not implemented to push you very strongly in that direction.

    • Not surprised - this stuff isn’t fully mature yet. But I interact with their team a lot and know they have a commitment to it (I’m the other guy in that video).

    • Out of curiosity - can you share a few examples of functionality currently not supported with Iceberg but that works well with their internal format?

  • Supporting Iceberg means eventually having people leave you because they find something better elsewhere, but it's bidirectional: it also means you can welcome people from Databricks because you have feature parity.

It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending toward heterogeneous compute with a shared storage backend -- you have large, autoscaling systems that process the raw data down to something usable by a smaller, cheaper query engine supporting UIs/services.
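
To make that concrete, here is a minimal sketch of the "small, cheap query engine over shared storage" side, using DuckDB's iceberg extension as the example reader. The bucket, table path, and column names are made up, and S3 credentials are assumed to be configured separately.

    -- A lightweight engine reading the same Iceberg tables on object storage
    -- that a heavyweight engine writes. Paths and columns are illustrative;
    -- credentials are assumed to be set up separately (e.g. via a DuckDB secret).
    INSTALL iceberg; LOAD iceberg;
    INSTALL httpfs;  LOAD httpfs;

    SELECT event_date, count(*) AS events
    FROM iceberg_scan('s3://shared-lake/analytics/events')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY event_date
    ORDER BY event_date;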

  • But if you are used to this kind of compute per dollar, what on earth would make you want to move to Snowflake?

    • Different parts of the analytical stack have different performance requirements and characteristics. Maybe none of your stack needs it and so you never need Snowflake at all.

      More likely, you don't need Snowflake to process queries from your BI tools (Mode, Tableau, Superset, etc.), but you do need it to prepare data for those BI tools. It's entirely possible that you have hundreds of terabytes, if not petabytes, of input data that you want to pare down to < 1 TB datasets for querying, and Snowflake can chew through those datasets. There are also third-party integrations and things like ML tooling that you need to consider.

      You shouldn't really think of analytical systems the same way as a database backing a service. Analytical systems are designed to funnel large datasets that cover the entire business (cutting across services and any sharding you've done) into successively smaller datasets that are cheaper and faster to query. And you may be using different compute engines for different parts of these pipelines; there's a good chance you're not using only Snowflake but Snowflake plus a bunch of different tools.
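
      To make that funnel concrete, here is a generic sketch (all table and column names are hypothetical) of the kind of step a heavy engine runs so that the BI tool only ever touches a small table:

          -- A heavy engine pares raw input (potentially hundreds of TB) down to
          -- a small aggregate that a BI tool can query cheaply and often.
          -- All names here are hypothetical.
          CREATE TABLE analytics.daily_revenue AS
          SELECT
              order_date,
              region,
              sum(amount) AS revenue,
              count(*)    AS orders
          FROM raw.orders
          GROUP BY order_date, region;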

When we first developed pg_lake at Crunchy Data and defined the GTM, we considered whether it could be a Snowflake competitor, but we quickly realised that did not make sense.

Data platforms like Snowflake are built as a central place to collect your organisation's data, handle governance, run large-scale analytics, do AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.

pg_lake primarily targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses, where you ended up overloading your server with slow analytical queries but still want to keep using Postgres.
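
For concreteness, a rough sketch of the kind of workflow in mind; the exact statements below (including the USING iceberg access method) are illustrative rather than quoted from the docs.

    -- Rough sketch only; exact pg_lake syntax is illustrative, not quoted.
    -- Idea: keep using Postgres, but let large analytical tables live as
    -- Iceberg on object storage instead of overloading the OLTP server.
    CREATE TABLE events_lake (
        event_time timestamptz,
        user_id    bigint,
        payload    jsonb
    ) USING iceberg;                      -- assumed table access method

    -- Offload cold rows from the hot Postgres table...
    INSERT INTO events_lake
    SELECT event_time, user_id, payload
    FROM events
    WHERE event_time < now() - interval '30 days';

    -- ...and keep querying them with plain SQL, no separate ETL pipeline.
    SELECT date_trunc('day', event_time) AS day, count(*)
    FROM events_lake
    GROUP BY 1
    ORDER BY 1;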

It'll probably be really difficult to set up.

If it's anything like Supabase, you'll question the existence of God when trying to get it to work properly.

You pay them to make it work right.

  • For testing, we at least have a Dockerfile that automates the setup of the pgduck_server and a MinIO instance, so it Just Works™ once the extensions are installed in your local Postgres cluster.

    The configuration mainly involves defining the default Iceberg location for new tables, pointing it at the pgduck_server, and providing the appropriate auth/secrets for your bucket access.
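
    As a rough illustration (setting names are placeholders here; see the docs for the actual parameter names), that configuration boils down to something like:

        -- Placeholder setting names for illustration only.
        ALTER SYSTEM SET pg_lake.default_location = 's3://lake/tables/';  -- default Iceberg location for new tables
        ALTER SYSTEM SET pg_lake.pgduck_endpoint  = 'localhost:5332';     -- where pgduck_server is listening
        SELECT pg_reload_conf();

        -- Bucket auth (MinIO test credentials in the Docker setup) is typically
        -- supplied on the pgduck_server side, e.g. via AWS_ACCESS_KEY_ID /
        -- AWS_SECRET_ACCESS_KEY plus an endpoint override pointing at MinIO.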