Building an Open, Multi-Engine Data Lakehouse with S3 and Python

2 days ago (tower.dev)

This entire stack also now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.
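A minimal sketch of that array-side stack, assuming zarr and xarray are installed; the store path, variable name, and dimension labels are all illustrative, not from the article:

```python
def write_and_query_zarr(store_path):
    """Write a small labeled array to a Zarr store, then query it with Xarray."""
    import numpy as np
    import xarray as xr  # lazy imports: only needed when actually running this

    # Build a labeled "temperature" cube and persist it as chunked Zarr.
    ds = xr.Dataset(
        {"temperature": (("time", "lat", "lon"), np.random.rand(10, 4, 8))},
        coords={"time": range(10), "lat": range(4), "lon": range(8)},
    )
    ds.to_zarr(store_path, mode="w")

    # Reopen lazily and run a query: mean over time for one latitude band.
    reopened = xr.open_zarr(store_path)
    return reopened["temperature"].isel(lat=0).mean("time").values
```

The Icechunk piece would slot in as the transactional store underneath the same Zarr API, much as Iceberg sits under Parquet files on the tabular side.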

Python support for Iceberg seems to be the biggest unrealized opportunity right now. SQL support seems to be in good shape, with DuckDB and such, but Python support is still quite nascent.
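For context, a sketch of what Python-native Iceberg access looks like today via PyIceberg; the catalog name, catalog properties, and table identifier below are placeholders, not anything from the article:

```python
def read_iceberg_table(table_identifier, catalog_name="default", **catalog_props):
    """Load an Iceberg table through PyIceberg and materialize a scan.

    catalog_name/catalog_props are placeholders -- in practice they would
    point at a REST, Glue, or SQL catalog configured for your lakehouse.
    """
    from pyiceberg.catalog import load_catalog  # lazy import; requires pyiceberg

    catalog = load_catalog(catalog_name, **catalog_props)
    table = catalog.load_table(table_identifier)  # e.g. "analytics.events"
    # Scans can push down column projection and row filters before reading.
    return table.scan(selected_fields=("*",)).to_arrow()
```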

i'm working on a project to do this with Iceberg and SQLMesh executed via Airflow at my job. SQLMesh seems really promising. i investigated multi-engine execution in dbt, and it seems like you need to pay a lot of $$$ for it (multi-engine execution requires multiple dbt projects); it's not included in dbt Core.

  • Toby and the team at Tobiko really are a pleasure to work with. They have strong opinions but have shown a good amount of willingness to implement features as long as there's a strong general use case. We've been working with them for almost a year now, and it's really interesting seeing how an early-ish open source library from a start-up develops (and how much influence you can have over its direction if you work closely with the dev team).

    • that's great to hear. it mirrors my own experience from being in their slack channel. i'm aware of the technical risks of being an early adopter of a product like this, but i must say part of me is excited to be on board early and help shape it from a user perspective. i'm still not totally bought in yet (still in mvp phase), but our use case as we scale almost requires multi-engine execution (athena, spark on EMR, duckdb), and it doesn't seem like anyone is doing it better.

  • I'm one of the cofounders at Tower, posted this because I thought some people would be interested in the topic. Would be interested to know what Airflow is really...doing...for you here? Is it just an execution engine for your sqlmesh? Anyway, as we're trying to build out Tower would love to know more.

    • sqlmesh execution engine + cloud resource provisioning

      provision spark on emr or duckdb on beefy ec2 -> run sqlmesh -> wipe resources.

      i'm still in MVP phase of revamping my company's current data platform, so maybe there are better alternatives -- which i'd love to hear about.
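      The provision → run → wipe loop described above can be sketched as a try/finally lifecycle, regardless of whether Airflow or something else drives it. The provision and teardown callables here are trivial stand-ins for real EMR/EC2 API calls (e.g. via boto3), not anything from the commenter's actual setup:

```python
def run_with_ephemeral_engine(provision, run_job, teardown):
    """Provision compute, run the job, and always wipe the resources."""
    handle = provision()
    try:
        return run_job(handle)
    finally:
        teardown(handle)  # runs even if the job fails, so nothing leaks

# Usage with trivial stand-ins for the three phases:
events = []
result = run_with_ephemeral_engine(
    provision=lambda: events.append("provisioned") or "cluster-1",
    run_job=lambda h: events.append(f"ran sqlmesh on {h}") or "ok",
    teardown=lambda h: events.append(f"wiped {h}"),
)
```

      In an Airflow deployment, each phase would typically be its own task, with teardown wired as a cleanup task that runs regardless of upstream success.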


This article is about building an open data lakehouse with the new open table format, namely Iceberg.

For building a single-engine, AWS-based data lakehouse, you can refer to this article [1], or just use Amazon SageMaker, which also supports Iceberg.

Fun Amazon AWS data storage dictionary:

S3: Data Lake

Glacier: Archival Storage

DocumentDB: NoSQL Document Database à la MongoDB

DynamoDB: NoSQL Key-Value and Wide-Column Database

RDS: SQL Database

Timestream: Time-Series Database

Neptune: Graph Database

Redshift: Data Warehouse

SageMaker: Data Lakehouse

Islander: Data Mesh (okay kidding, just made this up)

[1] Build a Lake House Architecture on AWS:

https://aws.amazon.com/blogs/big-data/build-a-lake-house-arc...

I am getting

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

error when trying to run

aws s3 ls s3://mango-public-data/lakehouse-snapshots/peach-lake --recursive
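One common cause worth checking (an assumption, since the bucket policy isn't shown): if the bucket is meant to be publicly listable, signed requests using whatever credentials happen to be in your environment can still be denied, while an anonymous request succeeds. The CLI flag for that is `--no-sign-request`; a boto3 equivalent:

```python
def list_public_prefix(bucket, prefix):
    """List objects in a public bucket without signing the request.

    Only helps if the bucket policy actually allows anonymous ListBucket;
    otherwise the same AccessDenied comes back.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # UNSIGNED disables credential lookup and request signing entirely.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```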