This entire stack also now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.
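For anyone who wants a concrete feel for that array-side stack, here is a minimal sketch of Xarray writing and reading Zarr on S3. The bucket and path are placeholders, and plain s3fs stands in for Icechunk's versioned store just to keep the example small.

    # Requires: xarray, zarr, s3fs, numpy
    import numpy as np
    import xarray as xr

    # Build a small labelled array
    ds = xr.Dataset(
        {"temperature": (("time", "lat", "lon"), np.random.rand(4, 2, 3))},
        coords={"time": np.arange(4), "lat": [10.0, 20.0], "lon": [30.0, 40.0, 50.0]},
    )

    # Write chunked Zarr straight to object storage (placeholder bucket)
    ds.to_zarr("s3://my-bucket/demo.zarr", mode="w")

    # Lazily open it back and run a query-style reduction
    opened = xr.open_zarr("s3://my-bucket/demo.zarr")
    print(opened["temperature"].mean(dim="time").compute())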
nice pointer. Thanks! putting Zarr/icechunk/xarray into my weekend projects queue.
Python support of Iceberg seems to be the biggest unrealized opportunity right now. SQL support seems to be in good shape, with DuckDB and such, but Python support is still quite nascent.
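For reference, the nascent Python path today is PyIceberg; a hedged sketch of what a query looks like, where the catalog endpoint and table name are placeholders rather than anything from the article:

    # Requires: pyiceberg (catalog settings can also live in ~/.pyiceberg.yaml)
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "default",
        **{
            "uri": "http://localhost:8181",          # assumed REST catalog endpoint
            "s3.endpoint": "http://localhost:9000",  # assumed S3-compatible object store
        },
    )

    table = catalog.load_table("analytics.events")   # hypothetical table

    # Push down a filter and a column projection, then materialize
    df = table.scan(
        row_filter="event_date >= '2024-01-01'",
        selected_fields=("event_id", "event_date", "user_id"),
    ).to_pandas()
    print(df.head())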
i'm working on a project to do this with iceberg and sqlmesh, executed via airflow, at my job. sqlmesh seems really promising. i investigated multi-engine execution in dbt and it seems like you need to pay a lot of $$$ for it (multi-engine execution requires multiple dbt projects) and it's not included in dbt core.
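To make the multi-engine point concrete: in SQLMesh each engine is a "gateway" in a single project, selected at run time. A rough sketch of the Python config follows; it is not the commenter's actual setup, and the heavier gateways (Athena, Spark on EMR) are left as comments because their connection config classes vary by SQLMesh version.

    # config.py -- following the documented SQLMesh Python config API
    from sqlmesh.core.config import (
        Config,
        DuckDBConnectionConfig,
        GatewayConfig,
        ModelDefaultsConfig,
    )

    config = Config(
        model_defaults=ModelDefaultsConfig(dialect="duckdb"),
        default_gateway="local",
        gateways={
            # Local development / small models on DuckDB
            "local": GatewayConfig(connection=DuckDBConnectionConfig(database="local.db")),
            # Athena or Spark-on-EMR gateways would be registered here the same way
            # and selected with `sqlmesh --gateway <name> run`.
        },
    )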
Toby and the team at Tobiko really are a pleasure to work with. They have strong opinions but have shown a good amount of willingness to implement features as long as there's a strong general use case. We've been working with them for almost a year now, and it's really interesting to see how an early-ish open source library built by a start-up develops (and how much influence you can have over its direction if you work closely with the dev team).
that's great to hear. it mirrors what i've observed from being in their slack channel. i'm aware of the technical risks of being an early adopter of a product like this, but i must say part of me is excited to be on board early and help shape it from a user perspective. i'm still not totally bought in yet (still in the mvp phase), but our use case as we scale almost requires multi-engine execution (athena, spark on EMR, duckdb), and it doesn't seem like anyone is doing it better.
I'm one of the cofounders at Tower; I posted this because I thought some people would be interested in the topic. I'd be interested to know what Airflow is really... doing... for you here? Is it just an execution engine for your sqlmesh? Anyway, as we're trying to build out Tower, I'd love to know more.
sqlmesh execution engine + cloud resource provisioning
provision spark on emr or duckdb on beefy ec2 -> run sqlmesh -> wipe resources.
i'm still in MVP phase of revamping my company's current data platform, so maybe there are better alternatives -- which i'd love to hear about.
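A rough sketch of that provision → run sqlmesh → wipe flow as an Airflow DAG, using the stock Amazon provider operators. The DAG name, cluster spec, and gateway name are assumptions, not details from the thread.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrTerminateJobFlowOperator,
    )

    JOB_FLOW_OVERRIDES: dict = {}  # EMR cluster spec omitted; fill in for your account

    with DAG(
        "sqlmesh_on_emr",
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_emr_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
        )
        run_sqlmesh = BashOperator(
            task_id="run_sqlmesh",
            # --gateway picks the engine defined in the SQLMesh config
            bash_command="sqlmesh --gateway spark run",
        )
        terminate_cluster = EmrTerminateJobFlowOperator(
            task_id="terminate_emr_cluster",
            job_flow_id=create_cluster.output,
            trigger_rule="all_done",  # wipe resources even if the run fails
        )
        create_cluster >> run_sqlmesh >> terminate_cluster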
This article is about building an open data lakehouse with the new open table format, namely Iceberg.
For building a single-engine, AWS-based data lakehouse you can refer to this article [1], or just use Amazon SageMaker, which also supports Iceberg.
Fun Amazon AWS data storage dictionary:
S3: Data Lake
Glacier: Archival Storage
DocumentDB: NoSQL Document Database ala MongoDB
DynamoDB: NoSQL KV and WC Database
RDS: SQL Database
Timestream: Time-Series Database
Neptune: Graph Database
Redshift: Data Warehouse
SageMaker: Data Lakehouse
Islander: Data Mesh (okay kidding, just made this up)
[1] Build a Lake House Architecture on AWS:
https://aws.amazon.com/blogs/big-data/build-a-lake-house-arc...
I am getting
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
error when trying to run
aws s3 ls s3://mango-public-data/lakehouse-snapshots/peach-lake --recursive
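If the bucket is meant to be public, the listing may need to be made anonymously; below is a boto3 equivalent using unsigned requests (the same idea as `aws s3 ls --no-sign-request`). Whether that actually resolves the AccessDenied depends on the bucket's policy, which is an assumption here.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous (unsigned) client for a public bucket
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(
        Bucket="mango-public-data",
        Prefix="lakehouse-snapshots/peach-lake",
    ):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])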