Comment by krapht
6 days ago
I feel like I'm not the target audience for this. When I have large data, then I directly write SQL queries and run them against the database. It's impossible to improve performance when you have to go out to the DB anyway; might as well have it run the query too. Certainly the server ops and DB admins have loads more money to spend on making the DB fast compared with my antivirus-laden corporate laptop.
When I have small data that fits on my laptop, Pandas is good enough.
Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, you can solve 9 of that 10% by dropping down to numpy and picking a better algorithm...
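To make that concrete, here's a toy sketch of the kind of numpy drop-down I mean (the columns and the `qty > 2` rule are made up for illustration):

```python
# Toy example: replace a per-row pandas apply with a vectorized numpy pass.
# Column names and the qty > 2 rule are made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, 1_000_000)})

# Slow: a Python-level function call for every row.
slow = df.apply(lambda r: r["price"] * r["qty"] if r["qty"] > 2 else 0.0, axis=1)

# Fast: one vectorized pass over the underlying numpy arrays.
price, qty = df["price"].to_numpy(), df["qty"].to_numpy()
fast = np.where(qty > 2, price * qty, 0.0)
```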
Your large db doesn’t sound very large. If I want to run a query that requires visiting every row of my biggest table, I will need to run the query a total of 480 times across 96 different Postgres databases. Just `select id from block` will take days to weeks.
But I can visit most rows in that dataset in about 4 hours if I use an OLAP data warehouse, the kind of thing you build on top of DataFusion.
You're right, it isn't for you.
It's largely for companies that can't put everything in a single database because (a) they don't control the source schema (e.g. it's a daily export from a SaaS app), (b) the ROI isn't high enough to justify it, or (c) the data isn't in a relational format (e.g. JSON, logs, telemetry).
And with the trend toward SaaS apps it’s a situation that is becoming more common.
Or when the data is so massive that even BigQuery would be crazy expensive.
I agree. The main reason I shared it is that I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.
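The query step is roughly this shape (shown here with the DataFusion Python bindings just to keep the sketch short; the file path and table name are placeholders, not the actual internals):

```python
# Rough sketch of the querying step using the DataFusion Python bindings.
# The Parquet path and table name are placeholders.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("events", "cache/events.parquet")  # local lookaside copy
result = ctx.sql("""
    SELECT service, count(*) AS n
    FROM events
    GROUP BY service
    ORDER BY n DESC
""").to_pandas()
print(result)
```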
How do you infer your Parquet schemas?
You infer the types of the source data.
For example, you can go through, say, 1% of your data and, for each column, see whether you can coerce all of the values to a float, int, date, string, etc. From there you can set the Parquet schema with the proper types.
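Something like this, as a minimal sketch: newline-delimited JSON and pyarrow are assumed, and the coercion rules and names are illustrative rather than the actual implementation.

```python
# Sketch of 1%-sample type inference over newline-delimited JSON.
# pyarrow is assumed for the Parquet schema; coercion rules are illustrative.
import json
import random
from datetime import datetime
import pyarrow as pa

def classify(value):
    """Tag one scalar with the narrowest type it coerces to."""
    if isinstance(value, bool):
        return "string"                  # keep bools out of the int path
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        for tag, probe in (("int", int), ("float", float)):
            try:
                probe(value)
                return tag
            except ValueError:
                pass
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return "string"                      # None, nested objects, lists, ...

def resolve(tags):
    """Collapse the set of tags seen in a column to one Parquet type."""
    if tags <= {"int"}:
        return pa.int64()
    if tags <= {"int", "float"}:
        return pa.float64()
    if tags == {"date"}:
        return pa.date32()
    return pa.string()                   # anything mixed falls back to string

def infer_parquet_schema(json_lines, sample_rate=0.01):
    """Sample ~1% of the rows and infer a pyarrow/Parquet schema."""
    seen = {}                            # column name -> set of type tags
    for line in json_lines:
        if random.random() > sample_rate:
            continue
        for col, val in json.loads(line).items():
            seen.setdefault(col, set()).add(classify(val))
    return pa.schema([(col, resolve(tags)) for col, tags in seen.items()])
```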
> It's impossible to improve performance when you have to go out to the DB anyway;
That's not right. There are many queries that run far faster in duckdb/datafusion than in (say) Postgres, even with the overhead of pulling whole large tables before running the query. (Or use something like pg_duckdb.)
For certain types of queries these engines can be 100x faster.
More here: https://postgres.fm/episodes/pg_duckdb
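As a rough sketch of the kind of thing I mean, using DuckDB's postgres extension from Python (the connection string, table, and column names are placeholders):

```python
# Sketch: let DuckDB scan a Postgres table and do the aggregation itself.
# Connection string and table/column names are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres)")

# The rows come from Postgres, but the group-by runs in DuckDB's
# vectorized engine, which is where the big OLAP speedups come from.
top = con.sql("""
    SELECT customer_id, sum(amount) AS total
    FROM pg.public.orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
print(top)
```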
> When I have large data, then I directly write SQL queries and run them against the database
What database is that? For example, PgSQL will be XX-XXX times slower on OLAP queries than duckdb/polars/datafusion for various reasons.
Maybe your data is stored in a multi-PB pile of HDF5.