Comment by crazygringo
3 months ago
> Once you stop thinking iteratively and start thinking in set operations it becomes quite natural and powerful.
I dunno... I've written a tremendous amount of SQL, and I still have to think imperatively (iteratively) in order to write queries that are actually performant, and to know which indexes need to exist.
It would be wonderful if I could just think in terms of set operations, but that tends to result in queries that take 5 minutes to execute rather than 5 milliseconds.
My entire thought process is basically -- what table do I start with, what rows in what order, joining to what, under what conditions, aggregating how, rinse and repeat... It's entirely a mental model of loops and aggregation, never of set operations.
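For a concrete (entirely made-up) example, that walk-through maps onto a query roughly like this -- pick the driving table, filter early, loop-join through an index, aggregate at the end:

    -- Hypothetical schema, just to illustrate the "loops" mental model
    SELECT c.region, count(*) AS orders, sum(o.total) AS revenue
    FROM orders o                                      -- which table do I start with?
    JOIN customers c ON c.id = o.customer_id           -- index lookup per order row
    WHERE o.created_at >= now() - interval '30 days'   -- filter before joining
    GROUP BY c.region;                                  -- aggregate last
    -- An index on orders(created_at) plus the primary key on customers(id)
    -- are what keep the "inner loop" cheap.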
That may be true, until you do your ETL in an index-less database such as BigQuery or Trino. Postgres will always be faster for optimized, end-user-serving queries.
But BigQuery lets you scale out to hundreds of CPUs without having to worry about indexes.
This sounds awful.
I would do almost any amount of iteration and index tuning to keep the query on a single machine rather than deal with a networked distributed system.
When you get slow queries, the real problem is algorithmic complexity, and adding workers only helps linearly, which only goes so far.
You can use DuckDB on a single machine. It's also indexless (or more accurately, you don't have to explicitly create indexes).
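A minimal sketch of that (file name made up) -- DuckDB just scans the Parquet file, relying on its built-in per-row-group min/max statistics rather than user-created indexes:

    -- No CREATE INDEX anywhere; the file path is hypothetical
    SELECT customer_id, count(*) AS orders
    FROM read_parquet('orders.parquet')
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY customer_id;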
You are rightfully proud of your skills!
Yes, I'm talking about end user queries. Not reports that take 2 hours to run.
But even with BigQuery, you've still got to worry about partitioning and clustering, and yes, they've even added indexes now.
The only time you really just get to think in sets is when performance doesn't matter at all and you don't mind if your query takes hours. Which maybe is your case.
But also -- the issue isn't generally CPU, but rather communication/bandwidth. If you're joining 10 million rows to 10 million rows, the two biggest things that matter are whether those 10 million rows are on the same machine, and whether you're joining on an index. The problem isn't CPU-bound, and more CPUs aren't going to help much.
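As a hedged sketch (table names made up), this is the kind of check I mean -- make sure the join key is indexed and see whether the planner can actually use it, with everything staying on one box:

    CREATE INDEX IF NOT EXISTS idx_line_items_order_id ON line_items (order_id);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.id, sum(li.amount) AS total
    FROM orders o
    JOIN line_items li ON li.order_id = o.id
    WHERE o.created_at >= date '2024-01-01'   -- selective filter on the driving side
    GROUP BY o.id;
    -- If the filter cuts orders down enough, the planner can do an index
    -- nested loop into line_items instead of hashing all 10 million rows;
    -- either way, nothing leaves the machine.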
Of course there are optimizations to be made, such as not joining on the raw data or saving the ORDER BY for last (sketched below), and avoiding outer joins between two large partitioned tables.
But to me those optimizations are not imperative in nature.
(And BQ will probably eat the 10 million to 10 million join for breakfast...)
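For example (table names made up), "not joining on the raw data" plus "ORDER BY last" can look like pre-aggregating each side before the join and only sorting the small final result:

    WITH order_totals AS (
      SELECT customer_id, sum(total) AS revenue
      FROM orders
      GROUP BY customer_id               -- shrink the big table first
    )
    SELECT c.region, sum(ot.revenue) AS revenue
    FROM order_totals ot
    JOIN customers c ON c.id = ot.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC;               -- sort only the final, small result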