Comment by fifilura

3 months ago

It may be true, until you do your ETL in an index-less database such as BigQuery or Trino. Postgres will always be faster for optimized, end-user-serving queries.

But BigQuery lets you scale out to hundreds of CPUs without having to worry about indexes.

This sounds awful.

I would do almost any amount of iteration and index tuning to keep the query on a single machine rather than deal with a networked distributed system.

When you get slow queries, the real problem is usually algorithmic complexity, and scaling out linearly across workers can only do so much; a hundred workers won't rescue a join that should have been an indexed lookup.

Yes, I'm talking about end user queries. Not reports that take 2 hours to run.

But even with BigQuery, you've still got to worry about partitioning and clustering, and yes, they've even added indexes now.
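
For instance, a partitioned and clustered table plus one of the newer search indexes looks roughly like this (BigQuery DDL; the dataset, table, and column names are made up):

    CREATE TABLE mydataset.events
    PARTITION BY DATE(event_ts)        -- partition pruning on the timestamp
    CLUSTER BY user_id                 -- clustering narrows scans on the filter/join key
    AS SELECT * FROM mydataset.raw_events;

    CREATE SEARCH INDEX events_search_idx
    ON mydataset.events (ALL COLUMNS); -- the index feature mentioned above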

The only time you really just get to think in sets is when performance doesn't matter at all and you don't mind if your query takes hours. Which maybe is your case.

But also: the issue generally isn't CPU, it's communication and bandwidth. If you're joining 10 million rows to 10 million rows, the two biggest things that matter are whether those rows live on the same machine and whether you're joining on an index. The problem isn't CPU-bound, so more CPUs won't help much.
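
On a single Postgres box, checking whether that 10-million-to-10-million join actually uses an index is just a matter of looking at the plan (table and column names here are hypothetical):

    -- index the join key, then inspect the plan
    CREATE INDEX IF NOT EXISTS orders_customer_id_idx ON orders (customer_id);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT c.id, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id;
    -- with the index available, the planner can choose a merge or index-based
    -- join instead of hashing both 10M-row tables, and everything stays local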

  • Of course there are optimizations to be made, such as not joining on the raw data, saving the ORDER BY for last, and avoiding outer joins between two large partitioned tables (see the sketch below).

    But to me those optimizations are not imperative in nature.

    (And BQ will probably eat the 10-million-to-10-million join for breakfast...)
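
    Roughly what those optimizations look like as a sketch (BigQuery-flavored SQL, with made-up table and column names):

        -- reduce the raw data before the join; sort only at the very end
        WITH daily AS (
          SELECT user_id, DATE(event_ts) AS day, COUNT(*) AS events
          FROM raw_events
          GROUP BY user_id, day
        )
        SELECT u.name, d.day, d.events
        FROM daily AS d
        JOIN users AS u ON u.id = d.user_id   -- join the small aggregate, not raw_events
        ORDER BY d.events DESC                -- ORDER BY saved for last
        LIMIT 100;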