Comment by mocamoca

1 year ago

Question for people writing highly complex SQL queries.

Why not write simple SQL queries and use another language to do the transformations?

Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries? Doesn't working without reusable blocks / tests / logs make development harder?

Syntax is one thing, but isn't actual performance (and safety/maintenance) another matter entirely?

Performance. A million times performance.

If I need to join a million rows to another table and then calculate an aggregate value, and do it all in a single query, it's fast. It might only take 0.01 seconds.
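A minimal sketch of what I mean, assuming hypothetical `orders` and `customers` tables: the join and the aggregate both happen inside the engine, and only the small summary ever crosses the wire.

```sql
-- Hypothetical schema: customers(id, region), orders(customer_id, amount).
-- The engine joins a million order rows to customers and aggregates them
-- in one pass; only the per-region summary is sent back to the client.
SELECT c.region,
       COUNT(*)      AS order_count,
       SUM(o.amount) AS total_amount
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
GROUP  BY c.region;
```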

Whereas if I make separate queries to retrieve a million rows from one table and a million rows from another, it's incredibly slow just due to the data transfer, no matter how much conceptually easier another language may be. It might take 30 seconds of repeated serialization, network transfer, and deserialization.
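The slow pattern looks something like this (same hypothetical tables as above):

```sql
-- Anti-pattern sketch: pull both tables in full and join client-side.
-- Each statement ships on the order of a million rows over the wire
-- before any real work can happen in the application.
SELECT * FROM orders;
SELECT * FROM customers;
-- ...then join and aggregate in application code.
```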

And even if you're looking up individual rows (not a million), doing lots of joins in the application means lots of round-trips to the database, which multiplies latency and load -- so even if you can still get your final result quickly (e.g. 0.01 seconds rather than 0.001 seconds), the load you can handle drops by the same multiple (e.g. you can now only serve 10% as many users).

The general rule of thumb with databases is that they can be super-fast when everything is done on the database machine in a single query and your final result is a small amount of data (assuming everything is indexed properly and the query is written properly). But they become super-slow when you run a bunch of queries in a row, where those intermediate queries return massive amounts of data to the client, or (even worse) require sending massive amounts of data back to the database.

I've worked on a few SQL systems used for analytics and ETL.

My users fell into (for the purposes of this discussion) three categories:

1. Analysts who prefer sheets

2. Data scientists who prefer pandas

3. Engineers who prefer C++/Java/JavaScript/Python

I'm fairly sure SQL isn't the first choice for any of them, but in all three cases a modern vectorized SQL engine will be the fastest option for expressing and executing many analysis and ETL tasks, especially when the datasets don't fit on a single machine. It's also easier to provide a shared pool of compute to run SQL than arbitrary code, especially with low latency.

Even as a query engine developer, I would prefer using a SQL engine. Performing even the basic optimizations a modern engine would perform -- columnar execution, predicate pushdown, pre-aggregation for shuffles, etc. -- would be at least a week of work for me. A bit less if I built up a large library to assist.

I'm not interested in writing the code to load all the data into memory, or to swap it out when it doesn't fit. And since it may be a lot of data, even moving it over the network is expensive.

I have tests. I have reusable blocks (SQL functions, WITH blocks and views). I don't have logging though.
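All three kinds of reusable block, as a sketch in Postgres-flavored SQL with hypothetical names:

```sql
-- A reusable SQL function (hypothetical normalization rule).
CREATE FUNCTION normalize_email(raw text) RETURNS text
  LANGUAGE sql IMMUTABLE
  AS $$ SELECT lower(trim(raw)) $$;

-- A view as a named, reusable query (assumes a users table).
CREATE VIEW active_users AS
SELECT id, normalize_email(email) AS email
FROM   users
WHERE  deleted_at IS NULL;

-- A WITH block (CTE) as a local reusable block inside one query.
WITH emails AS (
  SELECT email FROM active_users  -- builds on the view above
)
SELECT count(DISTINCT email) FROM emails;
```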

I can put the result in a non-materialized view and have it update in real time as the data changes. Or I can toggle it to materialized view and now it's snapshotted data.
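A sketch of that toggle (Postgres syntax, hypothetical `daily_totals`):

```sql
-- Plain view: recomputed on every read, always reflects current data.
CREATE VIEW daily_totals AS
SELECT order_date, SUM(amount) AS total
FROM   orders
GROUP  BY order_date;

-- Materialized view: a snapshot, refreshed explicitly when you choose.
CREATE MATERIALIZED VIEW daily_totals_snap AS
SELECT order_date, SUM(amount) AS total
FROM   orders
GROUP  BY order_date;

REFRESH MATERIALIZED VIEW daily_totals_snap;
```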

Finally, views that depend on views that depend on views get automatically optimized by the query planner. You need a lot of very tricky custom code to start approaching that.
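For instance (hypothetical names, Postgres shown), the planner inlines the whole view stack rather than executing it layer by layer:

```sql
CREATE VIEW v1 AS SELECT * FROM orders WHERE amount > 0;
CREATE VIEW v2 AS SELECT customer_id, amount FROM v1;
CREATE VIEW v3 AS SELECT customer_id, SUM(amount) AS total
                 FROM v2 GROUP BY customer_id;

-- In Postgres the plan typically collapses to a single scan of orders
-- with both predicates applied there, not three layered subqueries.
EXPLAIN SELECT * FROM v3 WHERE customer_id = 42;
```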

> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

As others have said, yes. In most cases, the more complex the query, the bigger the engine's advantage.

> Doesn't working without reusable blocks / tests / logs make development harder?

SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data you want, and then the RDBMS goes off and finds the most performant way to retrieve the data that matches that shape. It doesn't compare well with procedural programming.

The closest "languages" comparable to SQL are HTML and CSS. However you manage HTML and CSS is also how you can manage SQL.

  • >> Doesn't working without reusable blocks / tests / logs make development harder?

    > SQL isn't a programming language, it's a query language. You're effectively writing a single expression that describes the shape of the data that you want...

    Exactly this. Generally speaking, your SQL queries won't have "bugs" the way that you can create bugs when writing a function.

    The challenging parts of building a complex query are usually 1) getting it to function at all (just being a valid SQL statement for your tables that gives you your desired output fields), and 2) making sure it runs performantly on realistically large table sizes (usually in milliseconds as opposed to seconds), which may involve rewriting things like joins vs. subqueries and/or adding indexes (see the sketch below).

    A lot of bugs in functions come from edge cases or different combinations of paths through code or unexpected combinations of parameter values or math formula errors or whatever... but a SQL query won't usually really have any of those things. It's just a single transformation that basically either works or doesn't.
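    A sketch of part 2 (hypothetical table, Postgres syntax): the same query goes from a sequential scan to an index lookup once the right index exists.

    ```sql
    -- Before: EXPLAIN ANALYZE shows a sequential scan over all of orders.
    EXPLAIN ANALYZE
    SELECT * FROM orders WHERE customer_id = 42;

    -- Adding an index on the filter column typically lets the planner
    -- switch to an index scan, taking the query from seconds to milliseconds.
    CREATE INDEX orders_customer_id_idx ON orders (customer_id);
    ```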

One reason SQL has become more popular lately is as an API for map/reduce.

Before, you would write a Java/C++ class that did the map/reduce job for you, distributed over hundreds of CPUs. And you would feel like you were on the bleeding edge doing innovative stuff.

Turns out that SQL is a perfect API for map/reduce.

Everything you write in SQL can be transformed into a massively parallel job, and you don't even have to know about it.
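A sketch of the correspondence, assuming a hypothetical `events` table; the engine is free to run each clause as a distributed stage:

```sql
-- "map":     scan and filter partitions of events in parallel
-- "shuffle": repartition the matching rows by user_id
-- "reduce":  aggregate each partition independently
SELECT user_id, COUNT(*) AS n_events
FROM   events
WHERE  event_type = 'click'
GROUP  BY user_id;
```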

This is the secret behind BigQuery and Trino/Presto/Athena.

  • If your data is relational, that is. SQL can get pretty ugly with semi-structured data, graph traversal, etc.

> Why not write simple SQL queries and use another language to do the transformations?

> Are SQL engines really more efficient at filtering/matching/aggregating data when doing complex queries?

Yes. With knowledge of the data (like indices) and statistics about the data, it is usually much more efficient than piping the data to another process to do the same job.

Of course, that requires you to write a good query and to know how to tell what makes a good query, but if your data is in a relational database it is usually more efficient to do your filtering/matching/aggregating there.
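The usual way to tell is `EXPLAIN` (or `EXPLAIN ANALYZE` in Postgres); a sketch with the same hypothetical tables as above:

```sql
-- Shows the plan the engine chose: which indexes it uses, the join
-- strategy, and (with ANALYZE) actual row counts and timings.
EXPLAIN ANALYZE
SELECT c.region, SUM(o.amount)
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
WHERE  o.created_at >= now() - interval '7 days'
GROUP  BY c.region;
```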