Comment by geertj

1 year ago

> It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output.

Maybe I used the wrong term. In my mental model, the query planner decides the order in which the query is evaluated based on what table stats predict is most efficient query plan, and I actually don't really want to think about the order too much. For example, if I create a CTE, I don't necessarily want it to be executed in that order. Maybe a condition on the later query can be pushed back into the earlier CTE so that less data can be scanned.

I will admit that technically there should be no difference in how a query planner handles either. But to me the pipe syntax does not hint as much at these non-linear optimizations than CTEs do. I called the CTE syntax more functional as it implies less to me.

> but it has the advantage of making the execution order obvious.

So we're back to ergonomics which I just never had an issue with...

> As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items.

Could I not compute the stats over all values, then order and limit them, and depend on the query planner to not do the stat calculation for items outside the limit? If the order/limit does not depend on a computed statistic that should be possible? Or does that not happen in practice?

2 comments

geertj

tmoertel 1 year ago

No, the wanted stats are a function of the top 100 items.

As a concrete example, consider computing the average sales volume by category for the top 100 items. Here's the vanilla SQL for it:

    WITH
      TopItems AS (
        SELECT category, sales_volume
        FROM Items
        ORDER BY sales_volume DESC
        LIMIT 100
      )
    SELECT category, AVG(sales_volume) AS avg_sales_volume
    FROM TopItems
    GROUP BY category;

Because ORDER/LIMIT processing is implicitly last in vanilla SQL, if you need to do anything after that processing, you must do it in a new SELECT statement. Thus you must capture the ORDER/LIMIT results (e.g., as a CTE or, heaven forbid, as a nested SELECT statement) and then wire those results into that new SELECT statement via its FROM clause.

In contrast, with SQL pipes you can express any ordering you want, so you can feed the ORDER/LIMIT results directly into the statistical computations:

    FROM Items
    |> ORDER BY sales_volume DESC
    |> LIMIT 100
    |> AGGREGATE AVG(sales_volume) AS avg_sales_volume
       GROUP BY category

That's way simpler and the data flows just as it reads: from top to bottom.

geertj 1 year ago

Okay, thanks for that example. The ability of the pipe syntax to re-order the standard SQL pipeline order does indeed provide for more compact queries in this case.