Comment by geertj

2 years ago

Understood. I should have asked my question a bit more specifically: what's wrong with CTEs that wouldn't be an issue with this new pipe syntax? I briefly scanned the paper and it appears there aren't any specific benefits to the pipe syntax that would make optimization easier. So we can expect that if a SQL engine doesn't optimize CTEs well it would likely have the same limitations for the pipe syntax.

In Section 2.1.4 the paper lists the benefits of the pipe syntax over CTEs, and they are all based on ergonomics. As someone who has never had issues with the ergonomics of CTEs I must say I am not convinced that the proposed syntax is better. It may be that I've been doing SQL for so long that I don't see its warts. Overall SQL feels like a very well designed and consistent language to me. The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.

tmoertel 2 years ago

> The new pipe syntax appears to bolt on an imperative construct to an otherwise purely functional language.

It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output. It's still purely functional, but it has the advantage of making the execution order obvious. That is, the order is a linear top-down flow, not the inside-out flow implicit in vanilla SQL. Further, when your wanted flow doesn't match vanilla SQL's implicit ordering, you don't have to invent CTEs to wire up your flow. You just express it directly.

As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items. You must create a CTE to hold the top items and then wire it into a second SELECT statement to compute the stats. That's busywork. With pipe syntax, there's no need to invent that intermediate CTE.

geertj 2 years ago
> It's not imperative. The pipe symbol is a relational operator that takes one table as input and produces one as output.
Maybe I used the wrong term. In my mental model, the query planner decides the order in which the query is evaluated based on what table stats predict is the most efficient query plan, and I actually don't really want to think about the order too much. For example, if I create a CTE, I don't necessarily want it to be executed in that order. Maybe a condition on the later query can be pushed back into the earlier CTE so that less data needs to be scanned.
I will admit that technically there should be no difference in how a query planner handles either. But to me the pipe syntax does not hint as much at these non-linear optimizations as CTEs do. I called the CTE syntax more functional as it implies less to me.
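
To make the pushdown idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Items columns, the Totals CTE, and the sample rows are assumptions made up for illustration. The sketch does not show what any planner actually does; it only checks that a filter on the grouping key returns the same rows whether it is written after the CTE or moved inside it, which is the equivalence that makes such a rewrite legal.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, category TEXT, sales_volume INTEGER)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [("a", "toys", 10), ("b", "toys", 30), ("c", "books", 20), ("d", "books", 40)],
)

# Filter written on the outer query, after the CTE has aggregated everything.
outer_filter = """
WITH Totals AS (
  SELECT category, SUM(sales_volume) AS total
  FROM Items
  GROUP BY category
)
SELECT category, total FROM Totals WHERE category = 'toys'
"""

# The same filter pushed by hand into the CTE, before aggregation,
# so that only the matching rows need to be read and grouped.
pushed_down = """
WITH Totals AS (
  SELECT category, SUM(sales_volume) AS total
  FROM Items
  WHERE category = 'toys'
  GROUP BY category
)
SELECT category, total FROM Totals
"""

rows_outer = conn.execute(outer_filter).fetchall()
rows_pushed = conn.execute(pushed_down).fetchall()
print(rows_outer, rows_pushed)
```

The equivalence holds here because the filter is on the grouping key. A filter on an aggregated value such as total could not be moved below the GROUP BY in the same way.
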
> but it has the advantage of making the execution order obvious.
So we're back to ergonomics which I just never had an issue with...
> As for ergonomics, consider a simple task: Report some statistics over the top 100 items in a table. Since LIMIT/ORDER processing is last in vanilla SQL's implied ordering, you can't directly compute the stats over the top items.
Could I not compute the stats over all values, then order and limit them, and depend on the query planner to not do the stat calculation for items outside the limit? If the order/limit does not depend on a computed statistic that should be possible? Or does that not happen in practice?
- tmoertel 2 years ago
  
  No, the wanted stats are a function of the top 100 items.
  As a concrete example, consider computing the average sales volume by category for the top 100 items. Here's the vanilla SQL for it:
      WITH TopItems AS (
        SELECT category, sales_volume
        FROM Items
        ORDER BY sales_volume DESC
        LIMIT 100
      )
      SELECT category, AVG(sales_volume) AS avg_sales_volume
      FROM TopItems
      GROUP BY category;
  Because ORDER/LIMIT processing is implicitly last in vanilla SQL, if you need to do anything after that processing, you must do it in a new SELECT statement. Thus you must capture the ORDER/LIMIT results (e.g., as a CTE or, heaven forbid, as a nested SELECT statement) and then wire those results into that new SELECT statement via its FROM clause.
  In contrast, with SQL pipes you can express any ordering you want, so you can feed the ORDER/LIMIT results directly into the statistical computations:
      FROM Items
      |> ORDER BY sales_volume DESC
      |> LIMIT 100
      |> AGGREGATE AVG(sales_volume) AS avg_sales_volume GROUP BY category
  That's way simpler and the data flows just as it reads: from top to bottom.
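
The difference between "stats over the top items" and "stats over all items" can be checked with a small sketch using Python's built-in sqlite3 module. The sample rows are invented, the limit is 3 rather than 100 to keep the data small, and an ORDER BY category is added to the outer query only to make the output deterministic. It uses the CTE form because that runs in SQLite; the pipe form assumes an engine that implements it.

```python
import sqlite3

# Hypothetical sample data; only the Items table and its two columns
# come from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (category TEXT, sales_volume INTEGER)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?)",
    [("toys", 100), ("toys", 10), ("books", 90), ("books", 80), ("games", 5)],
)

top_n = 3  # stands in for the 100 used in the thread

# ORDER/LIMIT first (inside the CTE), then aggregate what survives.
top_items_avg = conn.execute(
    """
    WITH TopItems AS (
      SELECT category, sales_volume
      FROM Items
      ORDER BY sales_volume DESC
      LIMIT ?
    )
    SELECT category, AVG(sales_volume) AS avg_sales_volume
    FROM TopItems
    GROUP BY category
    ORDER BY category
    """,
    (top_n,),
).fetchall()

# Aggregate over every row instead: a different question, different answer.
all_items_avg = conn.execute(
    """
    SELECT category, AVG(sales_volume) AS avg_sales_volume
    FROM Items
    GROUP BY category
    ORDER BY category
    """
).fetchall()

print(top_items_avg)  # [('books', 85.0), ('toys', 100.0)]
print(all_items_avg)  # [('books', 85.0), ('games', 5.0), ('toys', 55.0)]
```

With this data the toys average is 100.0 over the top three items but 55.0 over all items, and games drops out entirely, so no ordering or limiting applied after the full aggregation could recover the first result.
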
  