Comment by andy800

1 year ago

The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.

In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.

  • I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.

    • I also wasn't there, but I think this actually wasn't to help authors and instead was a workaround for the warts of SQL. It's a pain to write

          SELECT * FROM (SELECT * FROM ... GROUP BY ...) t WHERE ...
      

      and they decided this was common enough that they would introduce a HAVING clause for this case

          SELECT * FROM ... GROUP BY ... HAVING ...
      

      But the real issue is that in order to make operations in certain orders, SQL requires you to use subselects, which require restating a projection for no reason and a lot of syntactical ceremony. E.g. you must give the FROM item a name (t), but it's not required for disambiguation.

      Another common case is projecting before the filter. E.g. you want to reuse a complicated expression in the SELECT and WHERE clauses. Standard SQL requires you to repeat it or use a subselect since the WHERE clause is evaluated first.

    • I think this stems from the non-linear approach to reading a SQL statement. If it were top-to-bottom linear, like PRQL, then the distinction does not seem merited. It would then always be filtering from what you have collected up to this line.

    • I think the original sin here is not making aggregation an explicitly separate thing, even though it should be. Adding a count(*) fundamentally changes what the query does, and what it returns, and what restrictions apply.

You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.

You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.

Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.

But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).

So all in all I agree with you: keep HAVING.

> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.

Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.

  • Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.

    • A lot of SQL engines don't support aliases in the HAVING clause and that can require duplication of potentially complex expressions which I find very bug-inducing. Removing duplication and using proper naming I think would be much better.

      I will already use subqueries to avoid issues with HAVING.

      7 replies →

Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?

To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.