Comment by Ciantic
1 year ago
Here is one example from the PDF:
FROM r JOIN s USING (id)
|> WHERE r.c < 15
|> AGGREGATE sum(r.e) AS s GROUP BY r.d
|> WHERE s > 3
|> ORDER BY d
|> SELECT d, s, rank() OVER (order by d)
Can we call this SQL anymore after this? This re-ordering of things has been done by others too, like PRQL, but they didn't call it SQL. I do think it makes things more readable.
The point of SQL pipe syntax is that there is no reordering. You read the query as a sequence of operations, and that's exactly how it's executed. (Semantically. Of course, the query engine is free to optimize the execution plan as long as the semantics are preserved.)
The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins:
From the paper:
> Each pipe operator is a unary relational operation that takes one table as input and produces one table as output.
Vanilla SQL is actually more complex in this respect because you have, for example, at least 3 different keywords for filtering (WHERE, HAVING, QUALIFY) and everyone who reads your query needs to understand what each keyword implies regarding execution scheduling. (WHERE is before grouping, HAVING is after aggregates, and QUALIFY is after analytic window functions.)
Golly, QUALIFY, a new SQL operator I didn’t know existed. I tend not to do much with window functions and I would have reached for a CTE instead but it’s always nice to be humbled by finding something new in a language you thought you knew well.
Is not common at all, is a non ANSI SQL clause that afaik was created by Teradata, syntactic sugar for filtering using window functions directly without CTEs or temp tables, especially useful for dedup. In most cases at least, for example you can't do a QUALIFY in an query that is aggregating data just as you can't use a window function when aggregating.
Other engines that implement it are direct competitors in that space: Snowflake, Databricks SQL, BigQuery, Clickhouse, and duckdb (only OSS implementation I now). Point is: if you want to compete with Teradata and be a possible migration target, you want to implement QUALIFY.
Anecdote: I went from a company that had Teradata to another where I had to implement all the data stack in GCP. I shed tears of joy when I knew BQ also had QUALIFY. And the intent was clear, as they also offered various Teradata migration services.
> The point of SQL pipe syntax is that there is no reordering.
But this thing resembles other FROM-clause-first variants of SQL, thus GP's point about this being just a reordering. GP is right: the FROM clause gets re-ordered to be first, so it's a reordering.
> The pipe operator is a semantic execution barrier:everything before the `|>` is assumed to have executed and returned a table before what follows begins
I already think about SQL like this (as operation on lists/sets), however thinking of it like that, and having previous operations feed into the next, which is conceptually nice, seems to make it hard to do, and think about:
> *(the query engine is free to optimize the execution plan as long as the semantics are preserved)
since logically each part between the pipes doesn't know about the others, so global optimizations, such as use of indexes to restrict the result of a join based on the where clause can't be done/is more difficult.
This kind of implies there's better or worse ordering. AFAIK that's pretty subjective. If the idea was to expose how the DB is ordering things, or even make things easier for autocomplete OK, but this just feels like a "I have a personal aesthetic problem with SQL and I think we should spend thousands of engineering hours and bifurcate SQL projects forever to fix it" kind of thing.
> Vanilla SQL [...] QUALIFY is after analytic window functions
Isn't that FILTER (WHERE), as in SELECT avg(...) FILTER (WHERE ...) FROM ...?
This is an interesting point.
All these years I've been doing that reordering and didn't even realize!
> The point of SQL pipe syntax is that there is no reordering.
If you're referring to this in the comment you're replying to:
> Can we call this SQL anymore after this? This re-ordering of things ...
Then they're clearly just saying that this is a reordering compared to SQL, which is undeniably true (and the while point).
The post I was referring to said that this new pipe syntax was a big reordering compared to the vanilla syntax, which it is. But my point is that if you're going to understand the vanilla syntax, you already have to do this reordering in your head because the order in which the the vanilla syntax executes (inside out) is the order in which pipes syntax reads. So it's just easier all around to adopt the pipe syntax so that reading and execution are the same.
'qualify' is now standard? Thought it was a vendor extension currently.
This is an extension on top of all existing SQL. The pipe functions more or less as a unix pipe. There is no reordering, but the user selects the order. The core syntax is simply:
Which results in a new query that can be piped again. So e.g. this would be valid too:
Personally, I can see this fix so much SQL pain.
okay, now I can see why this so much reminds of CTE
The multiple uses of WHERE with different meanings is problematic for me. The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Not sure if this is an attempt to simplify things or an oversight, but favoring convenience (no need to remember multiple keywords) over explicitness (but the keywords have different meanings) tends to cause problems, in my observation.
In the query plan, filtering before or after an aggregation is the same, so it's a strange quirk that SQL requires a different word.
I was not there at the original design decisions of the language, but I imagine it was there specifically to help the person writing/editing the query easily recognize and interpret filtering before or after an aggregation. The explicitness makes debugging a query much easier and ensures it fails earlier. I don't see much reason to stop distinguishing one use case from the other, I'm not sure how that helps anything.
3 replies →
Indeed. Just as I think git’s N different ways to refer to the same operation was a blunder.
5 replies →
You can always turn a HAVING in SQL into a WHERE by wrapping the SELECT that has the GROUP BY in another SELECT that has the WHERE that would have been the HAVING if you hadn't bothered.
You don't need a |> operator to make this possible. Your point is that there is a reason that SQL didn't just allow two WHERE clauses, one before and one after GROUP BY: to make it clearer syntactically.
Whereas the sort of proposal made by TFA is that if you think of the query as a sequence of steps to execute then you don't need the WHERE vs. HAVING clue because you can see whether a WHERE comes before or after GROUP BY in some query.
But the whole point of SQL is to _not have to_ think of how the query is to be implemented. Which I think brings us back to: it's better to have HAVING. But it's true also that it's better to allow arbitrary ordering of some clauses: there is no reason that FROM/JOIN, SELECT, ORDER BY / LIMIT have to be in the order that they are -- only WHERE vs. GROUP BY ordering matters, and _only_ if you insist on using WHERE for pre- and post-GROUP BY, but if you don't then all clauses can come in any order you like (though all table sources should come together, IMO).
So all in all I agree with you: keep HAVING.
> The second WHERE, filtering an aggregate, would be HAVING in standard SQL.
Only if you aren't using a subquery otherwise you would use WHERE even in plain SQL. Since the pipe operator is effectively creating subqueries the syntax is perfectly consistent with SQL.
Perhaps, however then you eliminate the use of WHERE/HAVING sum(r.e) > 3, so in case you forgot what the alias s means, you have to figure that part out before proceeding. Maybe I'm just used to the existing style but as stated earlier, seems this is reducing explicitness which IMO tends to lead to more bugs.
8 replies →
Should we introduce a SUBSELECT keyword to distinguish between a top-level select and a subquery?
To me that feels as redundant as having WHERE vs HAVING, i.e. they do the same things, but at different points in the execution plan. It feels weird to need two separate keywords for that.
The proposal here adds pipe syntax to SQL.
So it would be reasonable to call it SQL, if it gets traction. You want to see some of the big dogs adopting it.
That should at least be possible since it looks like it could be added to an existing implementation without significant disruption/extra complexity.
There may be trademark issues, but even if not, doing sufficient violence to the original thing argues for using a new name for the new thing.
Yes, having |> isn't breaking SQL but rather enhancing it.
I really like this idea of piping SQL queries rather than trying to create the perfect syntax from the get go.
+1 for readability too.
Honestly, it seems like a band-aid on legacy query language.
SQL a legacy query language?
In order for a thing to be considered legacy, there needs to be a widespread successor available.
SQL might have been invented in the 70s but it's still going strong as no real alternative has been widely adopted so far - I'd wager that you will find SQL at most software companies today.
Calling it legacy is not realistic IMO.
6 replies →
Not bad, very similar to dplyr syntax. Personally i’m too used to classic SQL though and this would be more readable as CTEs. In particular how would this syntax fair if it was much more complicated with with 4-5 tables and joins?
IMO having SELECT before FROM is one of SQL's biggest mistakes. I would gladly welcome a new syntax that rectifies this. (Also https://duckdb.org/2022/05/04/friendlier-sql.html)
I don't love the multiple WHEREs.
Duckdb also supports prql with an extension https://github.com/ywelsch/duckdb-prql
SQL was supposed to follow English grammar. Having FROM before SELECT is like having “Begun” before “these clone wars have.”
That's a great list of friendlier sql in DuckDB. For most of that list I either run into it regularly or have wanted the exact fix they have.
duckDB is what sql should be in 2024
https://duckdbsnippets.com/
The very first example on that page is vulnerable to injection.
3 replies →
> Can we call this SQL anymore after this?
Maybe not, just as we don't call "rank() OVER" SQL. We call it SQL:2003. Seems we're calling this GoogleSQL. But perhaps, in both cases, we can use SQL for short?
You show a good example. Many people would call that SQL, and if pipes become popular, they too might simply be called SQL one day.
> GoogleSQL
EssGyooGell: A Modest Proposal
this is consistent, non-pseudo-english, reusable, and generic. The SQL standard largely defines the aesthetic of the language, and is in complete opposition to these qualities. I think would be fundamentally incorrect to call it SQL
Perhaps if they used a keyword PIPE and used a separate grammar definition for the expressions that follow the pipe, such that it is almost what you’d expect but randomly missing things or changes up some keywords
Yes we can call it sql.
Language syntax changes all the time. Their point is that sql syntax is a mess and can be cleaned up.
In that example, "s" has two meanings: 1. A table being joined. 2. A column being summed.
For clarity, they should have assigned #2 to a different variable letter.
Honestly SQL screwed things up from the very beginning. "SELECT FROM" makes no sense at all. The projection being before the selection is dumb as hell. This is why we can’t get proper tooling for writing SQL, even autocompletion can’t work sanely. You write "SELECT", what’s it gonna autocomplete?
PRQL gives me hope that we might finally get something nice some day
The initial version of SQL was called "Structured English Query Language".
If the designers intended to create a query language that resembled an English sentence, it makes sense why they chose "SELECT FROM".
"Select the jar from the shelf" vs. "From the shelf, select the jar".
“Go to the shelf and select the jar”. You’re describing a process, so it’s natural to formulate it in chronological order.
1 reply →
FROM clauses aren't required, and using multiple tables in FROM doesn't seem to work out too well when that syntax is listed first.
Doesn’t change anything, you can still have the select at the end, and optional from and joins at the beginning. In your example, the select could be at the end, it’s just that there’s nothing before.
Beginning with Oracle Database Release 23 [released May 2, 2024], it is now optional to select expressions using the FROM DUAL clause.
WITH clauses are optional and appear before SELECT. No reason why the FROM clause couldn't behave the same
1 reply →
I also hate having SELECT before FROM because I want to think of the query as a transformation that can be read from top to bottom to understand the flow.
But I assume that that’s part of why they didn’t set it up that way — it’s just a little thing to make the query feel more declarative and less imperative
> what’s it gonna autocomplete?
otoh if you selected something the from clause and potentially some joins could autocomplete
Not reliably, especially if you alias tables. Realistically, you need to know what you’re selecting from before knowing what you’re selecting.
At this point I think that vanilla SQL should just support optionally putting the from before the select. It's useful for enabling autocompletion, among other things.
And a simple keyword that does a GROUP BY on all columns in select that aren't aggregates, just a syntax level macro-ish type of thing.
My initial reaction is that the pipes are redundant (syntactic vinegar). Syntactic order is sufficient.
The changes to my SQL grammar to accomodate this proposal are minor. Move the 'from' rule to the front. Add a star '*' around a new filters rule (eg zero-or-more, in any order), removing the misc dialect specific alts, simplifying my grammar a bit.
Drop the pipes and this would be terrific.
In the example would there a difference between `|> where s > 3` and `|> having s > 3` ?
Edit: nope, just that you don't need having to exist with the pipe syntax.
Looking at this reminds me of Apache Pig. That’s not a compliment.
We can call it "Linq2SQL" and what a disaster it was...
What's the SQL equivalent of this?