Comment by snthpy
2 years ago
This comes up every time PRQL makes it onto HackerNews and is a fair question:
Short answer: DX
Slightly longer answer: Developer productivity and experience, especially for EDA and interactively writing complex analytical queries.
Most people that have tried PRQL just find it more convenient to write their analytical queries in it. PRQL compiles to SQL so it can't express anything you can't already do in SQL, but you can probably express yourself much faster in PRQL.
Just try the following query in the online PRQL Playground (https://prql-lang.org/playground/) to find the longest track per album:
```prql
from tracks
group album_id (
sort {-milliseconds}
take 1
)```
How long would it take you to write the SQL for that?
Disclaimer: I'm a PRQL contributor.
Reformatted for readability (indent code blocks with two spaces):
Editorializing:
Data query specification is all about getting the details right. This does not look simpler than the corresponding SQL to me though. All components must be present -- scope, group, limit, order.
SQL, for all its faults, is generally succinct at incorporating the required details. The PRQL sample here is succinct as well, but to me at least, not differentiating.
> How long would it take you to write the SQL for that?
I don't want to appear rude, but unless I'm missing something, this is a pretty simple SQL query, of the kind anyone with mimimal SQL experience could write off the top of their head in seconds.
I like the idea of PRQL, but I think a better example is needed to sell it.
I have a moderate amount of SQL experince, but I could not write that query at the top of my head. Maybe you misunderstand what the PRQL query is doing?
Here's the SQL it generates:
WITH table_0 AS ( SELECT , ROW_NUMBER() OVER ( PARTITION BY album_id ORDER BY milliseconds DESC ) AS _expr_0 FROM tracks ) SELECT FROM table_0 WHERE _expr_0 <= 1
If I understand PRQL correctly, it finds the longest song for each album? A simple concept, but not a simple MySQL query.
SELECT DISTINCT ON (album_id) * FROM tracks GROUP BY album_id ORDER BY milliseconds DESC;
For those unfamiliar with Postgres, DISTINCT ON takes only one row for each of the groups based on the supplied columns. So in this case, it will return only one row per album_id.
Without an ORDER BY, DISTINCT ON chooses a random row (not actually, but you can’t rely on it.) Since we ORDER BY milliseconds DESC, the first row of each group will be the longest one.
2 replies →
It's a pretty common pattern (take the top N by group with some order), so I'd expect that it's a familiar pattern to a lot of people who regularly do analytical queries in SQL. It's clearer with formatting/naming:
But it is still super ugly for such a common need. If I were to add syntax to make this kind of thing easier, I'd just go for a syntax that made something like this valid:
Which apparently the QUALIFY statement does in a few dialects.
This "find the latest row for each <column>" query is kind of a poster-child for seemingly-simple but actually difficult to get right/performant sql.
E.g. see: https://stackoverflow.com/questions/1313120/retrieving-the-l...
I think the logic here is: SQL is hard and most people don't know it well. So PRQL is perhaps easier to learn.
The same logic is applied to TypeScript vs JavaScript. Or C vs assembly. Or Nano vs vim. Etc etc.
It's a quandary though. SQL is clearly difficult for most people to get their heads around. It does require a different way of thinking about data, and you can get by with a minimal SQL knowledge for a long time, especially nowadays with ORMs.
But like so many other things, making the investment is worthwhile and pays off in small and large ways, forever (so far).
> making the investment is worthwhile and pays off in small and large ways
And... 'making the investment' takes time, and means that time is not able to be invested someplace else.
If the majority of your job is writing SQL or similar (data access, etc) then sure - yes, learn more of those tools. Some folks have a wider range of responsibilities that means you have to decide what to make more time investments in, and saying 'yes' to something is necessarily saying 'no' to other things.
1 reply →
The hard part of SQL is thinking relationally. This obviously doesn’t do anything to affect that — if they actually thought this, they’d be making the same mistake as SQL itself “the reason business users don’t program is because it’s not english enough”
The problem this is resolving, if it successfully resolves anything at all, is that the SQL language is a mess of random keywords, inconsistent syntax requirements and generates some of the worst error messages known to man. It’s an attempt at making SQL a consistent, simple language — ideally exposing the data model more directly and with less noise
2 replies →
I'm not a PRQL fun. But TBH I can't write this SQL from my head. I have thousand lines of written SQL.
This seems like a micro-optimization to me. That's faster to write, but you pay for that with extra tooling and an abstraction layer that you now have to train or hire for, and I'm not sure that's a good payoff.
This idea seems like it'd be better as an editor plugin that lets you write shorthand and have it automatically expanded into correct SQL rather than as a build time thing.
You can apply this logic to choosing a high level language vs writing assembly code.
Yes the pipeline is more complex, there are more tools and more syntax to track, but the benefits are pretty clear (or we'd all be writing UI code in hand written asm)
While leaky abstractions are a huge problem, the thing about abstractions is that, if they are any good, the benefits and improvements to productivity outweigh the negatives. You just have to figure out if the gains PRQL could give you are worth the effort.
I rarely write SQL, so it's not worth it for me. But if PRQL were an actual query engine, not just a translation layer, and some database offered a native PRQL interface, I would immediately switch to it rather than to keep twisting my brain with SQL and it's inane syntax and rules.
(I was a full time DBA in one previous life, so I should be more comfortable with SQL than most.)
I don't think the analogy of with js/etc and assembly is quite fair, the difference in readability between this and sql isn't on that order, and if it were I'd be a lot more bullish on this project :)
> How long would it take you to write the SQL for that?
Eh, `DISTICT ON` is a custom PostgreSQL extension.
A standard* method would be:
But the QUALIFY clause is so new that it doesn't work on most RDBMSs. If you're on MS SQL Server, you're still using:
That said, I still don't think PRQL is particularly amazing. I can't tell if it's merely syntactic sugar for SQL, or if it's actually meant to control query execution. If it's the former, it's likely to frustrate developers because it's actually just another layer of abstraction. If it's the latter, then it requires the developer to not only understand the data model well enough to be able to write SQL queries, they need to be able to understand the RDBMS impementation details well enough to be able to write queries that best take advantage of the current database's indexes, statistics, and configuration. Even something as simple as sorting before filtering or projecting can be a significant performance issue. Nevermind the fact that relational algebra done in the wrong order can be non-deterministic or not equivalent transformations, so even if the query processor is smart enough to do rewrites whatever the developer enters might be logically different unintentionally.
Ultimately I think it's a tool that lets the developer thinking about the problem in the way they prefer, rather than thinking about the problem in the way that best suits the problem at hand. Like insisting on writing documentation in LaTeX instead of Word or Markdown.
*: I believed this was in SQL 2023, but double checking it looks like it did not have make the final standard. I would be surprised if it didn't make it in the future, however.
The PRQL extension discussed here is for Postgres, so my solution was Postgres specific.
Since when is QUALIFY part of the SQL standard? So far I have only seen it as a proprietary feature in Terradata.
Damn. Reading your comment, i was about to be really glad that this pain would be a thing of the past before too long. Too bad it didn't make the standard :(
2 replies →
With the additional benefit that resources about SQL are everywhere and every question has been answered already multiple times. On top of that you are most likely to encounter SQL at a job than PRQL.
You'd also need a LIMIT or a TOP, and you might need a subquery for that depending on your SQL of choice.
The question was "the longest track per album", so I'd expect one row per album as the result (which is what my query does). But adding a LIMIT would not do that.
It might be that the question was "the longest track across all albums" which indeed would require a LIMIT
1 reply →
select id from tracks qualify row_number() over partition by (album_id order by milliseconds desc) =1
That should work
Look I’m a diehard SQL just use it guy but open to improvements. But I’m loathe to use abstractions for things when the underlying thing is so expressive.
Autocomplete of fields in a good editor, schema help, etc go a long way to making SQL being written raw very nice.
QUALIFY is not part of the SQL standard.
I'm pretty sure QUALIFY was added in SQL 2023. Maybe it was only discussed and didn't make it.
2 replies →
Then a subselect with the window function and an outer where clause where the window function column = 1
In the real world, probably something like:
I agree the PRQL's pretty nice here, but I think such a generalised example (chances you actually want to `select *`?) overstates the advantage.
This query would error: "Field `id` is neither in the GROUP BY nor in an aggregate function." Since `id` is a key in `tracks` and `album_id` is a foreign key and not unique in `tracks`, the RDBMS wouldn't be able to use any implicit GROUP BY determinism, either.
You could do this:
But this is unlikely to perform as well as the row_number() method because it will cause the RDBMS to generate a record for every track and then waste time sorting the intermediate results to find the unique records in the output.
your query will return just album_id and id, not other track fields
They can ask chatgpt to generate the SQL query and use the SQL output that everyone is familiar with rather than use an abstraction that is prone to versioning and behavioral changes and will consume everyone else's time to go learn a new language and become profficient in it.
if they need ChatGPT to generate the query (costly and possibly error-prone), then it means that not everyone is familiar with the output...
I have a feeling this language will be more familiar to programmers who think functionally. I say this because it seems to consist of transformations applied to data and of `derive`, which defines new variables later used in other transformations.
This would have been awesome for me 2 years ago.
Currently much of my complicated SQL is generated by a LLM.
Hmm, I would think that LLM helps adoption for the semantic layers such as PRQL, Malloy, and dbt since it's possible to generate/validate/iterate 5 lines of PRQL compared to 25 lines of SQL but considering none of them widely adopted yet, you might indeed be correct in a way that LLM makes it harder for the new tools to gain adoption by helping you to suffer less from the verboseness of SQL.
It’s a tough call. I run a small analytics team and am starting to train some analysts to code. Just the other day I basically told one of my reports to focus on learning Python and let ChatGPT teach him SQL by example because I think it’ll be easier to grok the explanations. Now I’m looking at PRQL and Malloy and asking myself if it’s really a path I should send them down, and I’m not sure it’s a good idea.
1 reply →
Sorry to be that guy :-) The `introduction.prql` example on the playground gives an idea of the better readability of PRQL vs SQL and your small example an idea of the speed you may gain from writing PRQL vs SQL. It is interesting. Indeed, me writing the SQL would have taken more time than you writing the PRQL.
Won't this just lead to developers wrecking performance because they don't understand what's happening?