Comment by ako

4 years ago

With a CTE it would read a bit more like prql:

  with usa_employees as (
    SELECT
            title,
            country,
            salary,
            (salary + payroll_tax)                   AS gross_salary,
            (salary + payroll_tax + healthcare_cost) AS gross_cost
    FROM  employees
    WHERE country = 'USA'
    AND   (salary + payroll_tax + healthcare_cost) > 0
  )
  select  title,
        country,
        AVG(salary)         AS average_salary,
        SUM(salary)         AS sum_salary,
        AVG(gross_salary)   AS average_gross_salary,
        SUM(gross_salary)   AS sum_gross_salary,
        AVG(gross_cost)     AS average_gross_cost,
        SUM(gross_cost)     AS sum_gross_cost,
        COUNT(*) as emp_count
  from      usa_employees
  group by  title, country
  having    count(*) > 200
  order by  sum_gross_cost
  limit 3

Readability is pretty similar to prql. It would really help in SQL if you could refer to column aliases so you don't have to repeat the expression.

My brain just thinks in CTEs over sub queries. I really dislike that my co-workers use these ridiculously nested sub sub sub queries.

I just look at something like this and I immediately know what's going on. If it's nested sub queries it always takes me much longer.

  • To me those nested sub sub sub SQL queries come from a similar place as beginner coders who tend to make nested IF statements - a lack of experience with the language.

    For very complicated stuff, SQL does become very hard to read compared to e.g. tidyverse + targets in R.

In some cases for removing repeating (intermediate) calculations, I generally find it easier to use a lateral join (in postgres), like

    select
        title,
        country,
        avg(salary)         as average_salary,
        sum(salary)         as sum_salary,
        avg(gross_salary)   as average_gross_salary,
        sum(gross_salary)   as sum_gross_salary,
        avg(gross_cost)     as average_gross_cost,
        sum(gross_cost)     as sum_gross_cost,
        count(*)            as emp_count
    from
        employees,
        lateral ( select
            (salary + payroll_tax)                   as gross_salary,
            (salary + payroll_tax + healthcare_cost) as gross_cost
        ) employee_ext
    where
        country = 'usa'
        and gross_cost > 0
    group by  title, country
    having    count(*) > 200
    order by  sum_gross_cost
    limit 3;

  • So now we have easily come up with three different ways of rewriting the query to avoid that duplication (which obviously was not a problem at all to begin with): subquery, CTE and lateral join. And there are also several more well known ways (views, custom functions, computed columns etc) so the whole premise now for even inventing a "better" language than SQL is then false? Or what am I missing.

    It's also weird how people always argue for immutability and eliminating local state, when using procedural languages, but as soon as they switch to SQL, that actually works like this, they immediately want to introduce mutability and local state.

    • > so the whole premise now for even inventing a “better” language than SQL is then false?

      I don't think anyone is using the above examples to try invalidate PRSQL, just suggesting the baseline for comparisons should account for all constructs available in the SQL standards and common implementations there-of.

      > Or what am I missing.

      The statement “I can do X better than <SQL example> with <something else>” does not properly show the benefit of <something else> if “I can do X better than <SQL example> with <another SQL example>” is also true (assuming <another SQL example> is actually agreed to be better, not for instance convoluted/confusing/long-winded/other so just replacing some problems with others).

    • If there's multiple ways to do the same thing that's usually a BAD thing in terms of language design. Especially if some approaches are just newbie traps that experts learn to avoid, or if deciding the best method is a really subtle context-dependent decision. The ideal design is that the language encourages the one obviously "good" way to do it.

      1 reply →

Column aliases would have saved me hundreds of hours over the course of my career. Sorely missing from standard SQL, and would make the need for PRQL less acute.

Snowflake lets you refer to column aliases, and it's great!

There's the slight issue of shadowing of table column names, which they resolve by preferring columns to aliases if both are named the same. So sometimes my aliases end up prefixed with underscores, but that's not a big deal.

  • The trade-off is that a schema change (adding a column) unrelated to your query can modify its behavior.

    Favoring aliases over columns instead has the potential to introduce irresolvable ambiguities as you can’t “qualify” a column alias with a SELECT list or subquery ID the way you can qualify a column by its table/view alias.

> With a CTE

The DB we use supports those, I just learned about them too late so keep forgetting they exist :(

> It would really help in SQL if you could refer to column aliases so you don't have to repeat the expression.

The DB we use supports that, so in your CTE you could write

   AND   gross_cost > 0

We do that all the time, which will be a pain now that we're migrating to a different DB server which doesn't.

Not all database systems can optimize queries well over CTE boundaries. I believe this is still true for PostgreSQL (no longer true, see below -- it was true a few years ago). So there's a potential performance hit for (the otherwise excellent advice of) writing with CTE's.

what expressions are being repeated here?

  • > (salary + payroll_tax) AS gross_salary,

    > (salary + payroll_tax + healthcare_cost) AS gross_cost

    > AND (salary + payroll_tax + healthcare_cost) > 0

    And his is a simple example.