Comment by BeefWellington

4 years ago

This is another in a series of these kinds of proposals that look excellent on first glance for perhaps the 75% case but start getting syntactically messy when I want to customize the resultset returned.

On the surface, they're always neat but when you start to dig into how you'd implement something in an RDBMS, it begins to fall apart.

Let's look at the example syntax:

    from employees
    filter country = "USA"                         # Each line transforms the previous result.
    gross_salary = salary + payroll_tax            # This _adds_ a column / variable.
    gross_cost   = gross_salary + healthcare_cost  # Variable can use other variables.
    filter gross_cost > 0
    aggregate split:[title, country] [             # Split are the columns to group by.
        average salary,                            # These are the calcs to run on the groups.
        sum     salary,
        average gross_salary,
        sum     gross_salary,
        average gross_cost,
        sum     gross_cost,
        count,
    ]
    sort sum_gross_cost                            # Uses the auto-generated column name.
    filter count > 200
    take 20

Where in here is it clearly stated which fields are returned? In the original SQL it's right up front but here it's buried into the "aggregate" function, and I'm not clear that this isn't an oversight.

Another example that speaks to the "how do I implement this" side of the equation:

    from employees
    filter country = "USA"                         # Each line transforms the previous result.
    gross_salary = salary + payroll_tax            # This _adds_ a column / variable.
    gross_cost   = gross_salary + healthcare_cost  # Variable can use other variables.
    filter gross_cost > 0

Does this mean that the database must scan all records of the employee table in order to return the result before moving to the next step in the query? Must I index all fields? If not, how does a query planner prepare for this scenario?

The major tradeoff you make in most ORMs is exactly this: You lose out on being able to be explicit about how many queries are sent to the DB (and in many cases how efficient those queries are). Now this would become a language feature? What do I gain for that loss?

I'm not saying that SQL Syntax is perfect; far from it. I'm not seeing how this is an improvement.

I think if you want traction though, a proof of concept using an existing RDBMS would go a long way into providing evidence that this will work and is sufficiently thought out to deal with even the basics of what existing SQL databases have to. Query planning is hard, especially if you want it to be fast.

> Where in here is it clearly stated which fields are returned? In the original SQL it's right up front but here it's buried into the "aggregate" function, and I'm not clear that this isn't an oversight.

It's in the aggregate portion, like you said. Other example queries have a select portion. Why does it matter that it's not in the leading position like SQL?

> Does this mean that the database must scan all records of the employee table in order to return the result before moving to the next step in the query? Must I index all fields? If not, how does a query planner prepare for this scenario?

No, they are just describing how the statement is supposed to be interpreted by a human. I think you can basically just shuffle all the filter statements to the end and keep it logically equivalent.

This is a proposal for a "transpiles to SQL" language. So long as that transpliation is predictable, you cannot run into the sort of issues you are describing.

  • > It's in the aggregate portion, like you said. Other example queries have a select portion. Why does it matter that it's not in the leading position like SQL?

    I don't mind it not being in the leading position. The author provided a very simple query and in that case it's not immediately apparent what fields to expect the resultset to contain when returned to the consumer.

    This is a troubleshooting issue more than anything else. IMO placing the "selected fields" into the very centre of the query is distracting and obfuscates what is happening.

    > This is a proposal for a "transpiles to SQL" language. So long as that transpliation is predictable, you cannot run into the sort of issues you are describing.

    I think a good test of whether any transpiled language works well is to look at whether it could work on its own as a language. See: Typescript.

I don't think it's falling apart at all. Personally, I would require that what columns get returned be explicit (optionally with a * type syntax that you have to enable - the defaults should be safe and * has its risks). For one thing, you don't necessarily want to return all the columns you have aggregated. E.g., you may be running the equivalent of a HAVING clause on an aggregate column, so don't need the value returned.

"Each line transforms the previous result." - I assume this is referring to the order that transpilation happens, so you can read it top to bottom and understand the flow easily.

One thing I would like to see is how a recursive CTE might look.