← Back to context

Comment by galkk

4 years ago

I'm quite opposed to the idea "from should be first".

I want to understand what exactly the query returns, not the implementation detail of the source of this data (that can later be changed).

Literally first example from page - I have no idea what is being returned:

    from employees
     filter country = "USA"                           # Each line transforms the previous result.
     let gross_salary = salary + payroll_tax          # This _adds_ a column / variable.
     let gross_cost   = gross_salary + benefits_cost  # Variables can use other variables.
     filter gross_cost > 0
     aggregate by:[title, country] [                  # `by` are the columns to group by.
          average salary,                              # These are the calcs to run on the groups.
          sum     salary,
          average gross_salary,
          sum     gross_salary,
          average gross_cost,
          sum     gross_cost,
          count,
 ]
     sort sum_gross_cost                              # Uses the auto-generated column name.
     filter count > 200
     take 20

of course, similar things are happening to SQL too, with CTEs becoming more widespread and "real" list of the columns hidden somewhere inside, but it's still parseable

    SELECT id, name, author

Quick, what is this query about? What's ironic is that I think you have it backwards: the columns are the implementation detail, not the table. The table is the context: you can't change that without having to change everything else. But columns are the last step, the selection after the filters, joins, etc. They can be changed at any time without affecting the logic.

  • This is... An odd choice. I'd assume I'm not without context looking at a query to know why I would want those columns.

    And the auto complete story is backwards. Often I know what columns I want, but I'm not clear what table I need to get them from. Such that, if you make a smarter suggest in the from to only include tables that have the columns, I'd be much happier.

    • Just throwing in another point of anecdata onto this pile: "Often I know what columns I want, but I'm not clear what table I need to get them from" does not make sense to me. I don't relate at all to their being a global namespace of columns, rather than a namespace of tables, each with its own columns specific to its context.

      4 replies →

  • It's query asking for the id, name, and author fields. Very straightforward, I have no idea how this is confusing.

    > The table is the context: you can't change that without having to change everything else.

    Except even in the provided single-table example this isn't true - you're getting subselected/CTEd results. No functional joins are demonstrated unfortunately.

    For example:

       from employees
       left_join positions [id=employee_id]
    
       ...is equivalent to...
    
       SELECT * FROM employees LEFT JOIN positions ON id = employee_id
    
    

    No data is selected from positions in either example, and it's unclear on why we're joining that table (other than just for the heck of it). It's not a workable example.

    • You restated the query; I was asking what it's about. Is it a query across publications? Or is it a query over news articles? That context changes everything: how the query is written, what it can be joined with, how it can be filtered, how it is used, etc. Putting the FROM clause first means that you immediately have context to understand the rest of the query.

      2 replies →

The big advantage of "from first" like we have in Kusto KQL (a database we use at Microsoft) is that it provides much better autocomplete (if I write the `from` it can easily autocomplete the projection).

If you want an interesting example of how a query language built for developer experience and autocompletions looks definitely check it out!.

  • That's interesting because it also explains why I was going to say I do like having from first. When trying to reason about a query, I mentally go through the following:

    1. What tables are being pulled from? This speaks to the potential domain of the query. 2. What data is being selected (I can now know what is or isn't being pulled from the aforementioned tables...) 3. What operations, aggregations, groupings, etc. are being performed to work on the pulle data

    Of course from vs select ordering is completely arguable, but my thinking process seems to follow that of the auto complete--in other words that my cognitive load of looking at the select statement is lessened when I know from what the columns are being selected.

    It also follows (at least to me) the mental process of writing the query. First look at the tables, then decide what columns, then decide what functions to apply.

  • I said it in a sibling, but I feel this is somewhat missed. Auto complete that simply lists the tables is easier if from is first. But... Auto complete that helps me know what tables can give me my requested columns works the other direction.

  • Designing languages around autocomplete is like designing toilets for better toilet paper dispensers.

    The language should be right for human understanding, not automated mad-lib generation.

    • You would think that but having used both I find writing Kusto/KQL much smoother, neater and faster and if I have to choose between writing a query in either one I'd pick KQL.

      I understand this is just an opinion but it's an opinion held by everyone in my org who writes both.

      Theoretical correctness loses to pragmatism a lot and I'd read the KQL every day. Look at the examples at https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q... - look at the examples at https://docs.microsoft.com/en-us/azure/data-explorer/write-q... and tell me they're not more readable than comparable SQL?

      (I can see the result type both by hovering on the query but also by just looking at the end of it - and in SQL most of the SELECTed items in complex queries are from subqueries anyway - at least in my use case)

      2 replies →

    • Building for autocomplete is building for human understanding. If it is impossible for a computer to determine the context of your query, why would a human do much better?

      4 replies →

    • I’ve been frustrated by toilets where I have to contort my body to reach the dispenser. Similarly, I’ve had dispensers intrude on the space where my legs would normally be and make it awkward to even just sit on the toilet.

      Toilets are absolutely designed to make the dispenser placement convenient. You just don’t think about it because 95% of toilets get it right, so it just doesn’t bother you that much that it can be wrong.

      In SQL, some decisions are right about 10% of the time and are annoying and awkward the other 90%.

      That’s why the order matters. Because everything else got it right.

    • I'd agree if there was any way whatsoever of fixing this issue, but there simply isn't. The editor can't even begin to guess what you might want until you write your FROM.

      6 replies →

    • And one of use cases is writing queries which it helps immensely. Best of both worlds would allow both orders. Just automatically transform the query to the usual form after it's execution.

I think it's quite a common convention in engineering - not just software - that the input to a process "goes in the top and out the bottom". We humans read top->bottom (regardless of left/right/vertical, I don't know any languages that write bottom up). Conventional voltage in circuit diagrams usually flow top to bottom. Gravity loads in schematics flow top to bottom. Chemical pathways are usually written top to bottom. And of course functions take arguments up top and return at the bottom, maybe with some short circuits. I think the only counter example of note is distillation columns.

Where is the data coming from? Employees table. What's coming out? 20 rows of sum_gross_cost.

What could improve this is function signatures. It's kind of nice to have the whole abstraction up top...like an abstract.

I agree that the columns of the results should be more obvious. But I am a proponent of "from should be first". I have never written a SQL query without thinking about the contents of a table or its relations. If it was my way, I would describe where the data I'm pulling from, then describe any filters/joins, then describe the columns that I'm interested in (last).

  • You've never authored a SQL query that does things like check special functions that don't exist in a table?

    For example:

        select @@version

    • Seems easy enough to work around with a magic table name in this hypothetical future reworked dialect of SQL?

          from @@special
          select version

      1 reply →

It's a fair sentiment, but it can be handled without losing directional flow and composability, some of the bigger advantages of reworking SQL.

One idea would be along the lines of a function prototype: a declaration, up front, about the columns and types that a query is expected to return. It's a good place to put documentation, it's redundant information which should protect against mistakes but not so redundant that it would be too taxing - the author should know what the query returns. The prototype would only be used for validation of column names and types.

Another idea would be requiring the last element in a query to be a projection, a bit like the return statement in a function body: here's what I'm returning out of the grand set of symbols available (e.g. via various joins) in scope from previous operations in the flow.

I'm also completely unfamiliar with the PRQL syntax, outside of right now.

Reading the comment however, it would seem that `let` adds columns which are implicitly returned in the order they are defined.

I do see benefits in this, and can imagine pitfalls. Hard to judge without kicking the tires.

Update: It's quite possible we saw different syntax!

https://news.ycombinator.com/item?id=30063266

Without the `let` I would imagine having trouble reading it as well, I'm not sure if that would go away with familiarity but my instinct is that it's a useful addition.

This feels like a English-language thing. In english we tend to put our adjectives first, it feels natural, "Where is my red, round ball?", rather than some other languages (like German) where you put the subject first. Equivalent of "Where is my ball, red & round?"

While it inherently feels unnatural I do agree with the others here that the context is actually easier to understand once over the initial uncomfort.

_from_ is kind of one of the most important context about the data being returned. It provides the type information. Columns you select are just properties of that type.

In SQL, where _from_ is placed at the end, we are essentially writing equivalent of 'property.object'; eg.: name.person, age.person

Both CTEs and this idea address the same problem: poor readability of complex SQL queries. Compared to CTEs, the author takes the idea to split the complex query into parts to the next level.

To your point - a solid IDE will show you what's being processed at each line (or returned, if the cursor is on the last line) - in an autocomplete window or a side panel.