← Back to context

Comment by maximilianroos

4 years ago

> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?

I don't have an example on the Readme, but I was thinking of something like (toy example):

  table newest_employees = (
    from employees
    sort tenure
    take 50
  )
  
  from newest_employees
  join salary [id]
  select [name, salary]

Or were you thinking something more sophisticated? I'm keen to get difficult examples!

Edit: formatting

When you add in the ability to reference different tables like that to the piping syntax, it starts to remind me of the M query language: https://docs.microsoft.com/en-us/powerquery-m/quick-tour-of-...

There, each variable can be referenced by downstream steps. Generally, the prior step is referenced. Without table variables, your language implicitly pipes the most recent one. With table references, you can explicitly pipe any prior one. That way, you can reference multiple prior steps for a join step.

I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar in SQL CTEs anyway, so it should probably work.

My gut reaction is that if we have "from first" then maybe we should have to "to last":

  from employees
  sort tenure
  take 50
  as newest_employees
  
  from newest_employees
  join salary [id]
  select [name, salary]