← Back to context

Comment by xi

4 years ago

Maybe you'd like to check FunSQL.jl, my library for compositional construction of SQL queries. It also follows algebraic approach and covers many analytical features of SQL including aggregates/window functions, recursive queries and correlated subqueries/lateral joins. One thing where it differs from dlpyr and similar packages is how it separates aggregation from grouping (by modeling GROUP BY with a universal aggregate function).

This is awesome! I'll add a link to it on PRQL.

I guess the biggest difference between FunSQL (and similarly dbplyr) and PRQL is that the former needs a Julia (or R) runtime to run.

I really respect the library and keen to see how it develops.

  • FunSQL.jl requires Julia to run (obviously as it is a Julia library) but it produces standard SQL so Julia in this case is just an implementation language.

    I have re-implemented parts of FunSQL in Python and OCaml (the one I have ended up using) and have added a concrete syntax similar to what you have in PRQL.

        from employees
        define
          salary + payroll_tax as gross_salary,
          gross_salary + benefits_cost as gross_cost
        where gross_cost > 0 and country = 'usa'
        group by title, country
        select
          title,
          country,
          avg(salary) as average_salary,
          sum(salary) as sum_salary,
          avg(gross_salary) as average_gross_salary,
          sum(gross_salary) as sum_gross_salary,
          avg(gross_cost) as average_gross_cost,
          sum(gross_cost) as sum_gross_cost,
          count() as count
        order by sum_gross_cost
        where count > 200
        limit 20
    

    But, in my mind, the biggest difference between PRQL and FunSQL is the way FunSQL treats relations with `GROUP BY` - as just another kind of namespaces, allowing to defer specifying aggregates. A basic example:

        from users as u
        join (from comments group by user_id) as c on c.user_id = u.id
        select
          u.username,
          c.count() as comment_count,
          c.max(created_date) as comment_last_created_date
    

    The `c` subrelation is grouped by `user_id` but it doesn't specify any aggregates - they are specified in the `select` below so you have all selection logic co-located in a single place.

    I think this approach is very powerful as it allows you to build reusable query fragments in isolation but then combine them into a single query which fully specifies what's being selected.

  • Writing an alternative syntax is straight forward. Perhaps prototype PRQL using xi's excellent FunSQL backend? This way it's working out of the gate. Once syntax+semantics are pinned, writing another backend in the language of your choice would then be easier. Getting the backend correct is non-trivial work, and xi has done this already. Besides, we need a sandbox syntax anyway, so it might be fun to collaborate.