Advent of Code 2024 in pure SQL

3 months ago (databasearchitects.blogspot.com)

I reacted to this title the way I react to a new menu item at Taco Bell: a strange mixture of desire, shame, and admiration for human ingenuity.

  • I work a lot with databases and I've seen... stuff. It's not as bad as you might think if you know what you are doing. Most RDBMSs support recursive CTEs; it feels like writing Prolog with a slightly sadistic syntax. For something like AoC, the most difficult part is probably parsing the input.
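
    To see the resemblance, here's a minimal sketch (Postgres syntax; edge(src, dst) is an assumed table): the classic transitive closure, which is more or less Prolog's path/2 relation in SQL clothing:

        WITH RECURSIVE reach(src, dst) AS (
          SELECT src, dst FROM edge               -- base case: direct edges
          UNION                                   -- UNION (not ALL) dedups, so cycles terminate
          SELECT r.src, e.dst
          FROM reach r
          JOIN edge e ON e.src = r.dst            -- recursive case: extend a known path
        )
        SELECT * FROM reach;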

    • Speaking of parsing: back around Y2K we were building an app that used XML everywhere, which was the style at the time, and our DBA wanted to write an XML parser in SQL (the API would involve sending XML to the database). That got vetoed.

      IMO, this kind of thing is what AoC is good for - you get to play with weird/obscure stuff without affecting your day job code.

      3 replies →

    • I do AoC in SQL, and I wish that were true. With Postgres, you have lots of regex/string-manipulation functions that make parsing easy.

      For me, the biggest problem was memory. Recursive CTEs are meant to generate tables, so if you are doing some maze traversal, you have to keep every step in memory until you are done.
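
      For example, assuming the raw puzzle input sits as a single text value in a one-row table input(raw), Postgres can explode it into numbered lines in one go:

          SELECT ord, line
          FROM input,
               regexp_split_to_table(raw, '\n') WITH ORDINALITY AS t(line, ord);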

    • It is closer to Datalog, I think (or can you express cut?). CTEs are fairly restricted compared to logic programming languages, though, at least in Postgres. In particular, relations cannot be mutually recursive, and rules may only be linearly recursive in themselves (i.e. they can contain only one instance of themselves on the right-hand side). Postgres is overly restrictive in the latter case: it requires at most one recursive reference across all subqueries in the UNION, even though it would be safe to restrict only the number of recursive references within each subquery (each corresponding to a separate Datalog rule for the same relation). It is possible to work around that restriction using a nested WITH expression (a hack, really), but then you are on your own, since it disables all checks and allows you to write rules that use actual nonlinear recursion and produce incorrect result sets when evaluated.
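
      Concretely, Postgres rejects this kind of query (reach and edge are hypothetical relations here), even though the corresponding nonlinear Datalog rule is perfectly meaningful:

          WITH RECURSIVE reach(src, dst) AS (
            SELECT src, dst FROM edge
            UNION
            SELECT a.src, b.dst
            FROM reach a JOIN reach b ON a.dst = b.src   -- two recursive references
          )
          SELECT * FROM reach;
          -- ERROR: recursive reference to query "reach" must not appear more than once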

      I really would like Postgres to have proper support for writing Datalog queries, with better and more efficient incremental evaluation algorithms instead of the iterative semi-naive algorithm that is all it supports today.

      3 replies →

    • A company I worked for uses the Syteline ERP, which relies heavily on SQL Server. The DBA was constantly complaining about how slow Syteline's SQL was. One major issue was long-running transactions that took 10 minutes, locking rows/tables for too long and using a lot of memory. You would think very expensive ERP systems would have decent SQL.

    • > Most RDBMSs support recursive CTEs, it feels like writing Prolog with a slightly sadistic syntax.

      Which makes sense as both are declarative logic-based languages. IMHO, SQL and Prolog fundamentally have much in common.

      1 reply →

    • Parsing is the most difficult part for probably the first third of the problems. When you get to day 19 or so, the input is still just a grid or a bunch of ints, just like day 1, but the algorithms required are considerably more challenging than the parsing. (I've done all 25 problems in all years.)

    • Thanks for that comment.

      I laughed aloud at "It's not as bad as you might think if you know what you are doing."

      ... because that pretty much describes all human activity :-)

  • I'm as equally amazed by the solutions in this post's github repo as I am with Taco Bell's new chicken nuggets.

    • ⊂ •͡˘∠•͡˘ ⊃

      Suspicious. Need to investigate whether Taco Bell has interesting ANSI-SQL-flavored chicken nuggets or I've been taken for a fool!

  • What I can't stand about Taco Bell is the fake nacho cheese. The ordinary grated cheese on (say) a hard taco is OK even if it's not the best, but it takes a strong act of suppression for me to choke down anything with Velveeta in it. Maybe you can drill into their tablet interface to get ingredients, but as it is, it's a game of chance. (Funnily, there's a taco stand a block from the Taco Bell with the best street tacos I've seen anywhere except LA, but too often it's not open when I'm there.)

    Seriously though,

    https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming-d...

    is a masterclass in extreme SQLmanship.

  • But why? What would make you react to human ingenuity with shame and desire? Is this something about you, or something about them in particular? Isn't the whole of HN about human ingenuity...?

    Are we supposed to feel Taco Bell-menu feelings about it all? What am I missing?

Nicely done. I know this seems crazy at first but in my opinion big SQLs are one of the best ways to store complexity.

The complexity of the problem is the real issue. SQL is a standardized, condensed, extremely performant, actually testable, and logical language. Sure, not everybody can instantly maintain it, but the same would be true of a lot of lines and functions in Java. The more lines, the more risk of bugs.

I also love how deep SQL goes, and that makes total sense: it has powered the world of data for 40+ years, so of course people have asked for niche features. One of my favorites is the MODEL clause in Oracle, with which you can work with multidimensional arrays. A friend implemented Conway's Game of Life with it in far fewer lines than you'd expect.
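
For a taste, here is a minimal MODEL sketch (Oracle syntax; the sales table and the growth rule are invented for illustration). It treats the result set as an array you can address cell by cell:

    SELECT year, amount
    FROM sales
    MODEL
      DIMENSION BY (year)                   -- the array index
      MEASURES (amount)                     -- the cell contents
      RULES (
        amount[2025] = amount[2024] * 1.1   -- upserts a brand-new cell
      )
    ORDER BY year;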

  • Once upon a time, as an intern, I had the 'fun' task of optimizing the performance of a stored procedure written by someone with a math PhD. It was more than 6 pages when printed, took more than 30 minutes to run (it was used in billing), and had no tests.

    Ended up rewriting it in native code, and it ran in less than a second. Most of the work was proving it produced the same results... and writing and documenting test cases, so the next person wouldn't have to go through that.

    After that experience, I have generally avoided putting a lot of business logic in SQL...

    • I definitely see how this happens. SQL is easy to start with but super hard to master. And it requires good data models.

      But if both are in place, there is no way native code will be faster than the database. No network latency, set operations, index usage, etc. The DB has all the info it needs to make data access fast.

    • I wrote a 120-line SQL procedure that replaced an 8k-line Java royalty-payment processor.

  • > in my opinion big SQLs are one of the best ways to store complexity.

    Only if, ONLY IF, you have a lot of people that are well versed in SQL. It's very easy to write bad SQL. It's difficult to unravel thousands of lines of bad SQL spread across hundreds of procedures / views / functions. Ask me how I know...

  • I get this feeling as well, but then again debugging large SQL queries can be very opaque. Things like pl/pgsql help, but then it starts becoming more and more like a normal programming language.

  • > but in my opinion big SQLs are one of the best ways to store complexity.

    It seems crazy at first, and then I keep thinking about it and it still seems crazy to want complexity to live in SQL.

    Personally, I want anything complex to ideally be easily testable, both manually and automatically. SQL is easy to test manually, but harder to test automatically, especially compared to code in a general-purpose programming language. At least you can gradually untangle a ball of spaghetti code into something less dense, then attack parts of it. How would you deal with a tangled ball of SQL spaghetti?

    > The more lines the more risk for bugs.

    I don't fully agree with this either; not all lines are equal. One big fat 400-character SQL query has a higher chance of containing issues that are hard to spot at a glance than 400 lines of Java code, and I say this as a person who despises Java for various reasons.

If you like this kind of degeneracy, I tried AoC in Google Sheets this year.

I only made it to Day 6, and not even both stars every day. I'm pretty confident my Day 7 solution is correct, but I hit per-cell character limits on longer inputs.

Enjoy :) Oh, but don't open it on mobile; some sheets crash the app.

https://docs.google.com/spreadsheets/d/10FY-89y19tnRM_EAAnCd...

  • I'm on my phone so I can't open it now.

    Does it use Google Apps Script? That would be a way to gain extra power, I think.

    • > Does it use Google Apps Script? That would be a way to gain extra power, I think.

      Isn't that just straight up JavaScript at that point? Feels kind of like cheating if the goal is to complete AoC with a spreadsheet.

Over my career I've certainly written more SQL than any other type of code. Not so much in the last five years so I'm sure I've lost some of it, but I used to really enjoy it. Once you stop thinking iteratively and start thinking in set operations it becomes quite natural and powerful.
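
A classic example of that shift: a running total, which is a loop plus an accumulator in most languages, becomes a single window expression once you think in sets (the sales table is invented for illustration):

    SELECT day, amount,
           SUM(amount) OVER (ORDER BY day) AS running_total   -- no loop in sight
    FROM sales;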

  • Over the years I have been pushing more and more responsibilities into the RDBMS. I now see things mostly in terms of ETL, SQL and schema. Virtually every conversation I've ever had about the application of technology to the business could be expressed in these terms. Business logic defined as SQL queries can be quite intuitive when the schema is structured well and aligned to the business stakeholders' perspectives.

    Code, frameworks, ORMs, "best practices", patterns, et al. are ultimately a distraction. There are a million ways to get the data in and out of the database; moving the bits around is not valuable. There are many overblown software solutions out there that could have been a simple MERGE statement or a CSV import job.

    I think a lot of the misconceptions and bad moods about SQL come from being forced to work with nasty schemas. The language itself is really domain-specific. Consider that one wouldn't complain as much about a super fucked up nested query (and the resulting SQL syntax woes) if it weren't necessary to write such a query in the first place. Aligning tuples and relations with the way the business typically talks about them means you will be less likely to fight these things over time. Often it isn't possible to refactor the schema from zero, but you can put replicas/views around a "bad" schema and target those with your new development & refactors.

    • > Consider that one wouldn't complain as much about a super fucked up nested query (and resulting SQL syntax woes) if it wasn't necessary to write such a query in the first place.

      And in "modern" SQL this is solved with CTEs. Use them to unwrap the schema the way you want it first, before implementing the business logic.

      7 replies →

  • I second all of that!

    I wish more people would see the beauty. After a session of SQL, I take a step back and think:

    "Hold on. What I have been doing lately is just pure logic. No library dependency resolution, no concurrency problems (even though massive concurrency is certainly under the hood). No mutability issues. Just logic."

    SQL obviously has its warts, some of them serious, like testability. But at the end of the day, I wish all programming was like that. Let the computer decide how to do stuff under the hood. And let the human focus on the logic.

    I have made some half-assed attempts to read up on Prolog to take this to the next level, but have failed so far, unfortunately. (Part of the goal was to unlearn some SQL to avoid getting stuck in a local optimum.) Maybe somewhere between SQL and Prolog is the future of programming.

    • > Maybe somewhere between SQL and Prolog is the future of programming.

      Must be Datalog then ;)

    • Prolog is very powerful; seeing what professionals can do with it is eye-opening. Unfortunately, it takes completely relearning programming to achieve that proficiency level. And after you reach it, you probably cannot use it in your day job...

      3 replies →

    • > Maybe somewhere between SQL and Prolog is the future of programming.

      it was, it most probably is

  • > Once you stop thinking iteratively and start thinking in set operations it becomes quite natural and powerful.

    I dunno... I've written a tremendous amount of SQL, and I still have to think imperatively (iteratively) in order to write queries that are actually performant, and to know which indexes need to exist.

    It would be wonderful if I could just think in terms of set operations, but that tends to result in queries that take 5 minutes to execute rather than 5 milliseconds.

    My entire thought process is basically -- what table do I start with, what rows in what order, joining to what, under what conditions, aggregating how, rinse and repeat... It's entirely a mental model of loops and aggregation, never of set operations.
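
    To make that concrete, the thinking usually starts from the plan rather than the sets (a sketch in Postgres; the tables are invented for illustration):

        EXPLAIN ANALYZE
        SELECT c.region, SUM(o.amount)
        FROM customers c
        JOIN orders o ON o.customer_id = c.id   -- hash join or nested loop? the plan tells you
        WHERE c.region = 'EU'
        GROUP BY c.region;

        -- and the index that often decides the answer:
        CREATE INDEX ON orders (customer_id);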

    • That may be true until you do your ETL in an index-less database such as BigQuery or Trino. Postgres will always be faster for optimized, end-user-serving queries.

      But BigQuery allows you to scale it to 100s of CPUs without having to worry about indexes.

      5 replies →

  • Being able to master the theoretical, practical, and skill-based components of designing a good database schema is the absolute truest test of understanding any systems design.

    People skip ahead to all kinds of nonsense; but most of software engineering is putting the right data into the right format, and moving it around reliably.

    I just did a major refactor of a complex distributed code base. I pretty much only count the work I did on schema redesign as the actual “job”; the rest was many hours of coding, but that’s really just implementation.

    There are other ways to define schema than SQL of course, but it’s really the perfect way to learn true systems engineering.

    • Very true. My manager at one of my first jobs liked to say "get the data model right and everything else will be easy" and that has largely been proven true in my experience (and it even applies if you're not using an RDBMS).

  • Yeah, I also kind of like coding in SQL, with PL/SQL being my favourite extension language, which is kind of heresy on HN, but whatever.

I have been writing a ton of SQL -- implementing a lot of the business logic of a (stream processing) application in it. I really really like it, especially that I bring the computation to the data instead of the data to the computation.

I often talk to developers who hate that idea though. They want me to instead move all the data to the backend, for a massive IO hit, just so that the computations can be expressed in a "real" programming language.

I think SQL the concept is good, but SQL the language is the problem. There are so many awkward things in it -- not surprising, as it has had no competition in 40(?) years! The mental model of the program I write is fine, but I really need to look past the syntax to the program I am actually writing to see any elegance...

What we need, I think, is a properly designed programming language, built for existing databases (Postgres, MSSQL) and compiling to their SQL dialects. I see some contenders, but they all have some attachment to a special domain, such as not allowing data modifications (PRQL) or being tied to particular databases.

Itching to do it myself, but it's a lot of work and a long long road to any adoption with no guarantee of success, and no revenue stream that I can think of.

The most popular backend languages were made by large companies, but I think coding in SQL is stuck in a catch-22: it will be frowned upon until there is a better language, and there will be no better language until it is more popular...

  • Me too me too :D

    There's a lot that is very right about SQL, but a few clunky bits around the edges.

    CTEs made a lot of difference, and window functions - which can be a bit head bending - made difficult things a tiny bit easier.

    I'm using BigQuery, which supports structs and arrays, but only recently allowed grouping by arrays, and there is still no equality check, etc.

    BigQuery is slowly adding more sugar, like aggregate UDFs and polymorphic UDFs using ANY TYPE parameters etc, and I find myself putting more reused logic into tidy functions, but my pet want is for temporary functions to be declared and scoped like CTEs so they integrate a lot better into tooling like DBT that wants everything to be in one statement.

    And the one most productive thing they could add? Allowing you to specify null behaviour on JOIN USING. (Having to spell out foo.bar IS NOT DISTINCT FROM bar.bar on a join is unobvious and ugly. Something like USING (bar RESPECT NULLS) would be so much nicer.)
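
    In other words, today's spelling versus the wished-for sugar (table and column names made up):

        -- today: the null-safe join condition has to be spelled out
        SELECT *
        FROM foo
        JOIN bar
          ON foo.bar IS NOT DISTINCT FROM bar.bar;

        -- wished for: FROM foo JOIN bar USING (bar RESPECT NULLS)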

  • I can't quite put my finger on it, but I think many people see this as two operating modes, for lack of a better term. The more monolithic and enterprisey your solution (with a bespoke DBMS), the more it leans toward things more complex than an index and maybe a couple of triggers. The more microservices-y it gets, where every small service owns its own database (and only half of them are RDBMSs), the less complex code is desired in the DB itself, because you're also migrating off single instances/clusters a lot (taking a relatively dumb data dump with you, or even just adding new replicas, ship-of-Theseus-like).

  • PRQL is awesome. There's a similar competitor whose name I can't find now.

Doing it in pure SQL is really impressive, but I think the real tell-tale sign of peak "cracked engineer energy" is the maintained, decade-old blogspot site. I can't exactly put my finger on it, but it really gives off "niche mastery". I don't even know the authors, but I'm sure that in the right circles, a few dudes maintaining a blogspot site called "database architects" for a decade probably don't need an introduction.

Long ago I was interviewing for an operations job, and their "leetcode" interview question was to create an invoice report over some fairly large public dataset. Because of the size of the dataset and the way they wanted it sliced and diced, it wasn't just a straightforward set of JOINs; they clearly wanted you to do some simple SQL queries and then have some code loops that sliced and diced it all.

That wasn't a solution the data scientist in me (which I'm not) was OK with. I ran my high-level thoughts about the problem past a data scientist buddy of mine and he said "sounds like you're on the right track", but not being a data scientist, it was hard for me to put together a solution.

I told them that if I was asked in my work capacity to do something like this, I'd probably be reaching for a reporting package like Crystal Reports, but I haven't touched it in ~30 years. "Sure, I get that, just write some code to generate the report".

I had written all the Ansible and related goodies to spin up an instance, get MySQL set up and configured, and figured the "right" solution was there in the SQL arcana. I played with it and played with it, mostly writing off the job because what kind of company judges a sys admin based on building a reporting package? They had set the expectation that the homework should take 2-4 hours, and I chewed on it for longer than that and finally said "thanks but no thanks, this is outside my wheelhouse".

But I kept chewing on the problem. A couple weeks later I had the right set of materialized views and queries and joins that not only solved the problem entirely in SQL, but solved it QUICK.

SQL is amazing, but there are a lot of tricks of the trade you've got to have at your fingertips to make it fly. I'm still convinced that a lot of it is just throwing a bunch of things at the wall until something sticks.

I wrote a cubing/containerization system in SQL, but it did use the additional capabilities of a sproc for looping etc., so not just a single SQL statement.

But the problem actually mapped well to SQL's capabilities in some ways. The basic algorithm was:

1-Loop through items in descending size order

2-Test X different positions and orientations such that one corner is butted up against an existing corner (of container, or previously placed items, etc.)

3-Choose the winner based on some sort of point system that encapsulates a set of rules (e.g. preferred orientation, or minimizing small gaps between sides of items, etc.)

These were some aspects that lined up well with SQL:

A-Testing X different positions and orientations with one relatively simple statement, using as input the existing corner points from previously placed items and a control table with the variations in X, Y, Z orientations.

B-The ability to maintain a set of open options across iterations (e.g. item #2 was tested in these positions with these stats, and all are temporarily considered reasonable options), add the new item to each of those previous tests, and accumulate stats. It was pretty easy to configure a "window" of how many open options were being tracked, to optimize the result without overwhelming memory. The SQL to include the options was almost the same as if there were only one previously chosen option.
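
To make aspect A concrete, here's a sketch of what that statement might look like (tables, container dimensions, and the scoring rule are all invented; this is not the original code): cross join open corner points with an orientation control table, keep the placements that fit, and rank them:

    SELECT c.cx, c.cy, c.cz, o.orient_id,
           c.cz * 100 + c.cy + c.cx AS score   -- toy rule: prefer low, back, left
    FROM corners c                             -- open corner points so far
    CROSS JOIN orients o                       -- o.dx/dy/dz = item dims, permuted
    WHERE c.cx + o.dx <= 120                   -- fits within container width
      AND c.cy + o.dy <= 100                   -- ...depth
      AND c.cz + o.dz <= 110                   -- ...height
    ORDER BY score
    LIMIT 5;                                   -- keep a small window of options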

Some aspects were a bit painful, mentally shifting the model to fit in with SQL.

I was crazy enough to try this as well this year. It would be an extreme stretch to consider me a SQL expert, but I did make it to Day 14.

I agree with all the points in the post. It's really not that bad. If I had more time to devote to it, I think I could have reasonably completed more.

Here's my writeup; it contains a link to the repo if you want to see some of the solutions.

https://github.com/ty-porter/advent-of-code/tree/master/2024...

The problem with simple silver-bullet solutions is that the best ones can handle maybe 80% of reasonable requirements... but it's absolutely impossible to solve any problem that happens to fall in the remaining 20%. This fact is as true as ever, and yet the marketing of silver bullets is at an all-time high.

I teach an introductory coding course and I was surprised when one student asked me to give them some reasons why learning to code is useful instead of just using a platform like Wix. A similar question came up again (concerning a different website-building platform) in a different discussion with a family member.

I was kind of shocked that the distinction between a website and a data-driven application seems to have faded out of many people's consciousness. I'm guessing you can add dynamic widgets to Wix and other similar platforms which make it seem like you can build complete apps, but they can rarely get you all the way to your goal for a long-term project... And when you hit that wall, you almost need to learn the entire field of computer science from scratch just to get that last 10% of requirements implemented. You go from not having to know anything at all to literally understanding everything about computer science just to get the last 10%, or hiring someone (and hoping they have the skills you need).

The data parsing must have been painful. High wizardry.

  • I feel like it'd be entirely acceptable to load the inputs into tables first and still qualify as pure SQL, because string parsing in SQL is so blergh.

    • Input in AoC is always well-formed, and you can always use a regexp. Parsing seems like the smallest problem to me, once you get past that recursive one-row-per-line trick in the posted solutions.

      1 reply →

  • The data parsing is just one recursive query to read one line at a time, and it is done pretty much the same way in each problem.
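
    For the curious, a Postgres-flavored sketch of that trick, assuming the raw input is loaded as a single text value in input(raw):

        WITH RECURSIVE lines(line, rest) AS (
          SELECT split_part(raw, E'\n', 1),                          -- first line
                 substr(raw, length(split_part(raw, E'\n', 1)) + 2)  -- everything after it
          FROM input
          UNION ALL
          SELECT split_part(rest, E'\n', 1),
                 substr(rest, length(split_part(rest, E'\n', 1)) + 2)
          FROM lines
          WHERE rest <> ''                                           -- stop at end of input
        )
        SELECT line FROM lines;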

When you realize that recursive SQL allows iterative calculation, you can pretty much do whatever you want, no matter how bad.
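
For example, a loop with an accumulator is just a recursive CTE that carries its state along in columns (Postgres syntax; Fibonacci as the toy example):

    WITH RECURSIVE fib(n, a, b) AS (
      SELECT 1, 0::bigint, 1::bigint   -- the loop variables, as columns
      UNION ALL
      SELECT n + 1, b, a + b           -- one loop iteration per row
      FROM fib
      WHERE n < 20                     -- the loop bound
    )
    SELECT n, a AS fib_n FROM fib;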

It takes an incredible human to do something like this. Pure art. Not enough of that in the programming world.

  • Thomas is the greatest database systems researcher in the world. He is an incredible human.