
Comment by 0xferruccio

14 hours ago

DuckDB is amazing for any sort of fast data analysis when the data is small enough that it can fit on your laptop

Recently at work I've been using it to analyse the Claude Code sessions of every engineer at our company (which we upload to S3), and it's been extremely helpful for finding gaps in devex and giving us clear metrics to back up the impact of fixing them

Another thing it's been really useful for is getting metrics on Claude skills usage and then diving into use cases by looking at the transcripts

Other engineers that had never touched DuckDB were so impressed with how easy it is for AI agents to write queries on our dataset

>> DuckDB is amazing for any sort of fast data analysis when the data is small enough that it can fit on your laptop

I agree, and the dirty (not so) secret that big data providers like Snowflake try to hide is that the majority of your work is not big data and WILL fit on your local machine. My last company was spending $2M/yr on a contract with Snowflake, and another million between Fivetran and Matillion. Of the 1200 clients using analytics, maybe 2 had enough data to warrant "infinite scalability" and a dozen wanted Snowflake because they already had corporate warehouses in Snowflake (they probably didn't need it either). Turns out the Extract and Load could be handled by bog-standard C# code and a bunch of SQL, while almost everyone was better off with a DuckDB database running locally, often in the browser. You've probably heard YAGNI before (You Ain't Gonna Need It), but it applies even more to "Big Data". #SmallDataConvert

  • Folks have been beating this drum for as long as I've worked in software, dating to the Hadoop era, and it remains true today. So much of "big data" only appears big because it's poorly stored, or is represented wastefully (in persistent storage or in memory).
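To make the "represented wastefully" point concrete, here is a minimal sketch using only the Python standard library. The field name `sensor_reading` and the synthetic values are made up for illustration; the function stores the same numbers as row-oriented JSON, as a packed column of 16-bit integers, and as that packed column after zlib compression, then reports the byte counts.

```python
import json
import zlib
from array import array


def storage_sizes(n: int = 100_000) -> dict[str, int]:
    """Return the byte size of the same n integers in three representations."""
    # Synthetic data (an assumption for the demo): small repeating integers.
    values = [i % 1000 for i in range(n)]

    # Wasteful: one JSON object per row, repeating the key name every time.
    as_json = json.dumps([{"sensor_reading": v} for v in values]).encode("utf-8")

    # Compact: one packed column of unsigned 16-bit integers.
    packed = array("H", values).tobytes()

    # Compact and compressed: the same column run through zlib.
    compressed = zlib.compress(packed)

    return {
        "json_rows": len(as_json),
        "packed_column": len(packed),
        "packed_compressed": len(compressed),
    }


if __name__ == "__main__":
    for name, size in storage_sizes().items():
        print(f"{name:>18}: {size:>10,} bytes")
```

The exact ratios depend on the data, but the ordering (row JSON, then packed column, then compressed column) is what columnar formats like parquet exploit.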

Like sqlite, duckdb is underappreciated as a production database. You can totally run it on servers or even "serverless" and do some heavy data transformations or, with the right server size, work with large-scale datasets (up to a TB compressed seems fine).

  • This. I've recently used both duckdb and sqlite to power a dashboard for a small restaurant owned by a family member. It converts all their sales to very tiny parquet files, daily.

    The file fits in memory, so all sorts of computation can be done in the browser itself. The backend is extremely simple, it just loads the JS and serves the parquet files.

    It was also trivial to let the owner do their own queries, just give the schema to an LLM and let it use the charting library, no data hallucinations. If they need it in the dashboard they can either use that one or ask me to review that query.

    To be honest, given how simple some things became, it's been really fun to work on.

    • Similar experience here. The best thing I've built in a long time is replacing a complex (and scary) permissions system built on top of Snowflake with single role duckdb databases that - aside from no longer worrying about bugs leaking data across roles - are more performant, timely and flexible. Combined with the use of AI this is the way forward IMO.

      At the other end of the spectrum, working with random data on "what if?" and exploration tasks with DuckDB is fun again. It's so straightforward and fast, with tools and functions for pretty much everything.

    • > no data hallucinations

      Dangerous thing to assert. It’ll happily run SQL that works, but doesn’t necessarily correspond to intentions or unstated assumptions about the data.


    • I have a theory that LLMs are going to be the death knell of big SaaS. It's so much harder to build and maintain a massive SaaS that does 80% of what 80% of your customers want than it is to build something small and simple that does 100% of what one customer wants.


  • Not to mention it can query across heterogeneous sources, so the same query can use a duckdb table, sqlite, csv, and parquet (including predicate pushdown).

Agreed. In addition, DuckDB also works quite well for data that is too big to fit in memory or on the machine DuckDB is running on (predicate pushdown, out-of-core processing, …).

>Recently at work I've been using it to analyse the Claude Code sessions of every engineer at our company (which we upload to S3), and it's been extremely helpful for finding gaps in devex and giving us clear metrics to back up the impact of fixing them

Nice! How do you set things up so that your engineers' Claude Code sessions upload to S3? Thanks in advance for the help.

  • Probably on a Business / Enterprise plan, which has managed settings and also telemetry export. Give it a collector endpoint to export to and then have the collector send to S3.

  • If you use OpenCode, the sessions are all in a local sqlite database. After lunch I'm pushing one of my agents to crunch some data from that using duckdb...
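The thread doesn't say where that sqlite file lives or what its schema looks like, so the sketch below assumes neither. It uses Python's built-in `sqlite3` module (not DuckDB) for a first look: it lists the user tables in any sqlite file and counts their rows. The path `sessions.db` is a placeholder, not OpenCode's real location.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Map each user table in a sqlite file to its row count."""
    con = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every table; skip sqlite's internal ones.
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names cannot be bound as parameters, so quote them instead.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()


if __name__ == "__main__":
    # Placeholder path: point this at whichever sqlite file you want to inspect.
    print(table_row_counts("sessions.db"))
```

Once the table names are known, the same file can be handed to DuckDB or an agent for the heavier queries.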