Comment by n_e

14 hours ago

I process TB-size ndjson files. I want to use jq to do some simple transformations between stages of the processing pipeline (e.g. rename a field), but it is so slow that I write a single-use Node or Rust script instead.
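For readers wondering what such a single-use script looks like, here is a minimal sketch in Python (not the Node or Rust the commenter mentions). The function name `rename_field` and the command-line shape are hypothetical, and it only handles top-level keys; it streams one line at a time, so memory use stays flat regardless of file size.

```python
import json
import sys
from typing import IO, Iterable


def rename_field(lines: Iterable[str], old: str, new: str, out: IO[str]) -> int:
    """Stream NDJSON records, renaming the top-level key `old` to `new`.

    Returns the number of records written. Hypothetical helper, for illustration.
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if isinstance(record, dict) and old in record:
            # Rebuild the dict so the renamed key keeps its original position.
            # If `new` already exists in the record, the later occurrence wins.
            record = {(new if k == old else k): v for k, v in record.items()}
        out.write(json.dumps(record, separators=(",", ":")) + "\n")
        count += 1
    return count


# Assumed usage: python rename_field.py OLD NEW < in.ndjson > out.ndjson
if __name__ == "__main__" and len(sys.argv) == 3:
    rename_field(sys.stdin, sys.argv[1], sys.argv[2], sys.stdout)
```

Whether this beats jq on a given workload is something to measure rather than assume; the point is only that the transformation itself is a few lines.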

I would love, _love_ to know more about your data formats, your tools, what the JSON looks like, basically as much as you're willing to share. :)

For about a month now I've been working on a suite of tools for dealing with JSON, written specifically for an imagined audience of "people who like CLIs or TUIs, have to deal with PILES AND PILES of JSON, and care deeply about performance".

For me, I've been writing them just because it's an "itch". I like writing high-performance, efficient software, and there are a few gaps whose existence bugged me and that I knew I could fill.

I'm having fun and will be happy when I finish, regardless, but it would be so cool if it happened to solve a problem for someone else.

  • I maintain some tools for the videogame World of Warships. The developer has a file called GameParams.bin which is Python-pickled data (their scripting language is Python).

    Working with this is pretty painful, so I convert the Pickled structure to other formats including JSON.

    Prettified, the file has always been around 500MB, but it recently started expanding to about 3GB, I think because they’ve added extra regional parameters.

    The file inflates so much because Pickle memoizes objects and stores repeated references only once, whereas that sharing is obviously lost in JSON.

    I care about speed and about tools not choking on large inputs, so I use jaq for querying and instruct LLMs operating on the data to do the same.
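The size blow-up described above is easy to reproduce. A minimal sketch, using a made-up parameter block (the names `shared`, `params`, and the `ship_N` keys are hypothetical stand-ins, not the real GameParams.bin layout): one object referenced many times is written once by `pickle`, but once per reference by `json`.

```python
import json
import pickle

# Hypothetical stand-in for a parameter block referenced by many entries.
shared = {"name": "example", "values": list(range(1000))}

# 100 keys that all point at the very same dict object.
params = {f"ship_{i}": shared for i in range(100)}

pickled = pickle.dumps(params)
as_json = json.dumps(params)

# pickle writes the shared dict once and emits back-references afterwards,
# so the object identity survives a round trip.
restored = pickle.loads(pickled)
shared_after_pickle = restored["ship_0"] is restored["ship_1"]

# JSON has no notion of references: every occurrence is serialized in full
# and comes back as an independent copy.
reloaded = json.loads(as_json)
shared_after_json = reloaded["ship_0"] is reloaded["ship_1"]

print(f"pickle: {len(pickled)} bytes, json: {len(as_json)} bytes")
print(f"shared after pickle: {shared_after_pickle}, after json: {shared_after_json}")
```

The exact ratio depends on how much of the real file is shared structure, so treat this as an illustration of the mechanism rather than an estimate for any particular file.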

This reminds me of someone who wrote a regex tool that matches by compiling regexes (at runtime of the tool) via LLVM to native code.

You could probably do something similar for a faster jq.

This isn't for you then

> The query language is deliberately less expressive than jq's. jsongrep is a search tool, not a transformation tool-- it finds values but doesn't compute new ones. There are no filters, no arithmetic, no string interpolation.

Mind me asking what sorts of TB json files you work with? Seems excessively immense.

  • > Uses jq for TB json files

    > Hadoop: bro

    > Spark: bro

    > hive: bro

    > data team: bro

    • jq is very convenient, even if your files are more than 100GB. I often need to extract one field from huge JSON Lines files, so I just pipe them through jq to get results. It's slower, but implementing proper data processing would take more time.

Now I'm really curious. What field are you in that ndjson files of that size are common?

I'm sure there are reasons against switching to something more efficient–we've all been there–I'm just surprised.

  • > Now I'm really curious. What field are you in that ndjson files of that size are common?

    I'm not OP, but structured JSON logs can easily result in humongous ndjson files, even with a modest fleet of servers over a not-very-long period of time.

    • So what's the use case for keeping them in that format rather than something more easily indexed and queryable?

      I'd probably just shove it all into Postgres, but even a multi terabyte SQLite database seems more reasonable.
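As a rough sketch of the "shove it into SQLite" idea, using Python's standard `sqlite3` module: the table layout and the `ts` and `level` field names are assumptions about what a structured log line might contain, not anything stated in the thread. The raw line is kept alongside the extracted columns so nothing is lost.

```python
import json
import sqlite3
from typing import Iterable


def load_logs(conn: sqlite3.Connection, lines: Iterable[str],
              batch_size: int = 10_000) -> None:
    """Load NDJSON log lines into an indexed table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, level TEXT, raw TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS logs_level_ts ON logs (level, ts)")
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # Pull out the fields we expect to filter on; keep the full line too.
        batch.append((rec.get("ts"), rec.get("level"), line))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", batch)
    conn.commit()
```

Once loaded, a query such as `SELECT raw FROM logs WHERE level = 'error' ORDER BY ts` can use the index instead of rescanning the whole file, which is the trade-off being argued for here: pay the ingest cost once, then query cheaply.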
