Comment by stefano

1 year ago

Those are some very big claims with respect to performance. Has anyone outside of the author been able to reproduce the claims, considering you need to pay 100k/month just to do it?

I also wonder if the commercial version has anti-benchmark clauses like some database vendors. I've always seen claims that K is much faster than anything else out there, but I've never seen an actual independent benchmark with numbers.

Edit: according to https://mlochbaum.github.io/BQN/implementation/kclaims.html, commercial licenses do indeed come with anti-benchmark clauses, which makes it very hard to take the one in this post at face value.

I used to use K professionally inside a hedge fund a few years back. Aside from the terrible user experience (if your code isn’t correct you will often just get ‘error’ or ‘not implemented’ with no further detail), if the performance really were as stellar as claimed, there wouldn’t need to be a no-benchmark clause in the license. It can be fast, if your data is in the right formats, but not crazy fast. And it's easy to beat if you can run your code on the GPU.

  • The last point is spot on... Pandas on GPUs (cudf) gets you both the perf + usability, without having to deal with issues common to stack/array languages (k) and lazy languages (dask, polars). My flow is pandas -> cudf -> dask cudf, spark, etc.

    More recently, we have been working on GFQL (graph dataframe query language) with users at places like banks, where we translate down to tools like pandas & cudf. A big "aha" is that columnar operations are great -- not far from what array languages focus on -- and that having a static/dynamic query planner to optimize around them helps once you hit memory limits. E.g., dask has dynamic DFS reuse of partitions as part of its work stealing, and more SQL-y tools like Spark may make plans like that ahead of time. In contrast, that work lands on the user if they stick with pandas or k, e.g., manual tiling.
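
    A rough sketch of that flow, with made-up column names (the `notional`/`vwap` bits are purely illustrative):

      import pandas as pd  # with cudf, swap `pd` for `cudf` and the same code runs on GPU

      # Hypothetical trades table in columnar form.
      trades = pd.DataFrame({
          "sym": ["A", "B", "A", "B", "A"],
          "px":  [10.0, 20.0, 10.5, 19.5, 11.0],
          "qty": [100, 50, 200, 75, 150],
      })

      # Columnar aggregate: one pass over the relevant columns per group.
      g = (trades.assign(notional=trades.px * trades.qty)
                 .groupby("sym")
                 .agg(notional=("notional", "sum"), qty=("qty", "sum")))
      g["vwap"] = g.notional / g.qty

      # "Manual tiling": chunk the data yourself and combine partial
      # aggregates -- the work a planner like dask's would otherwise do.
      def chunked_qty_sums(chunks):
          partials = [c.groupby("sym").qty.sum() for c in chunks]
          return pd.concat(partials).groupby(level=0).sum()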

  • I've been using kdb/q since 2010. Started at a big bank and have used it ever since.

    Kdb/q is like minimalist footwear. But you can run longer and faster with it on. There's a tipping point where you just "get it". It's a fantastic language and platform.

    The problem is very few people will pay 100k/month for shakti. I'm not saying people won't pay and it won't be a good business. But if you want widespread adoption you need to create an ecosystem. Open sourcing it is a start; creating libraries and packages comes after. The MongoDB model is the right approach IMO.

    • Can you elaborate on what Mongo did right? My understanding is that AWS is stealing their business by creating a compatible API.

      1 reply →

  • Would you recommend K?

    Is something else better (if so what)?

    • I think the main reason to use any of these array languages (for work) is job security. Since it's so hard to find expert programmers, if you can get your employer to sign off on an array language for all the mission-critical stuff then you can lock in your job for life! How can they possibly replace you if they can't find anyone else who understands the code?

      Otherwise, I don't see anything you can do in an array language that you couldn't do in any other language, albeit more verbosely. But I believe in this case a certain amount of verbosity is a feature if you want people to be able to read and understand the code. Array languages and their symbol-salad programs are like the modern-day equivalent of medieval alchemists writing all their lab notes in a bespoke substitution cipher. Not unbreakable (like modern cryptography), but a significant enough barrier to dissuade all but the most determined investigators.

      As an aside, I think the main reason these languages took off among quants is that investing as an industry tends toward the exaltation of extremely talented geniuses. Perhaps unintelligible "secret sauce" code has an added benefit of making industrial espionage more challenging (and of course if a rival firm steals all your code they can arbitrage all of your trades into oblivion).

      4 replies →

    • Are you a quant, or doing very specialized numerical things on 2D datasets that aren't covered by libs? Then 10X yes. If not, no; everything else will be better.

      2 replies →

Quick for building a website? Probably not.

Quick for evaluating some idea you just had if you are a quant? Yes absolutely!

So imagine you have a massive dataset, and an idea.

For testing out your idea you want that data in an “online analytical processing” (OLAP) kind of database. These typically store the data by column rather than by row, among other tricks to speed up crunching through reads, trading off write performance and so on.
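
A toy illustration of the column-store trade-off, in numpy (the schema is made up):

  import numpy as np

  # Column-oriented: each field is its own contiguous array.
  px  = np.random.rand(10_000_000)           # price column
  qty = np.random.randint(1, 100, px.size)   # quantity column

  # A read-side aggregate only touches the columns it needs,
  # scanning contiguous memory:
  total_notional = float(np.dot(px, qty))

  # The write-side cost: appending one "row" means growing every
  # column array, which is why OLAP stores batch their writes.
  px  = np.append(px, 42.0)   # O(n) copy per column in this naive form
  qty = np.append(qty, 7)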

There are several big tech choices you could make. The mainstream king is SQL.

Something that was trendy a few years ago, during the NoSQL revolution, was to write some Scala at the REPL.

It is these that K is competing with, and outperforming.

  • I would probably use Matlab for that sort of stuff tbh. Is K faster than Matlab?

    • For using a heap of libs to compute something that has been done a thousand times? No. For writing a completely new data processing pipeline from scratch? Much, much faster! Array langs have decent performance, but that's not why they are used. They are used because you develop faster for the kind of problem they're good at. People always conflate those two aspects. I use J for scientific research.

      10 replies →

    • I think some things will be faster, and the real difference is more in what is easy to express in the language. E.g., solving differential equations may be easier in Matlab, and as-of joins in k.
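
      For anyone who hasn't met one: an as-of join matches each row with the most recent prior row from another table. A sketch in pandas with made-up trade/quote data (q's `aj` does the same job):

        import pandas as pd

        trades = pd.DataFrame({
            "time": pd.to_datetime(["09:30:01", "09:30:05"]),
            "sym":  ["A", "A"],
            "px":   [10.01, 10.04],
        })
        quotes = pd.DataFrame({
            "time": pd.to_datetime(["09:30:00", "09:30:03", "09:30:04"]),
            "sym":  ["A", "A", "A"],
            "bid":  [10.00, 10.02, 10.03],
        })

        # Each trade picks up the latest quote at or before its timestamp.
        # Both frames must already be sorted by "time".
        joined = pd.merge_asof(trades, quotes, on="time", by="sym")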

Array languages are very fast for code that fits sensibly into arrays, and databases spend a lot of compute time getting correctness right. 100x faster than Postgres on arbitrary data sounds unlikely-to-impossible, but on specific problems it might be doable.

  • Yes, when I took a look at shakti's database benchmarks before, they seemed entirely reasonable with typical array language implementation methods. I even compared shakti's sum-by benchmarks to BQN group followed by sum-each, which I'd expect to be much slower than a dedicated implementation, and it was around the same order of magnitude (like 3x slower or something) when accounting for multiple cores. I was surprised that something like Polars would do it so slowly, but that's what the h2o benchmark said... I guess they just don't have a dedicated sum-by implementation for each type. I think shakti may have less of an advantage with more complicated queries, and doing the easy stuff blazing fast is nice, but those queries are probably only a small fraction of the typical database workload anyway.
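
    For concreteness, "sum-by" is just a grouped sum, and the generic-vs-dedicated gap is visible even from Python (made-up data; assumes every key actually occurs):

      import numpy as np
      import pandas as pd

      keys = np.random.randint(0, 100, 1_000_000)  # small integer key space
      vals = np.random.rand(keys.size)

      # Generic route: hash-based groupby over a frame.
      generic = pd.DataFrame({"k": keys, "v": vals}).groupby("k").v.sum()

      # Dedicated route for small integer keys: a single fused pass,
      # no hash table -- the kind of kernel that makes easy queries fast.
      dedicated = np.bincount(keys, weights=vals, minlength=100)

      assert np.allclose(generic.to_numpy(), dedicated)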

The comparison posted may be true, but on some level it doesn't make sense. It's like this old awk-vs-hadoop post https://adamdrake.com/command-line-tools-can-be-235x-faster-...

Yes, a very specific implementation will be faster than a generic system that includes network delays and ensures you can handle data larger than your memory in a multi-client system. But the result is meaningless - Perl or awk will also be faster here.

If you need a database system, you're not going to replace it with K, if you need super fast in-memory calculations, you're not going to use a database system. Apples and oranges.

They at least used to have free evaluation licenses that were good for a month. Our license was even unlocked for unlimited cores.

I doubt they'd give them out to a random individual or small startup, but maybe still possible for a serious potential customer.