Comment by cluckindan

20 hours ago

How is this different from running tuned HNSW vector indices on Elasticsearch?

Lucene is tough to deal with. About 15 hours ago (right around when this comment was posted) I was giving a talk at Databricks comparing the world's most widely used search engines. I've never run into as many issues with any other comparable tool as I have with Lucene. To be fair, it's been around for ~26 years and has aged remarkably well... but it's the last thing I'd choose today.

  • Interesting, then, that Vectroid would choose to fork it.

    Elasticsearch is at least good at hiding the Lucene zoo under the hood.

  • Can I ask you which alternatives exist at the layer Lucene occupies?

    I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.

co-founder of Vectroid: We forked Lucene. Lucene is awesome for search in general, for filters, and obviously for full-text search. It's very mature and well supported by so many big names and amazing engineers. So we take advantage of that, but we had to change a few things to make it work well for the vector use case. We basically think vectors should be the main data type, since they are the most difficult one to deal with. For instance, we modified Lucene to use X number of CPUs/threads to build a single segment index. As a result, when needed, we can utilize hundreds of CPUs to index faster and generate fewer segments, which enables lower query latency. We also built a custom file system Directory for Lucene to work off of GCS directly (or S3 later on). It can bypass the kernel, reading from the network and writing directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
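The multi-worker single-segment idea can be sketched in a few lines. This is a hedged illustration in plain Python, not Vectroid's or Lucene's actual code: the corpus, the helper names, and the whitespace tokenization are all invented for the example. The point it demonstrates is that fanning one build across N workers and merging yields a single index structure, where N independent single-threaded builds would yield N small segments that every query must consult.

```python
# Illustrative sketch only (not Lucene/Vectroid code): build ONE index
# "segment" with several workers instead of several small segments.
from concurrent.futures import ThreadPoolExecutor

def build_partial_index(docs):
    """Build an inverted index (term -> set of doc ids) over a chunk."""
    index = {}
    for doc_id, text in docs:
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

def build_single_segment(docs, workers=4):
    """Fan the corpus out to `workers` threads, then merge the partial
    indexes into one structure, so queries touch one segment, not N."""
    chunks = [docs[i::workers] for i in range(workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(build_partial_index, chunks):
            for term, ids in partial.items():
                merged.setdefault(term, set()).update(ids)
    return merged

docs = [(0, "vector search"), (1, "full text search"), (2, "vector index")]
segment = build_single_segment(docs, workers=2)
# segment["search"] == {0, 1}; segment["vector"] == {0, 2}
```

In real Lucene the analogous knobs live in the merge scheduler and the per-format index writers, and the "merge" step for HNSW graphs is far more involved than a dictionary union; the sketch only captures the parallelism-versus-segment-count trade-off the comment describes.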

Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.