Lucene is tough to deal with. About 15 hours before this comment was posted, I was giving a talk at Databricks comparing the world’s most widely used search engines. I’ve never run into as many issues with any other similar tool as I did with Lucene. To be fair, it’s been around for ~26 years and has aged remarkably well... but it’s the last thing I’d choose today.
Interesting, then, that Vectroid would choose to fork it.
Elasticsearch is at least good at hiding the Lucene zoo under the hood.
Can I ask you which alternatives exist at the layer Lucene occupies?
I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.
co-founder of Vectroid: We forked Lucene. Lucene is awesome for search in general, filters, and obviously full-text search. It’s very mature and well supported by so many big names and amazing engineers. So we take advantage of that, but we had to change a few things to make it work well for the vector use case. We basically think the vector should be the main data type, as it is the most difficult one to deal with. For instance, we modified Lucene to use a configurable number of CPUs/threads to build a single segment index. As a result, if/when needed, we can use hundreds of CPUs to index faster and generate fewer segments, which enables lower query latency. We also built a custom Directory implementation for Lucene to work off of GCS directly (or S3 later on). It can bypass the kernel, read from the network, and write directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
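For a rough sense of what "work off of GCS directly" can look like at the Lucene API level, here is a minimal sketch that copies an index out of a GCS bucket into an in-memory Lucene Directory, using the stock ByteBuffersDirectory and the google-cloud-storage Java client. This is an assumption-laden illustration, not Vectroid's implementation: their custom Directory presumably streams from the network straight into memory rather than staging whole files like this, and the bucket/prefix layout and class name here are made up.

    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexOutput;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class GcsIndexLoader {
        // Illustrative only: copy every file of a Lucene index stored under
        // gs://<bucket>/<prefix> into a heap-resident Directory, so reads never
        // touch a local SSD, mmap, or the page cache. The prefix is expected to
        // end with "/".
        public static Directory loadFromGcs(String bucket, String prefix) throws Exception {
            Storage storage = StorageOptions.getDefaultInstance().getService();
            Directory dir = new ByteBuffersDirectory();    // in-memory Lucene Directory
            for (Blob blob : storage.list(bucket, Storage.BlobListOption.prefix(prefix)).iterateAll()) {
                String fileName = blob.getName().substring(prefix.length());
                if (fileName.isEmpty()) {
                    continue;                              // skip the "folder" placeholder object
                }
                byte[] bytes = blob.getContent();          // fetch the segment file over the network
                try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
                    out.writeBytes(bytes, bytes.length);   // write it straight into memory
                }
            }
            return dir;
        }
    }

A reader can then be opened over the returned Directory with DirectoryReader.open(dir) and searched as usual; the interesting part of the real thing is presumably a streaming Directory that avoids even this staged copy.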
Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.