Lucene is tough to deal with. About 15 hours ago — right when this comment was posted — I was giving a talk at Databricks comparing the world’s most widely used search engines. I’ve never run into as many issues with any other similar tool as I did with Lucene. To be fair, it’s been around for ~26 years and has aged remarkably well... but it’s the last thing I’d choose today.
Can I ask you which alternatives exist at the layer Lucene occupies?
I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.
For Vector Search the top 2 are: Meta’s FAISS and (my) Unum’s USearch. Lucene powers Elastic, Solr, MongoDB Atlas, AWS OpenSearch, Azure Cognitive Search. USearch powers ClickHouse, DuckDB, YugaByte, TiDB, ScyllaDB, MemGraph, KuzuDB, Lantern, and a few big closed source names that don’t mention it, as far as I know. FAISS has the highest usage among Python developers, but if you are indexing large collections you should consider alternatives.
Interesting, then, that Vectroid would choose to fork it.
Elasticsearch is at least good at hiding the Lucene zoo under the hood.
Co-founder of Vectroid: we forked Lucene. Lucene is awesome for search in general, for filters, and obviously for full-text search. It is very mature and well supported by many big names and amazing engineers. We take advantage of that, but we had to change a few things to make it work well for the vector use case. We basically think vectors should be the main data type, as they are the most difficult one to deal with. For instance, we modified Lucene to use X CPUs/threads to build a single segment index. As a result, if and when needed, we can use hundreds of CPUs to index faster and generate fewer segments, which enables lower query latency. We also built a custom file system Directory for Lucene that works off GCS directly (S3 will follow later). It can bypass the kernel, read from the network, and write directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
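To illustrate why fewer segments can mean lower query latency: a query has to visit every segment and then merge the per-segment results. The toy sketch below is not Lucene or Vectroid code. It uses brute-force search on random vectors, and the helper names (`search_segment`, `search_index`, `split_into_segments`) are made up for illustration. It shows only the fan-out-and-merge shape. With a graph index such as HNSW, each per-segment search costs roughly the same small amount whatever the segment size, so 20 small segments would be expected to cost noticeably more than one large one.

```python
import heapq
import random


def search_segment(segment, query, k):
    """Brute-force top-k by squared L2 distance inside one segment."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), doc_id)
        for doc_id, vec in segment
    )
    return heapq.nsmallest(k, scored)


def search_index(segments, query, k):
    """Visit every segment, then merge the per-segment top-k lists."""
    partials = [search_segment(seg, query, k) for seg in segments]
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


def split_into_segments(docs, n_segments):
    """Hypothetical helper: partition docs round-robin into segments."""
    return [docs[i::n_segments] for i in range(n_segments)]


random.seed(0)
docs = [(i, [random.random() for _ in range(8)]) for i in range(1000)]
query = [0.5] * 8

one_segment = split_into_segments(docs, 1)
many_segments = split_into_segments(docs, 20)

top_one = search_index(one_segment, query, 5)
top_many = search_index(many_segments, query, 5)

# Same answer either way; the difference is how many searches and merges ran.
print(len(one_segment), "segment search vs", len(many_segments), "segment searches")
print(top_one == top_many)
```

Both layouts return the same hits because the search here is exact. What changes is the number of per-segment searches and the number of candidates merged (k per segment).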
Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.