Comment by chatmasta

5 months ago

Yeah, I feel like these libraries are all one level lower than what I’m asking for. We need something that makes more assumptions (e.g. “I’m running as a component of some kind of database”) but… makes fewer decisions? Is more flexible? Idk. This is the hard part.

DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings enough batteries that it’s not a super generic thing that does nothing useful out of the box, but not so many that it needs to compete with full database systems.

I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.

Maybe what we need as a prerequisite is the equivalent of the Arrow/Parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference: Arrow and Parquet are a solid, “good enough” choice for in-memory and storage formats that are efficient, flexible, and well-supported. Is there something similar for vector storage?
