
Comment by lmeyerov

10 hours ago

It's interesting to think of where the value comes from. Afaict 2 interesting areas:

A: One of the main lessons of the RAG era of LLMs was that reranked multiretrieval strikes a great balance of test time, test compute, and quality, at the expense of maintaining a few costly index types. Graph ended up being a nice little lift when put alongside text, vector, and relational indexing, by solving some n-hop use cases.

I'm unsure if the juice is worth the squeeze, but it does make some sense as infra. Making and using these flows isn't that conceptually complicated and most pieces have good, simple OSS around them.
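To make case A concrete, here is a minimal sketch of the fusion step, assuming each index (text, vector, graph) already returns a ranked list of document ids. Reciprocal rank fusion stands in for the reranker here; it is a common simple choice, not necessarily what any given stack uses. The hit lists and the `k=60` constant are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each doc gets 1 / (k + rank) from every list it appears in;
    k dampens the influence of top ranks (60 is a conventional default).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three separate indexes for the same query.
text_hits = ["d1", "d2", "d3"]    # keyword / BM25-style index
vector_hits = ["d2", "d4", "d1"]  # embedding index
graph_hits = ["d5", "d2"]         # n-hop neighbors of entities in the query

fused = reciprocal_rank_fusion([text_hits, vector_hits, graph_hits])
print(fused)
```

The graph index only has to contribute candidates the other indexes miss (d5 here); the fusion step decides how much they matter.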

B: There is another universe of richer KG extraction with even heavier indexing work. I'm less clear on the ROI here in typical benchmarks relative to case A. Imagine going full RDF, vs the simpler property graph queries & ontologies here, and investing in heavy entity resolution etc preprocessing during writes. I don't know how well these improve scores vs regular multiretrieval above, and how easy it is to do at any reasonable scale.
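As a toy illustration of what "entity resolution during writes" means in case B, the sketch below canonicalizes entity names before storing edges, so that variant spellings land on the same node. It assumes exact matching on normalized strings, which is far simpler than a real resolution pipeline (fuzzy matching, embeddings, human review); `EntityIndex` and `normalize` are hypothetical names.

```python
import re


def normalize(name):
    """Lowercase and collapse punctuation/whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


class EntityIndex:
    """Toy write-time resolver: maps name variants to one canonical id."""

    def __init__(self):
        self.canonical = {}  # normalized name -> entity id
        self.edges = set()   # (src_id, relation, dst_id)

    def resolve(self, name):
        key = normalize(name)
        # Reuse the existing id if seen before, else assign the next one.
        return self.canonical.setdefault(key, len(self.canonical))

    def add_edge(self, src, rel, dst):
        self.edges.add((self.resolve(src), rel, self.resolve(dst)))


idx = EntityIndex()
idx.add_edge("Acme Corp.", "acquired", "Widget-Co")
idx.add_edge("ACME corp", "hired", "Jane Doe")  # same entity as "Acme Corp."
print(len(idx.canonical), sorted(idx.edges))
```

Even this trivial version shifts cost to the write path, which is the tradeoff in question: the heavier the resolution, the more each write costs.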

In practice, a lot of KG work lives outside of the DB and agent, in a much fancier KG pipeline. So there is a missing layer here, with less clear proof of value and the burden of showing it.

--

Separately, we have been thinking about these internally. We have been building gfql, OSS GPU Cypher queries on dataframes etc. without needing a DB -- reuse existing storage tiers by moving into an embedded compute tier -- and powering our own LLM usage has been a primary internal use case for us. Our experiences have led us to prioritize case A as a next step for what the graph engine needs to support inside, and to view case B as something that should live outside of it in a separate library. This post does make me wonder if case B should move closer into the engine to help streamline things for typical users, akin to how solr/lucene/etc helped make elastic into something useful early on for search.

I'm conceptually very bullish on B (entity resolution and hierarchy pre-processing during writes). I'm less certain that A and B need to be merged into a single library. Obviously, a search agent should know the properties of the KG being searched, but as the previous poster mentioned, these graph DBs are inherently inaccurate and only form part of the retrieval pattern anyway.

  • Maybe it's useful to split out B1) KG pipelines from the choice of B2) simple property graph ontologies & queries vs advanced RDF ontologies and SPARQL queries

    It sounds like you are thinking about KG pipelines, but I'm unclear on whether, in your view, typed property graphs are enough on the graph engine side or more advanced RDF/SPARQL is needed?