Comment by visarga

4 days ago

My chunk rewriting method is to use an LLM to generate a title, summary, keyword list, topic, parent topic, and gp (grandparent) topic. Then I embed the concatenation of all of them instead of just the original chunk. This helps a lot.
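
A minimal sketch of this kind of pre-embedding expansion, assuming the OpenAI Python client; the prompt wording, field names, and model choices are illustrative, not the commenter's exact setup:

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXPAND_PROMPT = (
    "For the text below, return JSON with the keys: "
    "title, summary, keywords, topic, parent_topic, gp_topic.\n\nText:\n{chunk}"
)

def expand_chunk(chunk: str) -> dict:
    """Ask an LLM to draw out a title, summary, keywords, and topics for a chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": EXPAND_PROMPT.format(chunk=chunk)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def embed_expanded(chunk: str) -> list[float]:
    """Embed the concatenation of the generated fields instead of the raw chunk."""
    fields = expand_chunk(chunk)
    text = " | ".join(
        str(fields.get(k, ""))
        for k in ("title", "summary", "keywords", "topic", "parent_topic", "gp_topic")
    )
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return emb.data[0].embedding
```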

One fundamental problem with cosine similarity is that it works at the surface level. For example, "5+5" won't embed close to "10". Or "The 5th word of this phrase" won't be similar to "this".
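
This is easy to check directly. A quick sketch using sentence-transformers (the model name is just a common default, not something from the comment):

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative small embedding model

pairs = [("5+5", "10"), ("The 5th word of this phrase", "this")]
for a, b in pairs:
    emb = model.encode([a, b])
    score = util.cos_sim(emb[0], emb[1]).item()
    # Scores are typically well below what "same meaning" pairs get with this model.
    print(f"cos({a!r}, {b!r}) = {score:.2f}")
```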

If there is any implicit knowledge, it won't be captured by simple cosine similarity; that is why we need to draw out those implicit deductions before embedding. Hence my approach of expanding a chunk's semantic information before embedding it.

I basically treat text like code, and have to "run the code" to get its meaning unpacked.

If you ask, "Is '5+5' similar to '10'?" the answer depends on which notion of similarity you mean; there are multiple differences: different symbols, one is an expression, the other is just a number. But if you ask, "Does '5+5' evaluate to the same number as '10'?" you will likely get what you are looking for.
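
To make the "evaluate, then compare" point concrete, here is a tiny sketch (my own illustration, not from the thread) that compares the two strings by what they evaluate to rather than by surface form:

```python
import ast
import operator

# Only simple arithmetic on numeric literals is allowed, so this stays safe to run.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def try_eval(text: str):
    """Return the numeric value of `text` if it is a simple arithmetic expression, else None."""
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return None

    def walk(n):
        if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
            return n.value
        if isinstance(n, ast.BinOp) and type(n.op) in OPS:
            return OPS[type(n.op)](walk(n.left), walk(n.right))
        raise ValueError("not a simple arithmetic expression")

    try:
        return walk(node)
    except (ValueError, ZeroDivisionError):
        return None

print(try_eval("5+5") == try_eval("10"))        # True: both "run" to the number 10
print(try_eval("The 5th word of this phrase"))  # None: ordinary prose does not evaluate
```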

How do you contextualize the chunk at re-write time?

  • The original chunk is most likely stored with it in referential form, e.g. an id in the metadata used to pull it from a DB or something along those lines. I do exactly what he does as well: I keep an id metadata value that points to a row in a DB holding the text chunks and their respective metadata. (A sketch of this pattern appears at the end of the thread.)

    • The original chunk, sure, but what if the original chunk is full of, e.g., pronouns? This is a problem I haven't heard an elegant solution for, although I've seen it done OK.

      What I mean is, how can you derive topics from a chunk that refers to them only obliquely?

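Below is a minimal, library-free sketch of the id-in-metadata pattern described in the reply above: the vector store keeps only the embedding plus a chunk id, and the id is dereferenced against a DB at query time. The schema, names, and the toy in-memory "vector store" are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, source TEXT)")

vector_index = []  # list of (embedding, metadata) pairs standing in for a real vector store

def index_chunk(text: str, source: str, embed) -> None:
    """Store the raw chunk in the DB; keep only the embedding plus an id in the index."""
    cur = db.execute("INSERT INTO chunks (text, source) VALUES (?, ?)", (text, source))
    vector_index.append((embed(text), {"chunk_id": cur.lastrowid}))

def fetch_chunk(metadata: dict):
    """At query time, dereference the id stored in the metadata to pull the original chunk."""
    return db.execute("SELECT text, source FROM chunks WHERE id = ?",
                      (metadata["chunk_id"],)).fetchone()

index_chunk("Example chunk text.", "doc1.md", embed=lambda t: [0.0])  # dummy embedding
print(fetch_chunk(vector_index[0][1]))  # -> ('Example chunk text.', 'doc1.md')
```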