← Back to context

Comment by ewild

4 days ago

the original chunk is most likely stored with it in referential format such as an id in the metadata to pull from a DB or something along those lines. I do exactly what he does aswell and i have an Id metadata value that does exactly that pointing to an id in a DB which holds the text chunks and their respective metadata

The original chunk, sure, but what if the original chunk is full of eg pronouns? This is a problem I haven't heard an elegant solution for, although I've seen it done OK.

What I mean is, how can you derive topics from a chunk that refers to them only obliquely?

  • Before chunking, run coreference resolution to get rid of all of your pronouns and replace them with explicit references. You need to be a bit of careful to ensure you chunk both processed and unprocessed versions in the same places but it’s very doable.

    If you haven’t seen it, there’s a lovely overview of the idea in one of the SpaCy blog posts: https://explosion.ai/blog/coref