Comment by theptip

7 hours ago

Moving beyond the specific ground truth example, how much of the eval can be verified automatically, vs requiring a human baseline to check against?

E.g. I can imagine invariants like balancing accounts are essentially mechanical, but classifying spending categories currently requires judgement (and therefore human-curated ground truth). But I’m curious if there are approaches to reduce the latter, say by constructing a semantic graph ontology for the domain or something along those lines.
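To make the "mechanical" half concrete, here is a minimal sketch of the kind of invariant I mean. The ledger format (signed amounts, positive for debits and negative for credits) and the names `is_balanced` and `journal` are my own assumptions for illustration, not taken from any particular eval:

```python
from decimal import Decimal


def is_balanced(entries):
    """Return True when the signed amounts of all entries sum to zero.

    Assumed format: each entry is (account_name, amount_string), with
    debits positive and credits negative. Decimal avoids float rounding.
    """
    total = sum((Decimal(amount) for _, amount in entries), Decimal("0"))
    return total == 0


# Hypothetical journal produced by the system under test.
journal = [
    ("cash", "100.00"),
    ("revenue", "-100.00"),
]

print(is_balanced(journal))
```

A check like this needs no human labels at all, whereas deciding whether a transaction belongs under, say, "travel" or "meals" has no equivalent closed-form test.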

I guess there is an interesting duality here in that if you solve the eval you have also created a valuable business!