Comment by theptip

7 months ago

Moving beyond the specific ground-truth example, how much of the eval can be verified automatically, vs. requiring a human-curated baseline to check against?

E.g., I can imagine invariants like balancing accounts are essentially mechanical, but classifying spending categories currently requires judgement (and therefore human-curated ground truth). But I'm curious whether there are approaches to reduce the latter, say by constructing a semantic graph ontology for the domain or something along those lines.
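
To make the "mechanical" side concrete, here is a rough sketch of the kind of invariant check I have in mind. It assumes a double-entry ledger where each transaction is a list of signed postings that should sum to zero; the ledger shape and the names `is_balanced` and `unbalanced_transactions` are hypothetical, not taken from any particular eval.

```python
from decimal import Decimal


def is_balanced(postings):
    """Return True if the signed posting amounts sum to exactly zero.

    `postings` is a list of (account, amount) pairs, with amounts given
    as decimal strings (an assumed format, chosen to avoid float rounding).
    """
    total = sum((Decimal(amount) for _, amount in postings), Decimal("0"))
    return total == 0


def unbalanced_transactions(ledger):
    """Return the ids of transactions that violate the balance invariant.

    `ledger` maps a transaction id to its list of postings.
    """
    return [txn_id for txn_id, postings in ledger.items()
            if not is_balanced(postings)]


# Hypothetical model output to be graded: t2 is off by one cent.
ledger = {
    "t1": [("cash", "-100.00"), ("rent", "100.00")],
    "t2": [("cash", "-20.00"), ("food", "19.99")],
}
print(unbalanced_transactions(ledger))
```

A check like this needs no labelled data at all, whereas deciding whether "food" was the right category for t2 is exactly the part that still needs a human.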

I guess there is an interesting duality here in that if you solve the eval you have also created a valuable business!
