Comment by viksit
5 days ago
mostly aligned on this. couple of thoughts:
- raw accuracy is now a "vanity" metric. so the benchmarks need to get more sophisticated, and i think they're going to have to be far more task specific than hotpot or hover. they've become like the mnist of multi hop.
- in my use of MIPROv2 and SIMBA, I see a fair amount of improvements for multi hop tasks (published some of these on hn before). I'm going to try GEPA and see how it performs. so I think we're at the start of what I would call "meta learning".. tuning across a huge search surface rather than tweaking one prompt. hyper param search for higher dim spaces.
- tokens burned should be a reported result
No comments yet
Contribute on Hacker News ↗