Comment by ryeguy_24
7 months ago
Isn’t there a whole bunch of dependency here on prompting and methodology that would significantly impact overall performance? My gut instinct is that there are many, many ways to architect this around the LLMs, and each might yield a different level of accuracy. What do others think?
Edit: On reading more, I guess this is meant to be a dumb benchmark to monitor over time. Maybe that’s the aim here, rather than viability as an auto-close tool.