Comment by embedding-shape
6 hours ago
> I guess the goal is to test the models and not the harness
Less important than the harness are the system/user prompts themselves (which, of course, go into the harness), and those are effectively what this study seems to be testing. With a better prompt, I'm sure the models would look much more alike, since in my experience the biggest/best models have more or less identical, strong prompt adherence.