Comment by nicebyte
5 months ago
How did you draw that conclusion from reading the contents of the link? This is a benchmark.
> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.