Comment by phatfish
12 hours ago
Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?
That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.