Comment by girvo
1 day ago
Benchmarks are basically straight up meaningless at this point, in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead, they're merely decent when you use them in anger for real work.
I'll withhold judgement until I've tried to use it.
Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?
That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.
What's your opinion of glm5, if you've had a chance to use it?
I haven’t yet, though I will this weekend!