Comment by girvo

1 day ago

Benchmarks are basically straight-up meaningless at this point, in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

3 comments

phatfish 10 hours ago

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.

avereveard 21 hours ago

What's your opinion of glm5, if you've had a chance to use it?

girvo 19 hours ago

I haven’t yet, though I will be this weekend!