Comment by bpodgursky

5 days ago

What do you think is wrong about this? It matches my experience pretty well.

Short window, small and unrepresentative data pool, and cherry-picking the longest 0.1% of turn times without showing that turn time is a proxy for autonomy.

Looks to me like fishing for data that looks good.

  • Most tasks simply don't take that long.

    Even though I sometimes have 30-45 minute tasks, the vast majority of my use is quick questions or tiny bugfixes. It wouldn't be helpful to measure those: they are essentially a solved problem, and the runtime is limited by the complexity of the task, not by model capabilities.