Comment by scosman

2 days ago

> It’s a bit strange how anecdotes have become acceptable fuel for 1000 comment technical debates.

Progress is so fast right now that anecdotes are sometimes more interesting than proper benchmarks. "Wow, it can do impressive thing X" is more interesting to me than a 4% gain on SWE-bench Verified.

In the early days of a startup, "this one user is spending 50 hours/week in our tool" is sometimes more interesting than global metrics like average time in app. In the early, fast-moving days, the potential is more interesting than the current state. There's work to be done to make that one user's experience apply to everyone, but knowing that it can work is still a huge milestone.

At this point I believe the anecdotes more than the benchmarks, because I know the LLM devs train the damn things on the benchmarks.

A benchmark? Probably gamed. A guy made an app to right-click and convert an image? Probably true. I have to assume it may have a lot of issues, but prima facie I just make a mental note that this is possible now.