Comment by SparkyMcUnicorn

1 day ago

https://marginlab.ai/trackers/claude-code-historical-perform...

8 comments

SparkyMcUnicorn

I don't believe that trackers like this are trustworthy. There's an enormous financial motive to cheat and these companies have a track record of unethical conduct.

If I were VP of Unethical Business Strategy at OpenAI or Anthropic, the first thing I'd do is put in place an automated system that flags accounts, prompts, IPs, and usage patterns associated with these benchmarks and routes their traffic to a dedicated compute pool that wouldn't be affected by any performance-degrading changes.

codezero 1 day ago

The performance degradation I've seen isn't in quality or completion but in duration: I get good results, but much less quickly than I did before 4.6. Still, it's just anecdata, though a lot of folks seem to feel the same.

refulgentis 1 day ago

Been reading posts like these for 3 years now. There are multiple sites with numbers. I’m willing to buy “I’m paying rent on someone’s agent harness and god knows what’s in the system prompt rn”, but in the face of numbers, you've gotta discount the anecdotal.
- codezero 10 hours ago
  
  You're probably right. More likely, for some period of time I forgot that I had switched from Sonnet to the large-context Opus, which wasn't needed for the level of complexity of my work.
- coldtea 17 hours ago
  
  Yeah, why trust your actual experience over numbers? Nothing surer than synthetic benchmarks.
  

andai 17 hours ago

This just looks like random noise to me? Is it also random on short timespans, like running it 10x in a row?
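For a rough sense of how much run-to-run spread pure chance would produce, here is a minimal sketch. It assumes each benchmark run is a set of independent pass/fail tasks, and it uses a normal-approximation interval on the pass rate. The function names and the task counts are hypothetical, not taken from the tracker; its real methodology may differ.

```python
import math


def pass_rate_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate (normal approximation).

    Assumes independent pass/fail tasks; z=1.96 corresponds to ~95% coverage.
    """
    p = passes / total
    se = math.sqrt(p * (1 - p) / total)  # binomial standard error
    return max(0.0, p - z * se), min(1.0, p + z * se)


def looks_like_noise(passes_a: int, total_a: int, passes_b: int, total_b: int) -> bool:
    """True if the two runs' intervals overlap, i.e. the gap could be chance."""
    lo_a, hi_a = pass_rate_interval(passes_a, total_a)
    lo_b, hi_b = pass_rate_interval(passes_b, total_b)
    return lo_a <= hi_b and lo_b <= hi_a


# Hypothetical runs of 50 tasks each (made-up numbers, for illustration only).
print(looks_like_noise(25, 50, 30, 50))  # 50% vs 60% pass rate
print(looks_like_noise(10, 50, 40, 50))  # 20% vs 80% pass rate
```

With only 50 tasks per run, the interval around a 50% pass rate is roughly plus or minus 14 points, so a day-to-day swing of 10 points would not be distinguishable from noise under these assumptions.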

SparkyMcUnicorn 3 hours ago

Explained in the methodology at the bottom of this page: https://marginlab.ai/trackers/claude-code/