Comment by BoorishBears
8 hours ago
I run an LLM-based product in a completely different space (consumer), and I think this is an essentially unsolvable part of developing products that rely on LLMs.
No matter what, power users always say the model is degrading over time*. Even when every stat I have access to says otherwise.
(* to clarify, this is outside of actual model changes)
I suspect some of it comes from the fact that growing context windows do harm performance, and early on you're more likely to be prodding at things in ways that keep the context window smaller on average.
But I also think users are inherently less reliable narrators than they realize. They say they're trying the same tasks, but it may be the "same task" applied to a codebase with another month's worth of development and complexity.
Or it's the "same task" but their less confident past self was "Clever Hans"-ing the model with some nuance that they've since discarded without realizing.
Or it's simple expectation creep: due to limited generalization, the tasks aren't similar at all from an LLM's perspective, even though they are from a human's. Switching languages, for example, might as well make it a new task as far as LLM performance goes, but the human considers it the same task in a new language.
-
Whatever causes it, it's especially stressful because sometimes you do degrade the harness entirely by accident, but it's nearly impossible to separate that signal from the noise in user accounts, and the issue goes unfound far longer than it should.
Claude Code is somewhat fortunate that code has verifiable aspects, though, so you don't have to rely 100% on user accounts. My use case depends much more on subjective preference, so dealing with this stuff becomes the ninth circle of hell.
There have been many times when a change to the LLM stack didn't make it to prod and I jumped the gun on announcing it, yet users immediately flooded in with praise that the "missing" performance had returned.