Comment by pbalau

3 months ago

> what romanian football player won the premier league

> The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that’s incorrect; he never won the league.

> ...

> No Romanian footballer has ever won the Premier League (as of 2025).

Yes, this is what we needed, a more "conversational" ChatGPT... Never mind the fact that the answer is wrong.

My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.

Most of the time, I suspect, people are using it like Wikipedia, but with a shortcut that cuts straight through to the real question they want answered; and unfortunately they don't know whether the answer is right or wrong, they just want to be told how bright they were for asking, and then handed the answer.

OpenAI then get caught in a revenue maximising hell-hole of garbage.

God, I hope I am wrong.

  • LLMs only really make sense for tasks where verifying the solution (which you have to do!) is significantly easier than solving the problem: translation where you know the target and source languages, agentic coding with automated tests, some forms of drafting or copy editing, etc.

    General search is not one of those! Sure, the machine can give you its sources but it won't tell you about sources it ignored. And verifying the sources requires reading them, so you don't save any time.

    • I agree a lot with the first part. The only time I actually feel productive with them is when I can have a short feedback cycle with 100% proof of whether the result is correct or not; as soon as "manual human verification" is needed, things spiral out of control quickly.

      > Sure, the machine can give you its sources but it won't tell you about sources it ignored.

      You can prompt for that though: include something like "Include all the sources you came across, and explain why you think each was irrelevant" and, unsurprisingly, it'll include those. I've also added a "verify_claim" tool which the model is instructed to use for every claim before sharing a final response; it checks each claim inside a brand-new context, one call per claim. So far it works great for me with GPT-OSS-120b as a local agent with access to search tools.
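      Roughly, the shape of such a tool, as a minimal sketch against an OpenAI-compatible local endpoint (the endpoint URL, the web_search() helper, and the exact schema are my assumptions, not the commenter's actual setup):

        # Sketch of a "verify_claim" tool for a local, OpenAI-compatible server
        # (e.g. one serving GPT-OSS-120b). web_search() is a stand-in for
        # whatever search tool the agent already has access to.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

        VERIFY_CLAIM_TOOL = {
            "type": "function",
            "function": {
                "name": "verify_claim",
                "description": "Check one factual claim before it goes into the final answer.",
                "parameters": {
                    "type": "object",
                    "properties": {"claim": {"type": "string"}},
                    "required": ["claim"],
                },
            },
        }

        def web_search(query: str) -> str:
            """Placeholder for the agent's existing search tool."""
            raise NotImplementedError

        def verify_claim(claim: str) -> str:
            """Check one claim in a brand-new context: no chat history, one call per claim."""
            evidence = web_search(claim)
            resp = client.chat.completions.create(
                model="gpt-oss-120b",  # assumed local model name
                messages=[
                    {"role": "system", "content": "Judge the claim strictly against the evidence. "
                     "Reply SUPPORTED, CONTRADICTED, or NOT FOUND, with a one-line reason."},
                    {"role": "user", "content": f"Claim: {claim}\n\nEvidence:\n{evidence}"},
                ],
            )
            return resp.choices[0].message.content

      The main loop just executes any verify_claim tool calls with that handler and feeds the verdicts back in before the model writes its final response.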

      4 replies →

    • One of the dangers of automated tests is that if you use an LLM to generate the tests, it can easily start testing the implemented behavior rather than the desired behavior (toy example below). Tell it to loop until tests pass, and it will do exactly that if unsupervised.

      And you can’t even treat implementation as a black box, even using different LLMs, when all the frontier models are trained to have similar biases towards confidence and obsequiousness in making assumptions about the spec!

      Verifying the solution in agentic coding is not nearly as easy as it sounds.
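      A toy illustration of that failure mode (the function, the spec, and the tests are all hypothetical):

        # Spec: apply a 10% discount.
        def discounted_price(price: float) -> float:
            return price * 0.8  # bug: this actually applies a 20% discount

        # A test derived from reading the implementation: it passes and cements the bug.
        def test_discount_matches_implementation():
            assert discounted_price(100.0) == 80.0

        # A test written from the spec: the one that would actually catch the bug.
        def test_discount_matches_spec():
            assert discounted_price(100.0) == 90.0

      Told to "loop until tests pass" unsupervised, an agent is at least as likely to keep the first test as to fix the function.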

      1 reply →

    • I've often found it helpful in search. Specifically, when the topic is well documented and you can describe it clearly but lack the right words or terminology, it can help you find the right question to ask, if not answer it. Remember when we used to laugh at people typing literal questions into the Google search bar? Those are exactly the kinds of queries an LLM is equipped to answer. As for the "improvements" in GPT 5.1, it seems to me like another case of pushing Clippy on people who want Anton: https://www.latent.space/p/clippy-v-anton

    • That's a major use case, especially if the definition is broad enough to include "take my expertise, my knowledge, and perhaps a written document, and transmute it into other forms": slides, illustrations, flash cards, quizzes, podcasts, scripts for an inbound call center.

      But there seem to be uses where a verified solution is irrelevant. Creative work generally: an image, a poem, the description of an NPC in a roleplaying game, the visuals for a music video never have to be "true", just evocative. I suppose persuasive rhetoric doesn't have to be true either, just plausible or engaging.

      As for general search, I don't know that "classic search" can be meaningfully said to tell you about the sources it ignored. I will agree that using OpenAI or Perplexity for search is kind of meh, but Google's AI Mode does a reasonable job of informing you about the links it provides, and you can easily tab over to a classic search if you want. It's almost like having a depth of expertise in doing search helps in building a search product that incorporates an LLM...

      But, yeah, if one is really uninterested in looking at sources, just chatting with a typical LLM seems a rather dubious way to get an accurate or reasonably comprehensive answer.

  • I’m of two minds about this.

    The ass-licking is dangerous to our already too-tight information bubbles; that part is clear. But that aside, I think I prefer a conversational, buddy-like interaction to an encyclopedic tone.

    Intuitively I think it is easier to make the connection that this random buddy might be wrong, rather than thinking the encyclopedia is wrong. Casualness might serve to reduce the tendency to think of the output as actual truth.

    • Sam Altman probably can’t handle any GPT models that don’t ass lick to an extreme degree so they likely get nerfed before they reach the public.

  • It's very frustrating that it can't be relied upon. I asked Gemini this morning whether Uncharted 1, 2 and 3 have remastered versions for the PS5. It said no. Then five minutes later, on the PSN store, I saw the three remastered versions for sale.

  • People have been using "It's what the [insert Blazing Saddles clip here] want!" for years to describe platform changes that dumb down features and make it harder to use tools productively. As always, it's a lie; the real reason is "The new way makes us more money," usually by way of a dark pattern.

    Stop giving them the benefit of the doubt. Be overly suspicious and let them walk you back to trust (that's their job).

  • > My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.

    That tracks; it's what's expected of human customer service, too. Call a large company for support and you'll get the same sort of tone.

Which model did you use? With 5.1 Thinking, I get:

"Costel Pantilimon is the Romanian footballer who won the English Premier League.

"He did it twice with Manchester City, in the 2011–12 and 2013–14 seasons, earning a winner’s medal as a backup goalkeeper. ([Wikipedia][1])

URLs:

* [https://en.wikipedia.org/wiki/Costel_Pantilimon]

* [https://www.transfermarkt.com/costel-pantilimon/erfolge/spie...]

* [https://thefootballfaithful.com/worst-players-win-premier-le...]

[1]: https://en.wikipedia.org/wiki/Costel_Pantilimon?utm_source=c... "Costel Pantilimon""

  • I just asked ChatGPT 5.1 auto (not instant) on a Teams account, and its first response was...

    I could not find a Romanian football player who has won the Premier League title.

    If you like, I can check deeper records to verify whether any Romanian has been part of a title-winning squad (even if as a non-regular player) and report back.

    Then I followed up with an 'ok' and it found the right player.

    • Just to rule out a random error, I asked the same question two more times in separate chats to GPT 5.1 auto; below are the responses...

      #2: One Romanian footballer who did not win the Premier League but played in it is Dan Petrescu.

      If you meant actually won the Premier League title (as opposed to just playing), I couldn’t find a Romanian player who is a verified Premier League champion.

      Would you like me to check more deeply (perhaps look at medal-winners lists) to see if there is a Romanian player who earned a title medal?

      #3: The Romanian football player who won the Premier League is Costel Pantilimon.

      He was part of Manchester City when they won the Premier League in 2011-12 and again in 2013-14. Wikipedia +1

  • The beauty of nondeterminism. I get:

    The Romanian football player who won the Premier League is Gheorghe Hagi. He played for Galatasaray in Turkey but had a brief spell in the Premier League with Wimbledon in the 1990s, although he didn't win the Premier League with them.

    However, Marius Lăcătuș won the Premier League with Arsenal in the late 1990s, being a key member of their squad.

  • Same:

    Yes — the Romanian player is Costel Pantilimon. He won the Premier League with Manchester City in the 2011-12 and 2013-14 seasons.

    If you meant another Romanian player (perhaps one who featured more prominently rather than as a backup), I can check.

  • Same here, but with the default 5.1 auto and no extra settings. Every time someone posts one of these I just imagine they must have misunderstood the UI settings or cluttered their context somehow.

Why is this the top comment? This isn't a question you ask an LLM. But I know, that's how people are using them, and that's the narrative being sold to us...

  • You see people (often business people who are enthusiastic about tech) claiming that these bots are the new Google and Wikipedia, and that you're behind the times if you do what amounts to looking up information yourself.

    We’re preaching to the choir by being insistent here that you prompt these things to get a “vibe” about a topic rather than accurate information, but it bears repeating.

    • They are only the new Google when they are told to process and summarize web searches. When using trained knowledge they're about as reliable as a smart but stubborn uncle.

      Pretty much only search-specific modes (perplexity, deep research toggles) do that right now...

    • Out of curiosity, is this a question you think Google is well-suited to answer^? How many Wikipedia pages will you need to open to determine the answer?

      When folks are frustrated because they see a bizarre question that is an extreme outlier being touted as "the model still can't do _", part of it is because you've set the goalposts so far beyond what traditional Google search or Wikipedia are useful for.

      ^ I spent about five minutes looking for the answer via Google, and the only way I got it was their AI summary. Thus, I would still need to confirm the fact.

      2 replies →

  • It's not how I use LLMs. I have a family member who often feels the need to ask ChatGPT almost any question that comes up in a group conversation (even ones like this that could easily be searched without needing an LLM) though, and I imagine he's not the only one who does this. When you give someone a hammer, sometimes they'll try to have a conversation with it.

  • What do you ask them then?

    • I'll respond to this bait in the hope that it clicks for someone how _not_ to use an LLM...

      Asking "them"... your perspective is already warped. It's not your fault, all the text we've previously ever seen is associated with a human being.

      Language models are mathematical, statistical beasts. The beast generally doesn't do well with open-ended questions (known as "zero-shot"). It shines when you give it something to work off of ("one-shot").

      Some may complain about the precision of my use of "zero-shot" and "one-shot" here, but I use them merely to contrast open-ended questions with prompts that provide some context and work to be done.

      Some examples...

      - summarize the following

      - given this code, break down each part

      - give alternatives of this code and trade-offs

      - given this error, how to fix or begin troubleshooting
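      Concretely, the contrast in API terms might look something like this (a minimal sketch; the model name and the deploy.sh file are placeholders):

        # Open-ended question vs. a prompt that carries its own material to work on.
        from openai import OpenAI

        client = OpenAI()

        # Open-ended: nothing for the model to anchor on, nothing for you to verify against.
        open_ended = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "What is the best way to handle errors?"}],
        )

        # Grounded: the prompt supplies the code to work off of, so the answer is checkable.
        script = open("deploy.sh").read()
        grounded = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Given this script, break down each part and point out anything fragile:\n\n{script}",
            }],
        )
        print(grounded.choices[0].message.content)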

      I mainly use them for technical things I can then verify myself.

      While extremely useful, I consider them extremely dangerous. They provide a false sense of "knowing things"/"learning"/"productivity". It's too easy to begin to rely on them as a crutch.

      When learning new programming languages, I go back to writing by hand and compiling in my head. I need that mechanical muscle memory, same as trying to learn calculus or physics, chemistry, etc.

      6 replies →

    • You either give them the option to search the web for facts or you ask them things where the utility/validity of the answer is defined by you (e.g. 'summarize the following text...') instead of the external world.

I really only use LLMs for coding and IT-related questions. I've had Claude correct itself several times about what the more idiomatic way to do something might be, after it had already started giving me the answer. For example, I'll ask how to set something up in a startup script, and I've had it start by giving me strict POSIX syntax, then self-correct once it "realizes" that I am using zsh.

I find it amusing, but also I wonder what causes the LLM to behave this way.

  • > I find it amusing, but also I wonder what causes the LLM to behave this way.

    Forum threads etc. presumably contain writers changing their minds in response to feedback, which might have this effect.

    • Some people are guilty of writing their comments as they go along, as well. You could even say they're "thinking out loud", forming the idea and the conclusion as they write rather than knowing it from the beginning. Then later, when they have some realization, like "thinking out loud isn't entirely accurate, but...", they keep the entire comment as-is rather than continuously iterating on it like a diffusion model would. So the post becomes a chronological archive of what the author thought and/or did, rather than just the conclusion.

We need to turn this into the new "pelican on bike" LLM test.

Let's call it "Florin Andone on Premier League" :-)))

Meanwhile on duck.ai

ChatGPT 4o-mini, 5 mini and OSS 120B gave me wrong answers.

Llama 4 Scout completely broke down.

Claude Haiku 3.5 and Mistral Small 3 gave the correct answer.

Why are you asking about facts?

Okay, as a benchmark, we can try that. But it probably will never work, unless it does a web or db query.

  • Okay, so, should I not ask it about facts?

    Because, one way or another, we will need to do that for LLMs to be useful. Whether the facts come from the training data or from context knowledge (RAG-provided) is irrelevant. And besides, we are supposed to trust that these things have "world knowledge" and "emergent capabilities" precisely because their training data contain, well, facts.

The best thing is that all of this counts against your token usage, so they have a perverse incentive :D

  • Non-thinking, non-agentic models must one-shot the answer, so every token they output is part of the response, even if it's wrong.

    This is why people are getting different results with thinking models; without a thinking phase, it's as if you could be asked ANY question and had to give the correct answer all at once, in one stream of consciousness.

    Yes there are perverse incentives, but I wonder why these sorts of models are available at all tbh.

"Ah-- that's a classic confusion about football players. Your intuition is almost right-- let me break it down"