Comment by westoncb

6 months ago

There is a skill to it. You can get lucky as a beginner but if you want consistent success you gotta learn the ropes (strengths, weaknesses, failure modes etc).

A quick way of getting seriously improved results though: if you are literally using GPT-4 as you mention—that is an ancient model! Parent comment says GPT-4.1 (yes, openai is unimaginably horrible at naming, but that ".1" isn't a minor version increment). And even though 4.1 is far better, I would never use it for real work. Use the strongest models; if you want to stick with openai, use o3 (it's now super cheap too). Gemini 2.5 Pro is roughly equivalent to o3, as another option. IMO Claude models are stronger in agentic settings, but won't match o3 or Gemini 2.5 Pro for deep problem solving or nice, "thought out" code.

Specific model I was using was o4-mini-high which the drop-down model selector describes as "Great at coding and visual reasoning".

  • I'm curious how you ended up in such a conversation in the first place. Hallucinations are one thing, but I can't remember the last time a model claimed it had actually run something somewhere that wasn't a tool use call, or that it owns a laptop, or such - except when role-playing.

    I wonder if the advice on prompting models to role-play isn't backfiring now, especially in conversational settings. There might even be a difference between "you are an AI assistant that's an expert programmer" vs. "you are an expert programmer" in the prompt, the latter pushing it towards the "role-playing a human" region of the latent space.

    (But also, yeah, o3. Search access is key to cutting down on the amount of guessed answers, and o3 uses it judiciously. It's the only model I use for "chat" when the topic requires any kind of niche or current knowledge, because it's the only model I've seen reliably figure out when and what to search for, and do it iteratively.)

    • I've seen that specific kind of role-playing glitch here and there with the o[X] models from openai. The models do kinda seem to think of themselves as developers with their own machines. I think it usually just doesn't come up, but they can easily be tilted into it.

    • What is really interesting is that in the "thinking" section it said "I need to reassure the user...". So my intuition is that it thought it was right, but didn't expect me to believe it was right, and that if it just projected confidence, I would try the code and unblock myself. Maybe it judged this gave the best chance that I would listen to it, and so it was the correct response?

      5 replies →

    • A friend recently had a similar interaction where ChatGPT told them it had just sent them an email or a WeTransfer with the requested file.

  • Gotcha. Yeah, give o3 a try. If you don't want to get a sub, you can use it over the API for pennies. They do have you do this biometric registration thing that's kind of annoying if you want to use the API, though.

    You can get the Google pro subscription (forget what they call it) that's ordinarily $20/mo for free right now (1 month free; can cancel whenever), which gives unlimited Gemini 2.5 Pro access.

    • > Gotcha. Yeah, give o3 a try. If you don't want to get a sub, you can use it over the API for pennies. They do have you do this biometric registration thing that's kind of annoying if you want to use the API, though.

      I hope you appreciate just how crazy this sentence sounds, even in an age when this is normalised.

      1 reply →

    • Yeah, this model didn't work it seems.

      You're holding it wrong. You need to utter the right series of incantations to get some semblance of truth.

      What, you used the model that was SOTA one week ago? Big mistake, that explains why.

      You need to use this SOTA model that came out one day ago instead. That model definitely wasn't trained to overfit the week-old benchmarks and dismiss the naysayers. Look, a pelican!

      What? You haven't verified your phone number and completed a video facial scan and passed a background check? You're NGMI.

      1 reply →

    • Thank you for the tip on o3. I will switch to that and see how it goes. I do have a paid sub for ChatGPT, but from the dropdown model descriptions "Great at coding" sounded better than "Advanced reasoning". And 4 is like almost twice as much as 3.

      13 replies →

  • All LLMs can fail this way.

    It's kind of weird to see people running into this kind of issue with modern large models, with all their RL, and getting confused by it. No one starting today seems to have a good intuition for them. One person I knew insisted for months that LLMs could do structural analysis, until he saw some completely absurd output from one. This failure mode used to be super common with the small GPTs from around 2022, so everyone just intuitively knew to watch out for it.

Literally astrology at this point. We don't understand the black box bs generating machine, but actually if you prod it this and that way according to some vague vibe, then it yields results that even if wrong are enough to fool you.

And christ, every single time there's the same retort: "ah but of course your results are shit, you must not be using gpt-4.69-o7-turbo-pro which came out this morning". Come on...

  • You sit at the opposite end of the spectrum, refusing with all your might to admit that there might be something useful there at all. It's all just a BS generator that nothing, nothing at all useful can come out of, right? You might think you are a staunch critic and realist whom no hype can touch and who sees through all of it, when in fact you are wilfully ignorant.

    • Here's some BS for you.

      That's an unfair mischaracterization of their position. Criticism doesn't equal rejection, and skepticism isn't the same as ignorance. Pointing out limitations, failures, or hype doesn't mean they are claiming there's nothing useful or that the entire technology is inherently worthless.

      Being critical is not about denying all value—it’s about demanding evidence, accuracy, and clarity amid inflated claims. In fact, responsible critique helps improve technology by identifying where it falls short, so it can evolve into something genuinely useful and reliable.

      What you're calling "willful ignorance" is, in reality, a refusal to blindly accept marketing narratives or inflated expectations. That’s not being closed-minded—that’s being discerning.

      If there is something truly valuable, it will stand up to scrutiny.

    • > refusing with all your might that there might be something useful there at all

      How does this follow from what I wrote? I addressed two very concrete points.