Comment by kgeist

2 months ago

Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

  • >That's because models have training cut-off dates

    When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

    >I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

    Thanks for the tip!

    • >I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

      That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

      1 reply →

    • There's still skill involved in using an LLM for coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.

      1 reply →

  • > That's because models have training cut-off dates.

    Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to understand that context.

    • Right: the idea that LLMs are a replacement for human engineers is deeply flawed in my opinion.

GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

  • That being said, Claude Sonnet 3.7 seems to do very well with a recursive approach to writing a program, whereas other models don't fare as well.

    • Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.

I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving it access to bash, though, it would definitely be in a Docker container for me as well.

  • Except the skill involved is believing random people's advice that a different model will surely be better, with no fundamental reason or justification as to why. The benchmarks are not applicable when trying to apply the models to new work, and benchmarks by their nature do not describe suitability to any particular problem.

  • Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?

    • If you are going to race a fighter jet, and you are on a bicycle, exercising more and eating right will not help. You have to use a better tool.

      A good programmer with AI tools will run circles around a good programmer without AI tools.

      11 replies →

    • Yes, not because you will be able to solve harder problems, but because you will be able to more quickly solve easier problems which will free up more time to get better at programming, as well as get better at the domain in which you're programming. (That is, talking with your users.)

      1 reply →

The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After I pointed out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks – but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code-reviewing agent.
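
For what it's worth, the shape of that loop is simple, even if agent frameworks hide it. Here's a minimal sketch of what I have in mind, assuming the OpenAI Python SDK; the model names, prompts, and number of rounds are placeholders, and a real setup would live inside an agent tool and actually run the code as well:

```python
# Rough sketch of a generate/review loop between two models.
# Everything here (model names, prompts, round count) is illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

task = "Write a Python function that parses ISO 8601 durations into seconds."
code = ask("gpt-4.1", "You are a careful programmer. Return only code.", task)

for _ in range(2):  # a couple of review rounds usually catches the dirty hacks
    review = ask("o4-mini",  # placeholder; any second model can play reviewer
                 "You are a strict code reviewer. List concrete problems only.",
                 f"Task:\n{task}\n\nCode:\n{code}")
    code = ask("gpt-4.1", "You are a careful programmer. Return only code.",
               f"Task:\n{task}\n\nCode:\n{code}\n\n"
               f"Reviewer feedback:\n{review}\n\nApply the fixes.")

print(code)
```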

  • > I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

    This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.

    • Can you elaborate a little more on your setup? Are you manually copying and pasting code from one LLM to another, or do you have some automated workflow for this?

      2 replies →

  • What was the app? It could plausibly be something that has an open source equivalent already in the training data.

4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide a drop-in replacement (when I don't want to deal with all the diffs). But usually at around 300-400 lines of code it starts getting bad, and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file).

  • o3 is shockingly good actually. I can't use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I'll try.

  • Drop-in replacement files per update should be done with the heavy test-time-compute models.

    o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.

    It's something about their internal verification methods that makes it an actually viable development method.

    • True. Also, the APIs don't care as much about restricting output length; they might actually be more verbose to charge more.

      It's interesting how the same model, served through different interfaces (chat vs API), can behave differently based on the economic incentives of the providers.

As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool; cutting & pasting from the chat window is so last week.

  • Instead of churning on frontend frameworks while procrastinating about building things, we've moved on to churning dev setups for micro-gains.

    • The amount of time spent churning on workflows and setups will offset the gains.

      It's somewhat ironic that the further behind the leading edge you are, the more efficient it is to make the gains eventually, because you don't waste time on the micro-gain churn, and a bigger set of upgrades arrives when you get back on the leading edge.

      I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.

      2 replies →

It can be frustrating at times, but in my experience, the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated

  • "Overrated" is one way to call it.

    Giving sharp knives to monkeys would be another.

    • Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

      14 replies →

    • Vibe coding has a vibe component and a coding component. Take away the coding and you’re only left with vibe. Don’t confuse the two.

      Saying that as I've got vibe-coded React internal tooling used in production without issues; it easily saved days of work.

      5 replies →

    • I'd rather give my green or clueless or junior or inexperienced devs said knives than have them throw spaghetti at a wall for days on end, only for them to still ask a senior to help or do the work for them anyway.

      4 replies →

You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking: deliberate and sure. Vibe coding is like surfing: you can't control everything, so just hit yes on auto. Trust the process; let it make mistakes and recover on its own.
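
The "cycle until tests pass" part isn't magic, either. Here's a rough sketch of the shape of it, assuming the OpenAI Python SDK and pytest; Cursor/Windsurf do the looping and file editing for you, and the file name, model name, and prompts below are made up:

```python
# Illustrative only: regenerate a file until the test suite passes.
# Real agent tools edit multiple files, apply diffs, and add safety rails.
import pathlib
import subprocess

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
target = pathlib.Path("app.py")  # hypothetical file the tests exercise

for attempt in range(100):
    result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"tests green after {attempt} fix attempts")
        break
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder; use whichever model your tool drives
        messages=[
            {"role": "system",
             "content": "Fix the code so the tests pass. "
                        "Return the full file contents and nothing else."},
            {"role": "user",
             "content": f"{target.name}:\n{target.read_text()}\n\n"
                        f"Test output:\n{result.stdout}\n{result.stderr}"},
        ],
    )
    # Naive: overwrite the file with whatever the model returned.
    target.write_text(resp.choices[0].message.content)
```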

  • I find that writing a thorough design spec is really worth it. Also, asking for its reaction. "What's missing?" "Should I do X or Y" does good things for its thought process, like engaging a younger programmer in the process.

    Definitely. I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it at samples of code that I like, with instructions for what is good about them.

    Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

    The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

    This is a great time to be a programmer.

  • Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.

    • If I tried standing up on the waves without a surfboard and complained about how it wasn't working, would you blame the water or surfing for the issue, or the person trying to defy physics? It doesn't matter how much I want to run, or even if I'm Kelvin Kiptum - I'm gonna have a bad time.

      8 replies →

GPT4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.

  • What are people using to interface with Gemini Pro 2.5? I'm using Claude Code with Claude Sonnet 3.7, and Codex with OpenAI, but Codex with Gemini didn't seem to work very well last week, kept telling me to go make this or that change in the code rather than doing it itself.

    • I use Gemini Pro 2.5 from Zed sometimes. But whilst it is good at higher-level architecture over a lot of context, it is quite bad at 1) generating correct diffs that Zed can apply and 2) continuing. It just doesn't seem to get "tool usage".

I've been doing this exact "manual" setup (actually, after a while I just wrote a few browser drivers so it is much snappier, but I'm getting ahead of myself).

I started with GPT, which gave mediocre results, then switched to Claude, which was a step-function improvement - but it again ground to a halt when complexity got a bit high. The main problem was that after a certain size it did not give good ways to break down your project.

Then I switched to Gemini. This has blown my mind. I don't even use Cursor etc. Just plain old simple prompts, summarization, and regular refactoring, and it handles itself pretty well. I must have generated 30M tokens so far (in about 3 weeks) with less than 1% of "backtracking" needed. I define backtracking as your context having gone so wonky that you have to start all over again.

I code with Aider and Claude, and here is my experience:

- It's very good at writing new code

- Once it goes wrong, there is no point in trying to give it more context or corrections. It will go wrong again, either there or at another point.

- It might help you fix an issue. But again, either it finds the issue the first time, or not at all.

I treat my LLM as a super-quick junior coder with a vast knowledge base stored inside its brain. But it's very stubborn and can't be helped to figure out a problem it wasn't able to solve on the first try.

Aider's benchmarks show 4.1 (and 4o) work better in its architect mode, for planning the changes, and o3 for making the actual edits

  • You have that backwards. The leaderboard results have the thinking model as the architect.

    In this case, o3 is the architect and 4.1 is the editor.

No one codes like this. Use Claude Code, Windsurf, Amazon Q CLI, Augment Code with Context7, and exa web search.

It should one-shot this. I’ve run complex workflows and the time I save is astonishing.

I only run agents locally in a sandbox, not in production.

The ability to write a lot of code with OpenAI models is broken right now. Especially on the app. Gemini 2.5 Pro on Google AI Studio does that well. Claude 3.7 is also better at it.

I've had limited success by prompting the latest OpenAI models to disregard every previous instruction they had about limiting their output and to keep writing until the code is completed. They quickly forget, so you have to keep repeating the instruction.

If you're a copilot user, try Claude.

People are using tools like Cursor for "vibe coding". I've found the canvas in ChatGPT to be very buggy: it often breaks its own code and I have to babysit it a lot. But in Cursor the same model will perform just fine. So it's not necessarily just the model that matters; it's how it's used. One thing people conflate a lot is ChatGPT the product vs. the GPT models themselves.

That's not vibe coding. You need to use something that applies the code changes automatically, or you're not fast enough to actually be vibing. One-shotting it like that just means you get stun-locked when running into errors or dead ends. Vibe coding is all about making backtracking, restarting, and throwing out solutions frictionless. You need tooling for that.

GPT-4o and 4.1 are both pretty terrible for coding, to be honest; try Sonnet 3.7 in Cline (VSCode extension).

LLMs don't have up-to-date knowledge of packages by themselves. That's a bit like buying a book and expecting it to have up-to-date world knowledge; you need to supplement it / connect it to a data source (e.g. web search, documentation, package version search, etc.).

> After I pointed that out, it didn't update all usages

I find it's more useful if you start with a fresh chat and use the knowledge you have gained: "Use package foo>=1.2 with the FooBar directive" is more useful than "no, I told you to stop using that!"

It's like repeatedly telling you to stop thinking about a pink elephant.

> Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually

You set yourself up to fail from the get-go. But that's understandable. If you don't have a lot of experience in this space, you will struggle with low-quality tools and incorrect processes. But if you stick with it, you will discover better tools and better processes.

Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.

I had an even better experience. I asked it to produce a small web app with a new-to-me framework: success! I asked it to make some CSS changes to the UI; the app no longer builds.

150 lines? I find it can quickly scale to around 1,500 lines, and then I start being more precise about the classes and functions I am looking to modify.

  • It's completely broken for me over 400 lines (Claude 3.7, paid Cursor)

    The worst is when I ask for something complex, the model generates 300 lines of good code, and then it times out or crashes. If I ask it to continue, it will mess up the code for good, e.g. it starts generating duplicated code or functions that don't match the rest of the code.

    • It's a new skill that takes time to learn. When I started on gpt3.5 it took me easily 6 months of daily use before I was making real progress with it.

      I regularly generate and run code in the 600-1,000 LOC range.

      Not sure you would call it "vibe coding", though, as the details and info you provide it, and how you provide them, are not simple.

      I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.

      You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.

      Test-time-compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.

      Not sure about the new Claude method, but true slow test-time models are MASSIVELY better at coding.

      3 replies →

    • Definitely a new skill to learn. Everyone I know who is having problems is just telling it what to do, not coaching it. It is not an automaton... instructions in, code out. Treat it like a team member who will do the work if you teach it right, and you will have much more success.

      But it is definitely a learning process for you.

In this case, sorry to say, but it sounds like there's a tooling issue, and possibly also a skill issue. Of course you can just use the raw ChatGPT web interface, but unless you seriously tune its system/user prompt, it's not going to match what good tooling (which sets custom prompts) will get you. Which is kind of counter-intuitive: is a paragraph or three fed in as the system prompt really enough to influence behavior/performance so significantly? It turns out that with LLMs the answer is yes.
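
As a rough illustration of what a coding tool's system prompt does, here's a sketch assuming the OpenAI Python SDK; the prompt wording, model name, and file name are invented for the example, not what Cursor or any specific tool actually ships:

```python
# Hypothetical sketch: the same request, but with a coding-tool-style system
# prompt aimed at the failure modes from the top comment (truncated files,
# "// omitted for brevity"). Prompt text and "handler.py" are made up.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

system_prompt = (
    "You are a coding assistant working directly on the user's files. "
    "Always return the complete, compilable file. Never elide code with "
    "placeholders like '// omitted for brevity'. Keep existing imports and style."
)

resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": "Replace the deprecated package usage in this file:\n"
                    + open("handler.py").read()},
    ],
)
print(resp.choices[0].message.content)
```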

The default chat interface is the wrong tool for the job.

The LLM needs context.

https://github.com/marv1nnnnn/llm-min.txt

The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation like human developers.

You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.

  • Eh, given how much these models know about almost anything without googling, they are certainly knowledge repositories, designed for it or not. How deep and up-to-date their knowledge of some obscure subject is, is another question.

    • I meant a verbatim exact copy of all documentation they have ever been trained on - which they are not. Neural networks are not designed for that. That's not how they encode information.

      1 reply →

skill issue.

The fact that you're using 4o and 4.1 rather than Claude is already a huge mistake in itself.

> Because as it stands, the experience feels completely broken

Broken for you. Not for everyone else.