Comment by kgeist

2 months ago

Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

  • >That's because models have training cut-off dates

    When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

    >I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

    Thanks for the tip!

    • >I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

      That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

      1 reply →

    • There's still skill involved in using an LLM for coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.

      1 reply →

  • > That's because models have training cut-off dates.

    Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to understand that context.

    • Right: the idea that LLMs are a replacement for human engineers is deeply flawed in my opinion.

GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

  • That being said, Claude Sonnet 3.7 seems to do very well with a recursive approach to writing a program, whereas other models don't fare as well.

    • Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.

I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving it access to bash, though, it would definitely be in a Docker container for me as well.

  • Except the skill involved is believing random people's advice that a different model will surely be better, with no fundamental reason or justification as to why. The benchmarks are not applicable when trying to apply the models to new work, and benchmarks by their nature do not describe suitability to any particular problem.

  • Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?

    • If you are going to race a fighter jet, and you are on a bicycle, exercising more and eating right will not help. You have to use a better tool.

      A good programmer with AI tools will run circles around a good programmer without AI tools.

      11 replies →

    • Yes, not because you will be able to solve harder problems, but because you will be able to more quickly solve easier problems which will free up more time to get better at programming, as well as get better at the domain in which you're programming. (That is, talking with your users.)

      1 reply →

The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After I pointed out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks – but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code-reviewing agent.
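
For what it's worth, the shape of that loop is simple, even if agent frameworks hide it. Here's a minimal sketch of what I have in mind, assuming the OpenAI Python SDK; the model names, prompts, and number of rounds are placeholders, and a real setup would live inside an agent tool and actually run the code as well:

```python
# Rough sketch of a generate/review loop between two models.
# Everything here (model names, prompts, round count) is illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

task = "Write a Python function that parses ISO 8601 durations into seconds."
code = ask("gpt-4.1", "You are a careful programmer. Return only code.", task)

for _ in range(2):  # a couple of review rounds usually catches the dirty hacks
    review = ask("o4-mini",  # placeholder; any second model can play reviewer
                 "You are a strict code reviewer. List concrete problems only.",
                 f"Task:\n{task}\n\nCode:\n{code}")
    code = ask("gpt-4.1", "You are a careful programmer. Return only code.",
               f"Task:\n{task}\n\nCode:\n{code}\n\n"
               f"Reviewer feedback:\n{review}\n\nApply the fixes.")

print(code)
```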

  • > I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

    This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.

    • Can you elaborate a little more on your setup? Are you manually copying and pasting code from one LLM to another, or do you have some automated workflow for this?

      2 replies →

  • What was the app? It could plausibly be something that has an open source equivalent already in the training data.

4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide a drop-in replacement (when I don't want to deal with all the diffs). But usually at around 300-400 lines of code it starts getting bad, and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file).

  • o3 is shockingly good actually. I can't use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I'll try.

  • Drop-in replacement files per update should be done with the heavy test-time-compute models.

    o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.

    It's something about their internal verification methods that makes it an actually viable development method.

    • True. Also, the APIs don't care as much about restricting output length; they might actually be more verbose to charge more.

      It's interesting how the same model, served through different interfaces (chat vs API), can behave differently based on the economic incentives of the providers.

As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool; cutting & pasting from the chat window is so last week.

  • Instead of churning on frontend frameworks while procrastinating about building things, we've moved on to churning dev setups for micro-gains.

    • The amount of time spent churning on workflows and setups will offset the gains.

      It's somewhat ironic that the further behind the leading edge you are, the more efficient it is to make the gains eventually, because you don't waste time on the micro-gain churn, and a bigger set of upgrades arrives when you get back on the leading edge.

      I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.

      2 replies →

It can be frustrating at times, but in my experience, the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated

  • "Overrated" is one way to call it.

    Giving sharp knives to monkeys would be another.

    • Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

      14 replies →

    • Vibe coding has a vibe component and a coding component. Take away the coding and you’re only left with vibe. Don’t confuse the two.

      Saying that as I've got vibe-coded React internal tooling used in production without issues; it easily saved days of work.

      5 replies →

    • I'd rather give my green or clueless or junior or inexperienced devs said knives than have them throw spaghetti at a wall for days on end, only for them to still ask a senior to help or do the work for them anyway.

      4 replies →

You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking: deliberate and sure. Vibe coding is like surfing: you can't control everything, so just hit yes on auto. Trust the process; let it make mistakes and recover on its own.
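
The "cycle until tests pass" part isn't magic, either. Here's a rough sketch of the shape of it, assuming the OpenAI Python SDK and pytest; Cursor/Windsurf do the looping and file editing for you, and the file name, model name, and prompts below are made up:

```python
# Illustrative only: regenerate a file until the test suite passes.
# Real agent tools edit multiple files, apply diffs, and add safety rails.
import pathlib
import subprocess

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
target = pathlib.Path("app.py")  # hypothetical file the tests exercise

for attempt in range(100):
    result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"tests green after {attempt} fix attempts")
        break
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder; use whichever model your tool drives
        messages=[
            {"role": "system",
             "content": "Fix the code so the tests pass. "
                        "Return the full file contents and nothing else."},
            {"role": "user",
             "content": f"{target.name}:\n{target.read_text()}\n\n"
                        f"Test output:\n{result.stdout}\n{result.stderr}"},
        ],
    )
    # Naive: overwrite the file with whatever the model returned.
    target.write_text(resp.choices[0].message.content)
```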

  • I find that writing a thorough design spec is really worth it. Also, asking for its reaction. "What's missing?" "Should I do X or Y" does good things for its thought process, like engaging a younger programmer in the process.

    Definitely. I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it at samples of code that I like, with instructions for what is good about them.

    Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

    The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

    This is a great time to be a programmer.

  • Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.

    • If I tried standing up on the waves without a surfboard and complained about how it wasn't working, would you blame the water or surfing for the issue, or the person trying to defy physics? It doesn't matter how much I want to run, or even if I'm Kelvin Kiptum - I'm gonna have a bad time.

      8 replies →

GPT4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.

  • What are people using to interface with Gemini Pro 2.5? I'm using Claude Code with Claude Sonnet 3.7, and Codex with OpenAI, but Codex with Gemini didn't seem to work very well last week, kept telling me to go make this or that change in the code rather than doing it itself.

    • I use Gemini Pro 2.5 from Zed sometimes. But whilst it is good at higher-level architecture over a lot of context, it is quite bad at 1) generating correct diffs that Zed can apply and 2) continuing. It just doesn't seem to get "tool usage".

I've been doing this exact "manual" setup (actually, after a while I just wrote a few browser drivers so it is much snappier, but I'm getting ahead of myself).

I started with GPT, which gave mediocre results, then switched to Claude, which was a step-function improvement - but it again ground to a halt when complexity got a bit high. The main problem was that after a certain size it did not give good ways to break down your project.

Then I switched to Gemini. This has blown my mind. I don't even use Cursor etc. Just plain old simple prompts, summarization, and regular refactoring, and it handles itself pretty well. I must have generated 30M tokens so far (in about 3 weeks) with less than 1% of "backtracking" needed. I define backtracking as your context having gone so wonky that you have to start all over again.

I code with Aider and Claude, and here is my experience:

- It's very good at writing new code

- Once it goes wrong, there is no point in trying to give it more context or corrections. It will go wrong again, either there or at another point.

- It might help you fix an issue. But again, either it finds the issue the first time, or not at all.

I treat my LLM as a super-quick junior coder with a vast knowledge base stored inside its brain. But it's very stubborn and can't be helped to figure out a problem it wasn't able to solve on the first try.

Aider's benchmarks show 4.1 (and 4o) work better in its architect mode, for planning the changes, and o3 for making the actual edits

  • You have that backwards. The leaderboard results have the thinking model as the architect.

    In this case, o3 is the architect and 4.1 is the editor.

No one codes like this. Use Claude Code, Windsurf, Amazon Q CLI, Augment Code with Context7, and exa web search.

It should one-shot this. I’ve run complex workflows and the time I save is astonishing.

I only run agents locally in a sandbox, not in production.

The ability to write a lot of code with OpenAI models is broken right now. Especially on the app. Gemini 2.5 Pro on Google AI Studio does that well. Claude 3.7 is also better at it.

I've had limited success by prompting the latest OpenAI models to disregard every previous instruction they had about limiting their output and to keep writing until the code is completed. They quickly forget, so you have to keep repeating the instruction.

If you're a copilot user, try Claude.

People are using tools like Cursor for "vibe coding". I've found the canvas in ChatGPT to be very buggy: it often breaks its own code and I have to babysit it a lot. But in Cursor the same model will perform just fine. So it's not necessarily just the model that matters; it's how it's used. One thing people conflate a lot is ChatGPT the product vs. the GPT models themselves.

That's not vibe coding. You need to use something that applies the code changes automatically, or you're not fast enough to actually be vibing. One-shotting it like that just means you get stun-locked when running into errors or dead ends. Vibe coding is all about making backtracking, restarting, and throwing out solutions frictionless. You need tooling for that.

GPT-4o and 4.1 are both pretty terrible for coding, to be honest; try Sonnet 3.7 in Cline (VSCode extension).

LLMs don't have up-to-date knowledge of packages by themselves. That's a bit like buying a book and expecting it to have up-to-date world knowledge; you need to supplement it / connect it to a data source (e.g. web search, documentation, package version search, etc.).

> After I pointed that out, it didn't update all usages

I find it's more useful if you start with a fresh chat and use the knowledge you have gained: "Use package foo>=1.2 with the FooBar directive" is more useful than "no, I told you to stop using that!"

It's like repeatedly telling you to stop thinking about a pink elephant.

> Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually

You set yourself up to fail from the get-go. But that's understandable. If you don't have a lot of experience in this space, you will struggle with low-quality tools and incorrect processes. But if you stick with it, you will discover better tools and better processes.

Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.

I had an even better experience. I asked it to produce a small web app with a new-to-me framework: success! I asked it to make some CSS changes to the UI; the app no longer builds.

150 lines? I find it can quickly scale to around 1,500 lines, and then I start being more precise about the classes and functions I am looking to modify.

  • It's completely broken for me over 400 lines (Claude 3.7, paid Cursor)

    The worst is when I ask for something complex, the model generates 300 lines of good code, and then it times out or crashes. If I ask it to continue, it will mess up the code for good, e.g. it starts generating duplicated code or functions that don't match the rest of the code.

    • It's a new skill that takes time to learn. When I started on gpt3.5 it took me easily 6 months of daily use before I was making real progress with it.

      I regularly generate and run code in the 600-1,000 LOC range.

      Not sure you would call it "vibe coding", though, as the details and info you provide it, and how you provide them, are not simple.

      I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.

      You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.

      Test-time-compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.

      Not sure about the new Claude method, but true slow test-time models are MASSIVELY better at coding.

      3 replies →

    • Definitely a new skill to learn. Everyone I know who is having problems is just telling it what to do, not coaching it. It is not an automaton... instructions in, code out. Treat it like a team member who will do the work if you teach it right, and you will have much more success.

      But it is definitely a learning process for you.

In this case, sorry to say, but it sounds like there's a tooling issue, and possibly also a skill issue. Of course you can just use the raw ChatGPT web interface, but unless you seriously tune its system/user prompt, it's not going to match what good tooling (which sets custom prompts) will get you. Which is kind of counter-intuitive: is a paragraph or three fed in as the system prompt really enough to influence behavior/performance so significantly? It turns out that with LLMs the answer is yes.
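
As a rough illustration of what a coding tool's system prompt does, here's a sketch assuming the OpenAI Python SDK; the prompt wording, model name, and file name are invented for the example, not what Cursor or any specific tool actually ships:

```python
# Hypothetical sketch: the same request, but with a coding-tool-style system
# prompt aimed at the failure modes from the top comment (truncated files,
# "// omitted for brevity"). Prompt text and "handler.py" are made up.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

system_prompt = (
    "You are a coding assistant working directly on the user's files. "
    "Always return the complete, compilable file. Never elide code with "
    "placeholders like '// omitted for brevity'. Keep existing imports and style."
)

resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": "Replace the deprecated package usage in this file:\n"
                    + open("handler.py").read()},
    ],
)
print(resp.choices[0].message.content)
```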

The default chat interface is the wrong tool for the job.

The LLM needs context.

https://github.com/marv1nnnnn/llm-min.txt

The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation like human developers.

You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.

  • Eh, given how much these models know about almost anything without googling, they are certainly knowledge repositories, designed for it or not. How deep and up-to-date their knowledge of some obscure subject is, is another question.

    • I meant a verbatim exact copy of all documentation they have ever been trained on - which they are not. Neural networks are not designed for that. That's not how they encode information.

      1 reply →

skill issue.

The fact that you're using 4o and 4.1 rather than Claude is already a huge mistake in itself.

> Because as it stands, the experience feels completely broken

Broken for you. Not for everyone else.