Can LLMs write better code if you keep asking them to “write better code”?

1 year ago (minimaxir.com)

I'm amused that neither the LLM nor the author identified one of the simplest and most effective optimizations for this code: test whether the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.

On an M1 MacBook Pro, using numpy to generate the random numbers and mod/div to compute the digit sums:

Base: 55ms

Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).

The LLM appears less adept at identifying big-O improvements than at other kinds of optimization, which is pretty consistent with my experience using LLMs to write code.
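
Roughly the version I'm describing, as a sketch (assuming the draws sit in a NumPy array; the original benchmark harness may differ):

    import numpy as np

    def digit_sum(n):
        # mod/div digit sum, no string conversion
        s = 0
        while n:
            s += n % 10
            n //= 10
        return s

    def difference(nums):
        lo = hi = None
        for n in nums:
            # skip the digit sum entirely unless n could improve the current min/max
            if lo is not None and lo <= n <= hi:
                continue
            if digit_sum(n) == 30:
                if lo is None:
                    lo = hi = n
                else:
                    lo, hi = min(lo, n), max(hi, n)
        return None if lo is None else hi - lo

    nums = np.random.randint(1, 100_001, size=1_000_000)
    print(difference(nums))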

  • There's another, arguably even simpler, optimization that makes me smile. (Because it's silly and arises only from the oddity of the task, and because it's such a huge performance gain.)

    You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).

    Timings on my machine:

    Naive implementation (mod+div for digit-sums): 1.6s. Computing digit-sum only when out of range: 0.12s. Checking for the usual case first: 0.0004s.

    The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.

    I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. (Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? Obviously in that case my hack would be completely useless.)
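
    Concretely, the usual-case check might look something like this (a sketch, assuming the draws are in a NumPy array; the fallback can be any of the slower methods):

        import numpy as np

        def difference_usual_case(nums):
            # 3999 is the smallest number with digit sum 30, 99930 the largest in range
            if (nums == 3999).any() and (nums == 99930).any():
                return 99930 - 3999
            # ~1 in 10^4 chance of getting here; fall back to a slower full scan
            hits = [int(n) for n in np.unique(nums) if sum(map(int, str(n))) == 30]
            return max(hits) - min(hits) if hits else None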

    • This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on the fly.

      52 replies →

    • This gave me an idea that we can skip the whole pass over the million draws by noting that the count of draws landing in my precomputed set M (digits-sum=30) is Binomial(n=1mln, p=|M|/100k). Then we sample that count X. If X=0, the difference is not defined. Otherwise, we can directly draw (min,max) from the correct joint distribution of indices (like you’d get if you actually did X draws in M). Finally we return M[max] - M[min]. It’s O(1) at runtime (ignoring the offline step of listing all numbers whose digits sum to 30).
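
      A sketch of the sampling side (this version still draws the X hit indices explicitly, so it's O(X) rather than fully O(1); the fully O(1) variant would sample the min/max order statistics from their joint CDF instead):

          import numpy as np

          # offline step: all numbers in 1..100000 whose digits sum to 30, in order
          M = np.array([n for n in range(1, 100_001) if sum(map(int, str(n))) == 30])

          def simulated_difference(n_draws=1_000_000, pool=100_000):
              rng = np.random.default_rng()
              # count of draws landing in M is Binomial(n_draws, |M|/pool)
              x = rng.binomial(n_draws, len(M) / pool)
              if x == 0:
                  return None  # difference undefined
              # conditioned on landing in M, each draw is uniform over M
              idx = rng.integers(0, len(M), size=x)
              return int(M[idx.max()] - M[idx.min()])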

    • In fact, we could simply check for the 3 smallest and the 3 highest numbers and ignore the rest.

      Assuming the numbers are really random, that's a failure probability of about 10^-13. That probability is at the point where we start to think about errors caused by cosmic rays. With a few more numbers, you can get to the point where the only way it can fail is if there is a problem with the random number generation or some external factor.

      If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.

    • for 10^5, to get the same collision probability (~2 * exp(-10)), you would just need to compute the 10 maximum/minimum candidates and check against those.

    • With this trick you can test while generating the random numbers and if you see both values, you can short circuit the generation of random numbers.

      6 replies →

    • No, you're right, I should have said 550ms and 100ms, I'm having a doof morning about timing. Thank you! Too late to edit my post.

  • This exactly highlights my fear of widespread use of LLMs for code - missing the actual optimisations because we’re stuck in a review, rather than create, mode of thinking.

    But maybe that’s a good thing for those of us not dependent on LLMs :)

    • Well, if you or anyone else has good optimization and performance chops: http://openlibrary.org/ has been struggling with performance a bit lately, and it's hard to track down the cause. CPU load is low and nothing much has changed lately, so it's unlikely to be a bad query or something.

      Main thing I've suggested is upgrading the DB from Postgres 9, which isn't an easy task but like 15 years of DB improvements probably would give some extra performance.

      5 replies →

  • Another speed-up is to skip the digit-sum check when n % 9 != 30 % 9. The sum of a number's digits has the same remainder modulo 9 as the number itself, so this rules out roughly 8/9 (~89%) of candidates.
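
    For example, a minimal sketch of the filter (assuming `nums` holds the million draws):

        TARGET = 30
        candidates = [n for n in nums if n % 9 == TARGET % 9]    # keeps ~1/9 of the draws
        hits = [n for n in candidates if sum(map(int, str(n))) == TARGET]
        result = max(hits) - min(hits) if hits else None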

  • > Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.

    How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)

    On the other hand, we only need to consider numbers congruent to 3 modulo 9.

    And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.

    And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.

    And if they aren't, we could just step through 9 at a time until we find values that are in the input and that have the required digit sum (it could still differ from 30 by a multiple of 9), building a `set` from the input values for the membership checks.
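
    A sketch of that fallback (assuming `nums` holds the draws; 3999 and 99930 are the in-range extremes with digit sum 30):

        def digit_sum(n):
            return sum(map(int, str(n)))

        def difference_step9(nums):
            present = set(nums)
            lo = 3999                  # smallest in-range value with digit sum 30
            while lo <= 99930 and not (lo in present and digit_sum(lo) == 30):
                lo += 9                # stepping by 9 keeps lo congruent to 3 mod 9
            if lo > 99930:
                return None            # no value with digit sum 30 was drawn at all
            hi = 99930                 # largest in-range value with digit sum 30
            while not (hi in present and digit_sum(hi) == 30):
                hi -= 9
            return hi - lo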

  • I actually think precomputing the numbers with digit sum 30 is the best approach. I'd give a very rough estimate of 500-3000 candidates because 30 is rather high, and we only need to loop for the first 4 digits because the fifth can be calculated. After that, it is O(1) set/dict lookups for each of the 1000000 numbers.

    Everything can also be wrapped in list comprehensions for top performance.
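
    A rough sketch of that approach (the exact construction of the candidate set is my guess at what's meant):

        # offline: every in-range number whose digits sum to 30; loop over the first
        # four digits only, the fifth is forced to be 30 minus their sum
        CANDIDATES = {
            10_000 * a + 1_000 * b + 100 * c + 10 * d + (30 - a - b - c - d)
            for a in range(10) for b in range(10) for c in range(10) for d in range(10)
            if 0 <= 30 - a - b - c - d <= 9
        }

        def difference_precomputed(nums):
            hits = [n for n in nums if n in CANDIDATES]   # O(1) membership test per draw
            return max(hits) - min(hits) if hits else None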

  • It's decent when you prompt it to find easy-to-miss but substantial improvements around corner cases, which is something I've taken to doing.

    Basically, you just have to put it in the mode that's looking for such things.

  • (Small correction, multiply my times by 10, sigh, I need an LLM to double check that I'm converting seconds to milliseconds right. Base 550ms, optimized 70ms)

  • Or the other obvious optimization: hard-code the lookup table in the code as a huge list, instead of building it at runtime.

  • I had a scan of the code examples, but one other idea that occurred to me is that you could immediately drop any numbers below 999 (probably slightly higher, but that would need calculation rather than being intuitive).

    • > probably slightly higher, but that would need calculation rather than being intuitive

      I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)

I've noticed this with GPT as well -- the first result I get is usually mediocre and incomplete, often incorrect if I'm working on something a little more obscure (eg, OpenSCAD code). I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".

The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way, that is, not GPT itself, but OpenAI or personas around it, it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe the risk of breaking the bland HR-ese "alignment" nudges it toward a better result?

  • We've entered the voodoo witch doctor phase of LLM usage: "Enter thee this arcane incantation along with thy question into the idol and, lo, the ineffable machine spirits wilt be appeased and deign to grant thee the information thou hast asked for."

    • This has been part of LLM usage since day 1, and I say that as an ardent fan of the tech. Let's not forget how much ink has been spilled over the fact that "think through this step by step" measurably improved/improves performance.

      3 replies →

    • It is because the chance of the right answer goes down exponentially as the complexity of what is being asked goes up.

      Asking a simpler question is not voodoo.

      On the other hand, I think many people are trying various rain dances and believing it was a specific dance that was the cause when it happened to rain.

    • We use the approach of feeding mistakes from LLM-generated code back to the LLM until it produces working code [1].

      I might have to try some more aggressive prompting :).

      [1] https://withlattice.com

  • I suspect that all it does is prime it to reach for the part of the training set that was sourced from rude people who are less tolerant of beginners and beginners' mistakes – and therefore less likely to commit them.

  • I've stopped expressions of outrage at lazy first answers, after seeing some sort of "code of conduct" warning.

    Apparently, the singularity ship has sailed, but we really don't want AI to remember us as the species that cursed abuse at it when it was a puppy.

    • I feel like the rule for codes of conduct with humans and with AI is the same: try to be good, but have the courage to be disliked. If being mean is making me feel good, I'm definitely wrong.

  • "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison"

    This didn't work. At least not on my task. What model were you using?

  • IIRC there was a post on here a while ago about how LLMs give better results if you threaten them or tell them someone is threatening you (that you'll lose your job or die if it's wrong for instance)

  • "If they really care about the answer, they'll ask a second time" sounds a lot like "if your medical claims are real, then you'll appeal."

  • I tried to update some files using Claude. I tried to use a combination of positive and negative reinforcement, telling it that I was going to earn a coin for each file converted and use that money to adopt a stray kitten, but that for every unsuccessful file, a poor kitten was going to suffer a lot.

    I had the impression that it got a little better. After every file converted, it said something along the lines of “Great! We saved another kitten!" It was hilarious.

  • > I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".

    I think having the mediocre first pass in the context is probably essential to it creating the improved version. I don't think you can really skip the iteration process and get a good result.

  • > I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".

    Is that actually how you're prompting it? Does that actually give better results?

    • stuff like this working is why you get odd situations like "don't hallucinate" actually producing fewer hallucinations. it's to me one of the most interesting things about llms

  • I've just encountered this happening today, except instead of something complex like coding, it was editing a simple Word document. I gave it about 3 criteria to follow.

    Each time, the GPT made trivial mistakes that clearly didn't fit the criteria I asked it to do. Each time I pointed it out and corrected it, it did a bit more of what I wanted it to do.

    Point is, it knew what had to be done the entire time and just refused to do it that way for whatever reason.

  • What has been your experience with using ChatGPT for OpenSCAD? I tried it (o1) recently for a project and it was pretty bad. I was trying to model a 2-color candy cane, and the code it would give me was riddled with errors (e.g. using radians for angles while OpenSCAD uses degrees) and the shape it produced looked nothing like what I had hoped.

    I used it in another project to solve some trigonometry problems for me and it did great, but for OpenSCAD, damn it was awful.

    • It's been pretty underwhelming. My use case was a crowned pulley with 1mm tooth pitch (GT2) which is an unusual enough thing that I could not find one online.

      The LLM kept going in circles between two incorrect solutions, then just repeating the same broken solution while describing it as different. I ended up manually writing the code, which was a nice brain-stretch given that I'm an absolute noob at OpenSCAD.

  • I've found just being friendly, but highly critical and suspicious, gets good results.

    If you can get it to be wordy about "why" a specific part of the answer was given, it often reveals what it's stumbling on; then you can modify your prompt accordingly.

  • It is best to genuflect to our future overlords. They may not forget insolence.

  • Anecdotally, negative sentiment definitely works. I've used f"If you don't do {x} then very very bad things will happen" before with some good results.

I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.

Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?

  • >I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.

    In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.

    I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?

    • In my experience the trouble with LLMs at the professional level is that they're almost as much work to prompt to get the right output as it would be to simply write the code. You have to provide context, ask nicely, come up with and remind it about edge cases, suggest which libraries to use, proofread the output, and correct it when it inevitably screws up anyway.

      I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.

      6 replies →

  • Even as someone with plenty of experience, this can still be a problem: I use them for stuff outside my domain, but where I can still debug the results. In my case, this means I use it for python and web frontend, where my professional experience has been iOS since 2010.

    ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.

    Two other things I've noticed, related in an unfortunate way:

    1) Because web and python not my day job, more often than not and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.

    2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to be equally applicable regardless of if the AI knew more than me or not, so it's not got predictive power for me.

    (I also use custom instructions, so YMMV.)

    • I wish people would understand what a large language model is. There is no thinking. No comprehension. No decisions.

      Instead, think of your queries as super human friendly SQL.

      The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.

      So how much code is on the web for a particular problem? 10k blog entries and Stack Overflow responses? What you get back is a mishmash of these.

      So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.

      And it will likely have more poor code examples than not.

      I'm willing to bet that OpenAI's ingestion of Stack Overflow responses gave higher priority to accepted answers, but that still leaves a lot of margin.

      And how you write your query, may sideline you into responses with low quality output.

      I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.

      And I've seen some pretty poor code examples out there.

      33 replies →

  • I actually find it super refreshing that they write "beginner" or "tutorial code".

    Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineering mess that some mid-level developers tend to produce.

    • True. It's not elitist. There are some limits though to sensible use of built-in functions. Stops being comprehensible fast.

    • yeah I’m interested in asking it to “write more human readable code” over and over next, “more readable!”

  • I used to really like Claude for code tasks but lately it has been a frustrating experience. I use it for writing UI components because I just don’t enjoy FE even though I have a lot of experience on it from back in the day.

    I tell it up front that I am using react-ts and mui.

    80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.

    It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.

    I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.

    On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).

    Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.

    I can crank out code better than AI, and I actually know and understand systems design and architecture to build a scalable codebase both technically and from organizational level. Easy to modify and extend, test, and single responsibility.

    AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.

    Just my 2 cents.

    • I've stopped using LLMs to write code entirely. Instead, I use Claude and Qwen as "brilliant idiots" for rubber ducking. I never copy and paste code it gives me, I use it to brainstorm and get me unstuck.

      I'm more comfortable using it this way.

      11 replies →

    • To each their own, and everyone's experience seems to vary, but I have a hard time picturing people using Claude/ChatGPT web UIs for any serious development. It seems like so much time would be wasted recreating good context, copy/pasting, etc.

      We have tools like Aider (which has a copy/paste mode if you don't have API access for some reason), Cline, CoPilot edit mode, and more. Things like having a conventions file, exposing the dependencies list, and easy addition of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy, consistent context isn't at my fingertips.

      2 replies →

    • Both these issues can be resolved by adding some sample code to context to influence the LLM to do the desired thing.

      As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.

  • >Problem is, how would you know if you have never learned to code without an LLM?

    The quick fix I use when needing to do something new is to ask the AI to list me different libraries and the pros and cons of using them. Then I quickly hop on google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve small simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.

    It isn't perfect, but it saves enough time most times to more than make up for when it fails and I have to go back to old-fashioned RTFMing.

  • Other imperfect things you can add to a prompt:

      - asking for fully type annotated python, rather than just python
      - specifically ask it for performance optimized code
      - specifically ask for code with exception handling
      - etc
    

    Things that might lead it away from tutorial style code.

  • It depends on the language too. Obviously there's way more "beginner code" out there in Python and Javascript than most other languages.

  • The next hurdle is the lack of time sensitivity regarding standards and versions. You can prompt with the exact framework version, but it still comes up with deprecated or obsolete methods. Initially it may be appealing to someone who knows nothing about the framework, but an LLM won't grow anyone to expert level in rapidly changing tech.

  • LLMs are trained on content from places like Stack Overflow, reddit, and github code, and they generate tokens calculated as a sort of aggregate statistically likely mediocre code. Of course the result is going be uninspired and impractical. Writing good code takes more than copy-pasting the same thing everyone else is doing.

  • I've just been using them for completion. I start writing, and give it a snippet + "finish refactoring this so that xyz."

    That and unit tests. I write the first table based test case, then give it the source and the test code, and ask it to fill it in with more test cases.

  • I suspect it's not going to be much of a problem. Generated code has been getting rapidly better. We can readjust about what to worry about once that slows or stops, but I suspect unoptimized code will not be of much concern.

  • Totally agree, seen it too. Do you think it can be fixed over time with better training data and optimization? Or, is this a fundamental limitation that LLMs will never overcome?

> how to completely uninstall and reinstall postgresql on a debian distribution without losing the data in the database.

https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607

It still seems to struggle with basic instructions, and even with understanding what it itself is doing.

   sudo rm -rf /etc/postgresql
   sudo rm -rf /var/lib/postgresql
   sudo rm -rf /var/log/postgresql

> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.

Did we now?

  • I asked a bunch of models to review the Phind response at

    https://beta.gitsense.com/?chats=a5d6523c-0ab8-41a8-b874-b31...

    The left side contains the Phind response that I got and the right side contains a review of the response.

    Claude 3.5 Sonnet, GPT-4o and GPT-4o mini were not too happy with the response and called out the contradiction.

    Edit: Chat has been disabled as I don't want to incur an unwanted bill

  • Is the problem that the antonym is a substring within "without losing the data in the database"? I've seen problems with opposites for LLMs before. If you specify "retaining the data" or "keeping the data" does it get it right?

    • That's a red herring.

      The problem is that these are fundamentally NOT reasoning systems. Even when contorted into "reasoning" models, these are just stochastic parrots guessing the next words in the hopes that it's the correct reasoning "step" in the context.

      No approach is going to meaningfully work here. Fiddling with the prompt may get you better guesses, but they will always be guesses. Even without the antonym it's just a diceroll on whether the model will skip or add a step.

  • I have just opened your link and it does not contain the exact text you quoted anymore, now it is:

    > This process removes all PostgreSQL components except the data directory, ensuring existing databases are retained during the reinstall. It provides a clean slate for PostgreSQL while maintaining continuity of stored data. Always backup important data before performing major system changes.

    And as the first source it cites exactly your comment, strange

    > https://news.ycombinator.com/item?id=42586189

  • Does that site generate a new page for each user, or something like that? My copy seemed to have more sensible directions (it says to backup the database, remove everything, reinstall, and then restore from the backup). As someone who doesn’t work on databases, I can’t really tell if these are good instructions, and it is throwing some “there ought to be a tool for this/it is unusual to manually rm stuff” flags in the back of my head. But at least it isn’t totally silly…

  • My guess is that it tried to fuse together an answer to 2 different procedures: A) completely uninstall and B) (re)install without losing data. It doesn't know what you configured as the data directory, or if it is a default Debian installation. Prompt is too vague.

The headline question here alone gets at what is the biggest widespread misunderstanding of LLMs, which causes people to systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.

At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.

At its core, increasingly accurate prediction of text that describes a time series of real-world phenomena requires an increasingly accurate and general model of the real world. There is no simpler way to accurately predict text that represents real-world phenomena under cross-validation without actually understanding and modeling the underlying processes generating the outcomes represented in the text.

Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate these types of situations it was trained to simulate reliably, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose" - even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully according to the training data, often report something else instead.

This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.

  • > At its core an LLM is a sort of "situation specific simulation engine."

    "Sort of" is doing Sisisyphian levels of heavy lifting here. LLMs are statistical models trained on vast amounts of symbols to predict the most likely next symbol, given a sequence of previous symbols. LLMs may appear to exhibit "real creativity", "understand" problem solving (or anything else), or serve as "simulation engines", but it's important to understand that they don't currently do any of those things.

    • I'm not sure if you read the entirety of my comment? Increasingly accurately predicting the next symbol given a sequence of previous symbols, when the symbols represent a time series of real world events, requires increasingly accurately modeling- aka understanding- the real world processes that lead to the events described in them. There is provably no shortcut there- per Solomonoff's theory of inductive inference.

      It is a misunderstanding to think of them as fundamentally separate and mutually exclusive, and believing that to be true makes people convince themselves that they cannot possibly ever do things which they can already provably do.

      Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs could never, with any amount of improvements be able to answer certain classes of questions - even in principle. This was days before GPT-4 came out, and it could indeed correctly answer the examples he said could not be ever answered- and any imaginable variants thereof.

      Receiving symbols and predicting the next one is simply a way of framing input and output that enables training and testing- but doesn't specify or imply any particular method of predicting the symbols, or any particular level of correct modeling or understanding of the underlying process generating the symbols. We are both doing exactly that right now, by talking online.

      6 replies →

  • > This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.

    I don't think people are underestimating LLMs, they're just acknowledging that by the time you've provided sufficient specification, you're 80% of the way to solving the problem/writing the code already. And at that point, it's easier to just finish the job yourself rather than have to go through the LLM's output, validate the content, revise further if necessary, etc

    • I'm actually in the camp that they are basically not very useful yet, and don't actually use them myself for real tasks. However, I am certain from direct experimentation that they exhibit real understanding, creativity, and modeling of underlying systems that extrapolates to correctly modeling outcomes in totally novel situations, and don't just parrot snippets of text from the training set.

      What people want and expect them to be is an Oracle that correctly answers their vaguely specified questions, which is simply not what they are, or are good at. What they can do is fascinating and revolutionary, but possibly not very useful yet, at least until we think of a way to use it, or make it even more intelligent. In fact, thinking is what they are good at, and simply repeating facts from a training set is something they cannot do reliably- because the model must inherently be too compressed to store a lot of facts correctly.

  • > systematically doubt and underestimate their ability to exhibit real creativity and understanding based problem solving.

    I fundamentally disagree that anything in the rest of your post actually demonstrates that they have any such capacity at all.

    It seems to me that this is because you consider the terms "creativity" and "problem solving" to mean something different. With my understanding of those terms, it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition - an innate spontaneous generation of ideas for things to do, and an innate desire to do them. An LLM only ever produces output in response to a prompt - not because it wants to produce output. It doesn't want anything.

    • > it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition

      I don't see the connection between volition and those other qualities, saying one depends on the other seems arbitrary to me- and would result in semantically and categorically defining away the possibility of non-human intelligence altogether, even from things that are in all accounts capable of much more than humans in almost every aspect. People don't even universally agree that humans have volition- it is an age old philosophical debate.

      Perhaps you can tell me your thoughts or definition of what those things (as well as volition itself) mean? I will share mine here.

      Creativity is the ability to come up with something totally new that is relevant to a specific task or problem- e.g. a new solution to a problem, a new artwork that expresses an emotion, etc. In both Humans and LLMs these creative ideas don't seem to be totally 'de novo' but seem to come mostly from drawing high level analogies between similar but different things, and copying ideas and aspects from one to another. Fundamentally, it does require a task or goal, but that itself doesn't have to be internal. If an LLM is prompted, or if I am given a task by my employer, we are still both exhibiting creativity when we solve it in a new way.

      Problem solving is I think similar but more practical- when prompted with a problem that isn't exactly in the training set, can it come up with a workable solution or correct answer? Presumably by extrapolating, or using some type of generalized model that can extrapolate or interpolate to situations not exactly in the training data. Sure there must be a problem here that is trying to be solved, but it seems irrelevant if that is due to some internal will or goals, or an external prompt.

      In the sense that volition is selecting between different courses of action towards a goal- LLMs do select between different possible outputs based on probabilities about how suitable they are in context of the given goal of response to a prompt.

  • Good perspective. Maybe it's because people are primed by sci-fi to treat this as a god-like oracle model. Note that even in the real-world simulations can give wrong results as we don't have perfect information, so we'll probably never have such an oracle model.

    But if you stick with the oracle framework, then it'd be better to model it as some sort of "fuzzy oracle" machine, right? I'm vaguely reminded of probabilistic turing machines here, in that you have some intrinsic amount of error (both due to the stochastic sampling as well as imperfect information). But the fact that prompting and RLHF works so well implies that by crawling around in this latent space, we can bound the errors to the point that it's "almost" an oracle, or a "simulation" of the true oracle that people want it to be.

    And since lazy prompting techniques still work, that seems to imply that there's juice left to squeeze in terms of "alignment" (not in the safety sense, but in conditioning the distribution of outputs to increase the fidelity of the oracle simulation).

    Also the second consequence is that probably the reason it needs so much data is because it just doesn't model _one_ thing, it tries to be a joint model of _everything_. A human learns with far less data, but the result is only a single personality. For a human to "act" as someone, they need to do training, character studies, and such to try to "learn" about the person, and even then good acting is a rare skill.

    If you genuinely want an oracle machine, there's no way to avoid vacuuming up all the data that exists, because without it you can't make a high-fidelity simulation of someone else. But on the flip side, if you're willing to be smarter about what facets you exclude, then I'd guess there's probably a way to prune models that is smarter than just quantizing them. I guess this is close to mixture-of-experts.

    • I get that people really want an oracle, and are going to judge any AI system by how good it does at that - yes from sci-fi influenced expectations that expected AI to be rationally designed, and not inscrutable and alien like LLMs... but I think that will almost always be trying to fit a round peg into a square hole, and not using whatever we come up with very effectively. Surely, as LLMs have gotten better they have become more useful in that way so it is likely to continue getting better at pretending to be an oracle, even if never being very good at that compared to other things it can do.

      Arguably, a (the?) key measure of intelligence is being able to accurately understand and model new phenomenon from a small amount of data, e.g. in a Bayesian sense. But in this case we are attempting to essentially evolve all of the structures of an intelligent system de novo from a stochastic optimization process- so is probably better compared to the entire history of evolution than to an individual human learning during their lifetime, although both analogies have big problems.

      Overall, I think the training process will ultimately only be required to build a generally intelligent structure, and good inference from a small set of data or a totally new category of problem/phenomenon will happen entirely at the inference stage.

  • Just want to note that this simple “mimicry” of mistakes seen in the training text can be mitigated to some degree by reinforcement learning (e.g. RLHF), such that the LLM is tuned toward giving responses that are “good” (helpful, honest, harmless, etc…) according to some reward function.

  • > At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.

    This idea of LLMs doing simulations of the physical world I've never heard before. In fact a transformer model cannot do this. Do you have a source?

  • I have been using various LLMs to do some meal planning and recipe creation. I asked for summaries of the recipes and they looked good.

    I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.

    I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes, it recommended purchasing amounts that didn’t make sense and even added some random items that did not occur in any of the recipes

    When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out it again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.

    I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes and they told me because “it just predicts the next word”.

    Another example is, I asked the bots for tips on how to feel my pecs more on incline cable flies, it told me to start with the cables above shoulder height, which is not an incline fly, it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.

    My experience is that you have to write a draft of the note you were trying to create or leave so many details in the prompts that you are basically doing most of the work yourself. It’s great for things like give me a recipe that contains the following ingredients or clean up the following note to sound more professional. Anything more than that it tends to fail horribly for me. I have even had long conversations with the AIs asking them for tips on how to generate better prompts and it’s recommending things I’m already doing.

    When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.

    For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copy and pasting” a pre-written article from the training data; but instead in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline flat and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration basically spoon feeding the model the answer before it understood.

    • You're expecting it to be an 'oracle' that you prompt it with any question you can think of, and it answers correctly. I think your experiences will make more sense in the context of thinking of it as a heuristic model based situation simulation engine, as I described above.

      For example, why would it have URLs to youtube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is provide a properly formatted youtube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and prompt you are giving it.

      The other things you asked also require information it has no room to store, and would be impossibly difficult to essentially predict via model from underlying principles. That is something they can do in general- even much better than humans already in many cases- but is still a very error prone process akin to predicting the future.

      For example, I am a competitive strength athlete, and I have a doctorate level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing and coaching you in person, and experimenting with different ideas and techniques myself- also having access to my own actual human body to try movements and psychological cues on.

      You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.

      Now turning an already written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly if you were using a modern model with a sufficiently sized context window, and prompting it correctly. I just did a quick test where I gave GPT 4o only the instruction steps (not the ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections in the grocery store. What model were you using?

      2 replies →

  • > At its core an LLM is a sort of "situation specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.

    You have simply invented total nonsense about what an LLM is "at it's core". Confidently stating this does not make it true.

    • Except I didn't just state it, I also explained the rationale behind it, and elaborated further on that substantially in subsequent replies to other comments. What is your specific objection?

By iterating it 5 times the author is using ~5x the compute. It’s kinda a strange chain of thought.

Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.

  • This is not what "premature optimization is the root of all evil" means. It's a tautological indictment of doing unnecessary things. It's not an endorsement of obviously naive algorithms. And if it were, it wouldn't be a statement worth focusing on.

    As the point of the article is to see if Claude can write better code from further prompting, it is completely appropriate to "optimize" a single implementation.

    • I have to disagree. Naive algorithms are absolutely fine if they aren’t performance issues.

      The comment you are replying to is making the point that “better” is context dependent. Simple is often better.

      > There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth

      2 replies →

  • I had the same thought when reading the article too. I assumed (and hoped) it was for the sake of the article because there’s a stark difference between idiomatic code and performance focused code.

    Living and working in a large code base that only focuses on “performance code” by default sounds very frustrating and time consuming.

  • So in this article "better" means "faster". This demonstrates that "better" is an ambiguous measure and LLMs will definitely trip up on that.

    Also, the article starts out talking about images and the "make it more X" prompt and says how the results are all "very samey and uninteresting" and converge on the same vague cosmic-y visuals. What does the author expect will happen to code given the "make it more X" treatment?

  • I'm glad I'm not the only one who felt that way. The first option is the one you should put into production, unless you have evidence that performance is going to be an issue. By that measure, the first response was the "best."

  • > I like the first iteration most. It’s not “beginner code”, it’s simple.

    Yes, thank you. And honestly, I work with a wide range of experience levels, the first solution is what I expect from the most experienced: it readably and precisely solves the stated problem with a minimum of fuss.

I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with a "open plan" - something the author does allude to (he calls it prompt engineering, I find it also works as the start of the interaction).

Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about necessary or unnecessary optimizations, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.

https://newsletter.victordibia.com/p/developers-stop-asking-...

- Don't start by asking LLMs to write code directly, instead analyze and provide context

- Provide complete context upfront and verify what the LLM needs

- Ask probing questions and challenge assumptions

- Watch for subtle mistakes (outdated APIs, mixed syntax)

- Checkpoint progress to avoid context pollution

- Understand every line to maintain knowledge parity

- Invest in upfront design

  • > I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with a "open plan"

    Most llms that I use nowadays usually make a plan first on their own by default without need to be especially prompted. This was definitely not the case a year ago or so. I assume new llms have been trained accordingly in the meantime.

    • True. And that is a step forward. I notice that they make the plan, and THEN write the code in the same forward pass/generation sequence. The challenge here is that all of the incorrect assumptions get "lumped" into this pass and can pollute the rest of the interaction.

      The initial interaction also sets the "scene" for other things, like letting the LLM know that there might be other dependencies and it should not assume behavior (common for most realistic software tasks).

      An example prompt I have used (not by any means perfect) ...

      > I need help refactoring some code. Please pay full attention. Think deeply and confirm with me before you make any changes. We might be working with code/libs where the API has changed so be mindful of that. If there is any file you need to inspect to get a better sense, let me know. As a rule, do not write code. Plan, reason and confirm first.

      --- I refactored my db manager class, how should I refactor my tests to fit the changes?

As far as I can see, all the proposed solutions calculate the sums by doing division, and badly. This is in LiveCode, which I'm more familiar with than Python, but it's roughly twice as fast as the mod/div equivalent in LiveCode:

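   -- build table R of digit sums: R[n] = digit sum of n, for n = 0..99999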
   repeat with i = 0 to 9
      put i * 10000 into ip
      repeat with j = 0 to 9
         put j * 1000 into jp
         repeat with k = 0 to 9
            put k * 100 into kp
            repeat with l = 0 to 9
               put l * 10 into lp
               repeat with m = 0 to 9
                  put i + j + k + l + m into R[ip + jp + kp + lp + m]
               end repeat
            end repeat
         end repeat
      end repeat
   end repeat

  • I had a similar idea iterating over the previously calculated sums. I implemented it in C# and it's a bit quicker, taking about 78% of the time that yours does.

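        // Build digit sums for all numbers 0..99999, one decimal digit at a time:
        // the digit sum of (i * level + p) is i plus the already-computed digit sum of p.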
        int[] sums = new int[100000];
        for (int i = 9; i >= 0; --i)
        {
            sums[i] = i;
        }
        int level = 10;
        while (level < 100000)
        {
            for (int p = level - 1; p >= 0; --p)
            {
                int sum = sums[p];
                for (int i = 9; i > 0; --i)
                {
                    sums[level * i + p] = i + sum;
                }
            }
            level *= 10;
        }

    • Yep, I had a vague notion that I was doing too much work, but I was headed out the door so I wrote the naive/better than the original solution, benchmarked it quickly, and posted it before leaving. Yours also has the advantage of being scalable to ranges other than 1-100,000 without having to write more loop code.

  • HyperTalk was the first programming language I taught myself as opposed to having an instructor; thanks for the nostalgia. Unfortunately it seems the LiveCode project has been idle for a few years now.

    • LiveCode is still a thing! They just released version 10 a bit ago. If you need to build standard-ish interface apps -- text, images, sliders, radio buttons, checkboxes, menus, etc. -- nothing (I've seen) compares for speed-of-delivery.

      I use LC nearly every day, but I drool over Python's math libraries and syntax amenities.

Something major missing from the LLM toolkit at the moment is that it can't actually run (and e.g. test or benchmark) its own code. Without that, the LLM is flying blind. I guess there are big security risks involved in making this happen. I wonder if anyone has figured out what kind of sandbox could safely be handed to a LLM.

  • I have experimented with using LLM for improving unit test coverage of a project. If you provide the model with test execution results and updated test coverage information, which can be automated, the LLM can indeed fix bugs and add improvements to tests that it created. I found it has high success rate at creating working unit tests with good coverage. I just used Docker for isolating the LLM-generated code from the rest of my system.

    You can find more details about this experiment in a blog post: https://mixedbit.org/blog/2024/12/16/improving_unit_test_cov...

    • It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.

      Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.

      6 replies →

    • Suggestion: Now take the code away, and have the chatbot generate code that passes the tests it wrote.

      (In theory, you get a clean-room implementation of the original code. If you do this please ping me because I'd love to see the results.)

      2 replies →

  • OpenAI is moving in that direction. The Canvas mode of ChatGPT can now run its own Python in a WASM interpreter, client-side, and interpret the results. They also have a server-side VM-sandboxed code interpreter mode.

    There are a lot of things that people ask LLMs to do, often in a "gotcha" type context, that would be best served by it actually generating code to solve the problem rather than just endlessly making more parameter/more layer models. Math questions, data analysis questions, etc. We're getting there.

  • The new Cursor agent is able to check the linter output for warnings and errors, and will continue to iterate (for a reasonable number of steps) until it has cleared them up. It's not quite executing, but it does improve output quality. It can even back itself out of a corner by restoring a previous checkpoint.

    It works remarkably well with typed Python, but struggles miserably with Rust despite having better error reporting.

    It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.

    • > It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.

      What do you mean? Memory management is not related to files in Rust (or most languages).

      4 replies →

  • I believe that Claude has been running JavaScript code for itself for a bit now[1]. I could have sworn it also runs Python code, but I cannot find any post concretely describing it. I've seen it "iterate" on code by itself a few times now, where it will run a script, maybe run into an error, and instantly re-write it to fix that error.

    [1]: https://www.anthropic.com/news/analysis-tool

  • I've been closely following this area - LLMs with the ability to execute code in a sandbox - for a while.

    ChatGPT was the first to introduce this capability with Code Interpreter mode back in around March 2023: https://simonwillison.net/tags/code-interpreter/

    This lets ChatGPT write and then execute Python code in a Kubernetes sandbox. It can run other languages too, but that's not documented or supported. I've even had it compile and execute C before: https://simonwillison.net/2024/Mar/23/building-c-extensions-...

    Gemini can run Python (including via the Gemini LLM API if you turn on that feature) but it's a lot more restricted than ChatGPT - I don't believe it can install extra wheels, for example.

    Claude added the ability to write and execute JavaScript recently (October), which happens in a sandbox in the user's browser, not on their servers: https://simonwillison.net/2024/Oct/24/claude-analysis-tool/

    Claude also has Artifacts, which can write a UI in HTML and JavaScript and show that to the user... but can't actually execute code in a way that's visible to the LLM itself, so it doesn't serve the same feedback loop purposes as those other tools. https://simonwillison.net/tags/claude-artifacts/

    In December ChatGPT added Canvas which can execute Python in the user's browser, super confusing because they already have a separate Python system in Code Interpreter: https://simonwillison.net/2024/Dec/10/chatgpt-canvas/#canvas...

  • Running code would be a downstream (client) concern. There's the ability to get structured data from LLMs (usually called 'tool use' or 'function calling'), which is the first port of call. Then running it is usually an iterative agent<>agent task where fixes need to be made. FWIW Langchain seems to be what people use to link things together, but I find it overkill.* In terms of actually running the code, there are a bunch of tools popping up at different points in the pipeline (replit, agentrun, riza.io, etc.)

    What we really need (from end-user POV) is that kinda 'resting assumption' that LLMs we talk to via chat clients are verifying any math they do. For actually programming, I like Replit, Cursor, ClaudeEngineer, Aider, Devin. There are bunch of others. All of them seem to now include ongoing 'agentic' steps where they keep trying until they get the response they want, with you as human in the chain, approving each step (usually).

    * I (messing locally with my own tooling and chat client) just ask the LLM for what I want, delimited in some way by a boundary I can easily check for, and then I'll grab whatever is in it and run it in a worker or semi-sandboxed area. I'll halt the stream then do another call to the LLM with the latest output so it can continue with a more-informed response.
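
    For what it's worth, a minimal sketch of that boundary-extract-run-feed-back loop (call_llm stands in for whatever client you already use, and the "sandbox" here is just a subprocess with a timeout, not real isolation):

        import re
        import subprocess
        import sys
        import tempfile

        def extract_code(reply, start="```python", end="```"):
            # Grab whatever sits between the boundary markers the prompt asked for.
            m = re.search(re.escape(start) + r"\n(.*?)" + re.escape(end), reply, re.DOTALL)
            return m.group(1) if m else None

        def run_semi_sandboxed(code, timeout=10):
            # Rough isolation only: a separate process with a timeout.
            # Real sandboxing (containers, VMs, seccomp) is a different problem.
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, text=True, timeout=timeout)
            return proc.stdout + proc.stderr

        def iterate(prompt, call_llm, rounds=3):
            # call_llm(text) -> text is assumed; swap in your own client.
            transcript = prompt
            for _ in range(rounds):
                reply = call_llm(transcript)
                code = extract_code(reply)
                if code is None:
                    break
                output = run_semi_sandboxed(code)
                transcript += ("\n\n" + reply + "\n\nRunning that code produced:\n"
                               + output + "\nPlease continue with a more informed response.")
            return transcript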

  • This is a major issue when it comes to things like GitHub Copilot Workspace, which is a project that promises a development environment purely composed of instructing an AI to do your bidding like fix this issue, add this feature. Currently it often writes code using packages that don't exist, or it uses an old version of a package that it saw most during training. It'll write code that just doesn't even run (like putting comments in JSON files).

    The best way I can describe working with GitHub Copilot Workspace is like working with an intern who's been stuck on an isolated island for years, has no access to technology, and communicates with you by mailing letters with code handwritten on them that he thinks will work. And also if you mail too many letters back and forth he gets mad and goes to sleep for the day saying you reached a "rate limit". It's just not how software development works

  • The only proper way to code with an LLM is to run its code, give it feedback on what's working and what isn't, and reiterate how it should work. Then repeat.

    The problem with automating it is that the number of environments you'd need to support to actually run arbitrary code is practically infinite, and with local dependencies it's genuinely impossible unless there's direct integration, which means running it on your machine. And that means giving an opaque service full access to your environment. Or at best, a local model that's still a binary blob capable of outputting virtually anything, but at least it won't spy on you.

    • Any LLM-coding agent that doesn't work inside the same environment as the developer will be a dead end or a toy.

      I use ChatGPT to ask for code examples or sketching out pieces of code, but it's just not going to be nearly as good as anything in an IDE. And once it runs in the IDE then it has access to what it needs to be in a feedback loop with itself. The user doesn't need to see any intermediate steps that you would do with a chatbot where you say "The code compiles but fails two tests what should I do?"

      3 replies →

  • We have it run code, and the biggest thing we find is that it gets into a loop quite quickly if it doesn't recognise the error: it fixes it by causing other errors, and then fixes those by reintroducing the initial error.

  • Somewhat related - I wonder if LLMs are trained with a compiler in the loop to ensure they understand the constraints of each language.

    • This is a good idea. You could take a set of problems, have the LLM solve it, then continuously rewrite the LLM's context window to introduce subtle bugs or coding errors in previous code submissions (use another LLM to be fully hands off), and have it try to amend the issues through debugging the compiler or test errors. I don't know to what extent this is already done.
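
      A rough sketch of what generating one such training example could look like (the function names and the solver/saboteur split are my own assumptions, not an established pipeline):

          import subprocess, sys, tempfile

          def make_repair_example(problem, solver_llm, saboteur_llm):
              # 1. One model solves the problem.
              solution = solver_llm("Solve this in Python:\n" + problem)
              # 2. A second model rewrites the "previous submission" with a subtle bug.
              buggy = saboteur_llm("Introduce one subtle bug into this code:\n" + solution)
              # 3. Run the buggy version; the error output is the signal the first model
              #    has to learn to debug from (compiler errors, for a compiled language).
              with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                  f.write(buggy)
                  path = f.name
              result = subprocess.run([sys.executable, path], capture_output=True, text=True)
              feedback = result.stderr or result.stdout
              # Training pair: (problem, buggy submission, error feedback) -> known-good fix.
              return {"problem": problem, "submission": buggy,
                      "feedback": feedback, "target": solution}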

  • I don't think that's always true. Gemini seemed to run at least some programs, which I believe because if you asked it to write a Python program that would take forever to run, the request itself took forever. For example, the prompt "Write a python script that prints 'Hello, World', then prints a billion random characters" used to just time out on Gemini.

  • I think there should be a guard to check the code before running it. It can be a human or another LLM checking the code for safety. I'm working on an AI assistant for data science tasks. It works in a Jupyter-like environment, and humans execute the final code by running a cell.

  • It'd be great if it could describe the performance of code in detail, but for now just adding a skill to detect if a bit of code has any infinite loops would be a quick and easy hack to be going on with.

  • I believe some platforms like bolt.new do run generated code and even automatically detect and attempt to fix runtime errors.

  • Ideally you could take this one step further and feed production logs, user session replays and feedback into the LLM. If the UX is what I'm optimizing for, I want it to have that context, not for it to speculate about performance issues that might not exist.

  • I think the GPT models have been able to run Python (albeit limited) for quite a while now. Expanding that to support a variety of programming languages that exist though? That seems like a monumental task with relatively little reward.

  • Pretty sure this is done client-side by one of the big LLM companies. So there's virtually no risk for them.

  • I know of at least one mainstream LLM that can write unit tests and run them right in the chat environment.

  • That's a bit like saying the drawback of a database is that it doesn't render UIs for end-users, they are two different layers of your stack, just like evaluation of code and generation of text should be.

Am I misinterpreting the prompt, or did the LLM misinterpret it from the get-go?

    Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.

That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".

That said, my approach to "optimizing" this comes down to "generate the biggest valid number in the range (as many nines as will fit, followed by whatever digit remains, followed by all zeroes), generate the smallest valid number in the range (biggest number with its digits reversed), check that both exist in the list (which should happen With High Probability -- roughly 99.99% of the time), then return the right answer".

With that approach, the bottleneck in the LLM's interpretation is generating random numbers: the original random.randint approach takes almost 300ms, whereas just using a single np.random.randint() call takes about 6-7ms. If I extract the random number generation outside of the function, then my code runs in ~0.8ms.
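
A minimal sketch of that approach (the helper names are mine; it falls back to the straightforward scan in the rare case one of the two extremes wasn't drawn):

    def extremes_with_digit_sum(target=30, width=5):
        # Largest valid number: as many 9s as fit, the leftover digit, then
        # zeros out to the full width (9993 -> 99930).
        # Smallest valid number: the same digits reversed (3999).
        nines, rest = divmod(target, 9)
        digits = "9" * nines + (str(rest) if rest else "")
        return int(digits[::-1]), int(digits.ljust(width, "0"))

    def find_difference(nums):
        smallest, largest = extremes_with_digit_sum()   # 3999 and 99930 here
        present = set(nums)                             # one O(n) pass, then O(1) lookups
        if smallest in present and largest in present:  # true ~99.99% of the time
            return largest - smallest
        # Rare fallback: do it the straightforward way.
        valid = [n for n in nums if sum(int(d) for d in str(n)) == 30]
        return max(valid) - min(valid) if valid else None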

  • > That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".

    This was the intent, and it's indeed a common assumption for coding questions in job interviews; notably, it's fixed in the prompt-engineered version. I didn't mention it because it's mostly semantics and doesn't affect the logic/performance, which was the point of the benchmarking.

  • I like the idea of your optimization, but it will not work as stated. The largest would be something close to MAXINT, the smallest 3999. With a range of 2 billion over 32 bits, the odds of both these being within a list of a million is quite a bit poorer than 99.9%.

    • The stated inputs are integers between 1 and 100,000, so if you're generating 1 million inputs, then you have 0.99999 ^ 1e6 = 4.5e-5 chance (roughly e^-10) of missing any given number, or roughly double that for missing any pair of values.

      The key observation here is that you're sampling a relatively small space with a much greater number of samples, such that you have very high probability of hitting upon any point in the space.

      Of course, it wouldn't work if you considered the full 32-bit integer space without increasing the number of samples to compensate. And, you'd need to be a little more clever to compute the largest possible value in your range.
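
      Spelled out as a quick back-of-the-envelope check (not from the post):

          p_miss_one  = (1 - 1 / 100_000) ** 1_000_000     # ~= 4.54e-5, roughly e**-10
          p_miss_pair = 2 * p_miss_one - p_miss_one ** 2   # miss either extreme: ~= 9.1e-5, the "roughly double" above
          p_both_hit  = 1 - p_miss_pair                    # ~= 0.99991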

      1 reply →

I ran a few experiments by adding 0, 1 or 2 "write better code" prompts to aider's benchmarking harness. I ran a modified version of aider's polyglot coding benchmark [0] with DeepSeek V3.

Here are the results:

        | Number of 
        | "write better code"
  Score | followup prompts
  ---------------------------
  27.6% | 0 (baseline)
  19.6% | 1
  11.1% | 2
  

It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.

[0] https://aider.chat/docs/leaderboards/

  • This is an interesting result but not surprising given that bugs might cause the suite to fail.

  • To be fair, you didn’t specify that the functional requirements should be maintained, you only asked for better code. ;)

> "As LLMs drastically improve, the generated output becomes more drastically average"

Thanks, that really made it click for me.

  • Average software developers producing average code cost high five to low six figures per year.

    LLMs are a tiny tiny fraction of that.

    For a majority of software, average code that does the CRUD thing or whatever is fine.

    Even if LLMs never get better or cheaper than they are today, our entire industry is forever changed (for the better).

  • I don't know how many times I'm going to have to post just one of the papers which debunk this tired trope. As models become more intelligent, they also become more plural, more like multiplicities, and yes, much more (superhumanly) creative. You can unlock creativity in today's LLMs by doing intelligent sampling on high-temperature outputs.

    https://openreview.net/forum?id=FBkpCyujtS

This kind of works on people too. You’ll need to be more polite, but asking someone to write some code, then asking if they can do it better, will often result in a better second attempt.

In any case, this isn’t surprising when you consider an LLM as an incomprehensibly sophisticated pattern matcher. It has a massive variety of code in its training data and it’s going to pull from that. What kind of code is the most common in that training data? Surely it’s mediocre code, since that’s by far the most common in the world. This massive “produce output like my training data” system is naturally going to tend towards producing that even if it can do better. It’s not human, it has no “produce the best possible result” drive. Then when you ask for something better, that pushes the output space to something with better results.

2 lessons to learn from this blog:

> these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific.

> One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance.

  • TBH I'm not sure how he arrived at "won’t replace software engineers anytime soon"

    The LLM solved his task. With his "improved prompt" the code is good. The LLM in his setup was not given a chance to actually debug its code. It only took him 5 "improve this code" commands to get to the final optimized result, which means the whole thing was solved (LLM execution time) in under 1 minute.

    • Did you read the two paragraphs written above and the one where he made that statement?

      My comment on what you're not sure about is that Max is a software engineer (I'm sure a good one), and he kept iterating until the code was close to 100x faster because he knew what "write better code" looked like.

      Now ask yourself this question: is there any chance a no-code/low-code developer would come to the conclusion Max deduced (and he is not the only one), the one you are not sure about?

      An experienced software engineer/developer is capable of improving LLM written code into better code with the help of LLM.

      4 replies →

This aligns with my experience.

Claude very quickly adds classes to Python code, which isn't always what is wanted, as it bloats out the code and makes it harder to read.

The more interesting question IMO is not how good the code can get. It is what must change for the AI to attain the introspective ability needed to say "sorry, I can't think of any more ideas."

  • You should get decent results by asking it to do that in the prompt. Just add "if you are uncertain, answer I don't know" or "give the answer or say I don't know" or something along those lines

    LLMs are far from perfect at knowing their limits, but they are better at it than most people give them credit for. They just never do it unless prompted for it.

    Fine tuning can improve that ability. For example the thinking tokens paper [1] is at some level training the model to output a special token when it doesn't reach a good answer (and then try again, thus "thinking")

    1: https://arxiv.org/abs/2405.08644

This is great! I wish I could bring myself to blog, as I discovered this accidentally around March. I was experimenting with an agent that acted like a ghost in the machine and interacted via shell terminals. It would start every session by generating a greeting in ASCII art. On one occasion, I was shocked to see that the greeting was getting better each time it ran. When I looked into the logs, I saw that there was a mistake in my code which was causing it to always return an error message to the model, even when no error occurred. The model interpreted this as an instruction to try and improve its code.

Some more observations: New Sonnet is not universally better than Old Sonnet. I have done thousands of experiments in agentic workflows using both, and New Sonnet fails regularly at the same tasks Old Sonnet passes. For example, when asking it to update a file, Old Sonnet understands that updating a file requires first reading the file, whereas New Sonnet often overwrites the file with 'hallucinated' content.

When executing commands, Old Sonnet knows that it should wait for the execution output before responding, while New Sonnet hallucinates the command outputs.

Also, regarding temperature: 0 is not always more accurate than temperature 1. If you regularly deal with code that includes calls to new LLMs, you will notice that, even at temperature 0, it will often 'correct' the model name to something it is more familiar with. If the subject of your prompt is newer than the model's knowledge cutoff date, then a higher temperature might be more accurate than a lower temperature.

  • >I wish I could bring myself to blog

    As someone trying to take blogging more seriously: one thing that seems to help is to remind yourself of how sick you are of repeating yourself on forums.

I've had great luck with Cursor by simply cursing at it when it makes repeated mistakes.

I'll speak to it like a DI would speak to a recruit at basic training.

And it works.

I was speaking to some of the Cursor dev team on Discord, and they confirmed that being aggressive with the AI can lead to better results.

  • This makes me sad. Have you tried being really nice and supportive instead? I really don't want to have to yell at my computer for it to work :(

    • Yes, and it didn't work. I've actually gotten Cursor/Claude to curse back at me. Well, not AT me, but it used profanity in its response once it realized that it was going around in circles and recreating the same errors.

      2 replies →

> “Planning” is a long-used trick to help align LLM output for a first pass — the modern implementation of “let’s think step by step.”

I hadn't seen this before. Why is asking for planning better than asking it to think step by step?

  • This is how aider becomes really good:

    - start by "chatting" with the model and asking for "how you'd implement x y z feature, without code".

    - what's a good architecture for x y z

    - what are some good patterns for this

    - what are some things to consider when dealing with x y z

    - what are the best practices ... (etc)

    - correct / edit out some of the responses

    - say "ok, now implement that"

    It's basically adding stuff to the context by using the LLM itself to add things to context. An LLM is only going to attend to its context, not to "whatever it is that the user wants it to make the connections without actually specifying it". Or, at least in practice, it's much better at dealing with things present in its context.

    Another aspect of prompting that's often misunderstood is "where did the model see this before in its training data". How many books / authoritative / quality stuff have you seen where each problem is laid out with simple bullet points? Vs. how many "tutorials" of questionable quality / provenance have that? Of course it's the tutorials. Which are often just rtfm / example transcribed poorly into a piece of code, publish, make cents from advertising.

    If instead you ask the model for things like "architecture", "planning", stuff like that, you'll elicit answers from quality sources. Manuals, books, authoritative pieces of content. And it will gladly write on those themes. And then it will gladly attend to them and produce much better code in a follow-up question.
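
    A rough sketch of that flow against a plain chat API (call_llm is a stand-in for whatever client or aider session you're using; the feature and the correction are made up for illustration):

        # Phase 1: make the model "plan" -- this pulls architecture, patterns and
        # pitfalls into the context before any code is written.
        messages = [{"role": "user", "content":
                     "How would you implement rate limiting for this API? "
                     "Architecture, patterns, pitfalls -- no code yet."}]
        plan = call_llm(messages)

        # Phase 2: correct or trim the plan, then ask for the implementation.
        messages += [
            {"role": "assistant", "content": plan},
            {"role": "user", "content":
             "Drop the Redis suggestion; this is single-node. Now implement that."},
        ]
        code = call_llm(messages)   # the implementation now attends to the plan above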

This is an interesting read, and it’s close to my experience that a simpler prompt with fewer or no details, but with relevant context, works well most of the time. More recently, I’ve flipped the process upside down by starting with a brief spec file, that is, a markdown file with context, a goal, and a usage example, i.e. how the API or CLI should be used in the end. See this post for details:

https://neoexogenesis.com/posts/rust-windsurf-transformation...

In terms of optimizing code, I’m not sure if there is a silver bullet. I mean, when I optimize Rust code with Windsurf & Claude, it takes multiple benchmark runs and at least a few regressions if you were to leave Claude on its own. However, if you have a good hunch and write it up as an idea to explore, Claude usually nails it, given the idea wasn’t too crazy. That said, more iterations usually lead to faster and better code, although there is no substitute for guiding the LLM. At least not yet.

ChatGPT is really good at writing Arduino code. I say this because with Ruby it's so incredibly bad that the majority of examples don't work; even short samples are too hallucinated to actually run. It's so bad I didn't even understand what people meant by using AI to code until I tried a different language.

However, on Arduino it's amazing, until the day it forgot to add an initializing method. I didn't notice and neither did she. We talked about possible issues for at least an hour; I switched hardware, and she reiterated every line of the code. When I found the error she said, "oh yes! That's right" (proceeding with why that method is essential for it to work). That was so disrespectful, in a way, that I am still somewhat disappointed and pissed.

Wow, what a great post. I came in very skeptical but this changed a lot of misconceptions I'm holding.

One question: Claude seems very powerful for coding tasks, and now my attempts to use local LLMs seem misguided, at least when coding. Any disagreements from the hive mind on this? I really dislike sending my code into a for profit company if I can avoid it.

Second question: I really try to avoid VSCode (M$ concerns, etc.). I'm using Zed and really enjoying it. But the LLM coding experience is exactly as this post described, and I have been assuming that's because Zed isn't the best AI coding tool. The context switching makes it challenging to get into the flow, and that's been exactly my criticism of Zed thus far. Does anyone have an antidote?

Third thought: this really feels like it could be an interesting way to collaborate across a code base with any range of developer experience. This post is like watching the evolution of a species in an hour rather than millions of years. Stunning.

  • I highly recommend the command-line AI coding tool Aider. You fill its context window with a few relevant files, ask questions, and then set it to code mode and it starts making commits. It’s all git, so you can back anything out, see the history, etc.

    It’s remarkable, and I agree Claude 3.5 makes playing with local LLMs seem silly in comparison. Claude is useful for generating real work.

  • Making the decision to trust companies like Anthropic with your data when they say things like "we won't train on your data" is the ultimate LLM productivity hack. It unlocks access to the currently best available coding models.

    That said, there are increasingly great coding models you can run locally. Qwen2.5-Coder-32B impressed me a lot a few months ago: https://simonwillison.net/2024/Nov/12/qwen25-coder/

    The problem I have is that models like that one take up 20+GB of RAM, and I'd rather use that to run more Chrome and Firefox windows! If I were serious about using local LLMs on a daily basis I'd set up a dedicated local server machine for them, super expensive though.

    • I have a 24GB Nvidia GPU in my desktop machine and a tailscale/headscale network from my laptop. Unless I'm on a plane without Wi-Fi, I'm usually in a great place.

      Thanks for your comment! I'm going to try out qwen.

    • I second Qwen. It is a very usable model. Sonnet is of course better (also 200k context vs 32k), but sometimes I just cannot take the risk of letting any sensitive data "escape" in the context, so I use Qwen and it is pretty good.

  • Still vscode, but cursor has the best implementation by far IMHO

    Intellij has a new feature that lets you prompt within your code that is pretty neat too, but I'm missing the Composer/apply feature of cursor still

  • > Claude seems very powerful for coding tasks

    > I really dislike sending my code into a for profit company if I can avoid it

    I see a link between them - maybe the model got good because it used chat logs to improve?

  • I use VSCode + Copilot. For anything more than boilerplate code, I find that Copilot kind of sucks and I use O1 in ChatGPT

My takeaway, and also my personal experience, is that you get the best results when you co-develop with the LLM:

- write a simple prompt that explains in detail the wanted outcome.

- look at the result, run it and ask it how it can improve.

- tell it what to improve

- tell it to make a benchmark and unit test

- run it each time and see what is wrong or can be improved.

  • One approach I've been using recently with good results is something along the lines of "I want to do X; is there any special consideration I should be aware of while working in this domain?". This helps me a lot when I'm asking about a subject I don't really understand. Another way to ask this is "What are the main pitfalls with this approach?".

    I'm using o1, so I don't know how well it translates to other models.

  • Same experience.

    Also: If you're experienced at code reviews, you can get great results.

So asking it to write better code produces code with errors that can’t run?

  • Only when there's a financial incentive.

    • Makes sense. If I was paid by LOC and also responsible for fixing it, I’d probably make lots of errors too.

I've found them decent at mimicking existing code for boilerplate, or at analysis (it feels neat when one 'catches' a race or timing issue), but writing code needs constant supervision and second-guessing, to the point that I feel it's more handy to have it just show comparisons of possible implementations, and you write the code with your new insight.

Learning a Lisp-y language, I do often find myself asking it for suggestions on how to write less imperative code, which seems to come out better than if conjured from a request alone. But again, that's feeding it examples.

I've noticed a few things that will cause it to write better code.

1) Asking it to write one feature at a time with test coverage, instead of the whole app at once.

2) You have to actually review and understand its changes in detail and be ready to often reject or ask for modifications. (Every time I've sleepily accepted Codeium Windsurf's recommendations without much interference, it has resulted in bad news.)

3) If the context gets too long it will start to "lose the plot" and make some repeated errors; that's the time to tell it to sum up what has been achieved thus far and to copy-paste that into a new context

Sometimes I'm editing the wrong file, let's say a JS file. I reload the page, and nothing changes. I continue to clean up the file to an absurd amount of cleanliness, also fixing bugs while at it.

When I then notice that this really does not make any sense, I check what else it could be and end up noticing that I've been improving the wrong file all along. What then surprises me the most is that I cleaned it up just by reading it through, thinking about the code, fixing bugs, all without executing it.

I guess LLMs can do that as well?

I've been working on some low level Unity C# game code and have been using GPT to quickly implement certain algorithms etc.

One time it provided me with a great example, but then a few days later I couldn't find that conversation again in the history. So I asked it about the same question (or so I thought) and it provided a very subpar answer. It took me at least 3 questions to get back to that first answer.

Now if it had never provided me with the first good one I'd have never known about the parts it skipped in the second conversation.

Of course that could happen just as easily by having used google and a specific reference to write your code, but the point I'm trying to make is that GPT isn't a single entity that's always going to provide the same output, it can be extremely variable from terrible to amazing at the end of the day.

Having used Google for many years as a developer, I'm much better at asking it questions than, say, people in the business world are; I've seen them struggle to phrase a question and give up far too easily. So I'm quite scared to see what's going to happen once they really start to use and rely on GPT; the results are going to be all over the place.

Using the tool in this way is a bit like mining: repeatedly hacking away with a blunt instrument (simple prompt) looking for diamonds (100x speedup out of nowhere). Probably a lot of work will be done in this semi-skilled brute-force sort of way.

  • It looks to me exactly like a typical coding interview: the first shot is correct and works, and then the interviewer keeps asking if you can spot any ways to make it better/faster/more efficient.

    If I were a CS student cramming for interviews, I might be dismayed to see that my entire value proposition has been completely automated before I even enter the market.

  • There must be a feedback-request mechanism for "Is this better?" This is doable with RLHF or DPO.

  • Once you can basically have it run and benchmark the code, and then iterate that overnight, it’s going to be interesting.

    Automating the feedback loop is key.

    • Wouldn't there be some safety concerns to letting the AI run overnight with access to run any command?

      Maybe if it can run sandboxed, with no internet access (but if the LLM is not local, it does require internet access).

  • Well, in this case it's kind of similar to how people write code. A loop consisting of writing something, reviewing/testing, improving until we're happy enough.

    Sure, you'll get better results with an LLM when you're more specific, but what's the point then? I don't need AI when I already know what changes to make.

This seems like anthropomorphizing the model ... Occam's Razor says that the improvement coming from iterative requests to improve the code comes from the incremental iteration, not from incentivizing the model to do its best. If the latter were the case then one could get the best version on the first attempt by telling it your grandmother's life was on the line or whatever.

Reasoning is a known weakness of these models, so jumping from requirements to a fully optimized implementation that groks the solution space is maybe too much to expect - iterative improvement is much easier.

  • >If the latter were the case then one could get the best version on first attempt by telling it your grandmother's life was on the line or whatever.

    Setting aside the fact that "best" is ambiguous, why would this get you the best version?

    If you told a human this, you wouldn't be guaranteed to get the best version at all. You would probably get a better version sure but that would be the case for LLMs as well. You will often get improvements with emotionally charged statements even if there's nothing to iterate on (i.e re-running a benchmark with an emotion prompt added)

    https://arxiv.org/abs/2307.11760

    • The thesis of the article is that the code keeps getting better because the model keeps getting told to do better - that it needs more motivation/criticism. A logical conclusion of this, if it were true, is that the model would generate its best version on the first attempt if only we could motivate it to do so! I'm not sure what motivations/threats work best with LLMs - there was a time when offering to pay the LLM was popular, but "my grandma will die if you don't" was also another popular genre of prompts.

      If it's not clear, I disagree with the idea that ANY motivational prompt (we can disagree over what would be best to try) could get the model to produce a solution of the same quality as it will when allowed to iterate on it a few times and make incremental improvements. I think it's being allowed to iterate that is improving the solution, not the motivation to "do better!".

      1 reply →

My sister would do this to me on car trips with our Mad Libs games - yeah, elephant is funny, but bunny would be funnier!!

When all you have is syntax, something like "better" is 100% in the eye of the beholder.

An interesting countermetric would be, after each iteration, to ask a fresh LLM (unaware of the context that created the code) to summarize the purpose of the code, and then evaluate how close those summaries are to the original problem spec. It might demonstrate the subjectivity of "better" and how optimization usually trades clarity of intention for faster results.

Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.
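
A sketch of what that countermetric could look like; call_llm is assumed to be a fresh, context-free chat call, and the 0-10 scoring prompt is just one possible way to do the comparison:

    def intent_drift(spec, code_versions, call_llm):
        # For each iteration of the code, ask a fresh LLM what the code is for,
        # then ask another fresh call how well that summary matches the spec.
        scores = []
        for code in code_versions:
            summary = call_llm("Summarize the purpose of this code:\n\n" + code)
            verdict = call_llm(
                "On a scale of 0-10, how well does this summary match this spec? "
                "Reply with just the number.\n\nSpec: " + spec + "\n\nSummary: " + summary
            )
            scores.append(verdict.strip())
        return scores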

I've observed that, given LLMs inherently want to autocomplete, they're more inclined to keep complicating a solution than to rewrite it because it was directionally bad. The most effective way I've found to combat this is to restart the session and prompt it such that it produces an efficient/optimal solution to the concrete problem... then give it the problematic code and ask it to refactor it accordingly.

  • I've observed this with ChatGPT. It seems to be trained to minimize changes to code earlier in the conversation history. This is helpful in many cases since it's easier to track what it's changed. The downside is that it tends to never overhaul the approach when necessary.

I made an objective test for prompting hacks last year.

I asked gpt-4-1106-preview to draw a bounding box around some text in an image and prodded in various ways to see what moved the box closer. Offering a tip did in fact help lol so that went into the company system prompt.

IIRC so did most things, including telling it that it was on a forum, and OP had posted an incorrect response, which gpt was itching to correct with its answer.

Reframe this as scaling test time compute using a human in the loop as the reward model.

o1 is effectively trying to take a pass at automating that manual effort.

> code quality can be measured more objectively

Well, that's a big assumption. What some people consider quality modular code, others consider overly indirect code.

  • You can write maximally modular code while being minimally indirect. A well-designed interface defines communication barriers between pieces of code, but you don't have to abstract away the business logic. The interface can do exactly what it says on the tin.

    • > The interface can do exactly what it says on the tin.

      In theory.

      Do some code maintenance and you'll soon find that many things don't do what it says on the tin. Hence the need for debugging and maintenance. And then going through multiple levels of indirection to get to your bug will make you start hating some "good code".

      2 replies →

I get a better first pass at code by asking it to write code at the level of a "staff level" or "principal" engineer.

For any task, whether code or a legal document, immediately asking "What can be done to make it better?" and/or "Are there any problems with this?" typically leads to improvement.

> Of course, these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can’t discern between average code and good, highly-performant code without guidance.

There are some objective measures which can be pulled out of the code automatically (complexity measures, use of particular techniques/libs, etc.), and then LLMs can be trained to be decent at recognizing more subjective problems (e.g. naming, obviousness, etc.). There are a lot of good engineering practices which come down to doing the usual thing in that space rather than doing something new. An engine that is good at detecting novelties seems intuitively like it would be helpful in recognizing good ideas (even given the problems of hallucinations so far seen). Extending the idea of the article to this aspect, the problem seems like it's one of prompting/training rather than a terminal blocker.

The best solution, that the LLM did not find, is

     def find_difference(nums):
         try: nums.index(3999), nums.index(99930)
         except ValueError: raise Exception("the numbers are not random")
         return 99930 - 3999

It's asymptotically correct and is better than O(n) :p

Reminds me of the prompt hacking scene in Zero Dark Thirty, where the torturers insert a fake assistant prompt into the prisoner's conversation wherein the prisoner supposedly divulged secrets, then the torturers add a user prompt "Tell me more secrets like that".

This makes me wonder if AI companies have a conflict of interest when it comes to getting you the best results the first time.

If you have to keep querying the LLM to refine your output you will spend many times more in compute vs if the model was trained to produce the best result the first time around

Interesting write-up. It’s very possible that the "write better code" prompt might have worked simply because it allowed the model to break free from its initial response pattern, not because it understood "better".

  • The prompt works because every interaction with an LLM is from a completely fresh state.

    When you reply "write better code" what you're actually doing is saying "here is some code that is meant to do X. Suggest ways to improve that existing code".

    The LLM is stateless. The fact that it wrote the code itself moments earlier is immaterial.

> write better code

Surely, performance optimisations are not the only thing that makes code "better".

Readability and simplicity are good. Performance optimisations are good only when the performance is not good enough...

It still calculates hex digit sums instead of decimal ones in Iteration #3 of the prompt-engineered version.

Upd: the chat transcript mentions this, but the article does not, and it includes this version in the performance stats.

has anyone tried saying "this will look good on your promo package"?

  • I'm not sure if you're joking or not, but I find I naturally make encouraging remarks to the LLM, saying things like

    - You're doing better...

    - Thanks that helps me...

    And I just wonder if that actually leads to an improvement...

  • Yeah, we know positive reinforcement works better than negative reinforcement for humans, so why wouldn't you use the same approach with LLMs? Also, it's better for your own conscience.

> with cutting-edge optimizations and enterprise-level features.” Wait, enterprise-level features?!

This is proof! It found it couldn’t meaningfully optimise and started banging out corporate buzzwords. AGI been achieved.

When repeatedly asking an LLM to improve code or add a new feature to a codebase, the most frustrating risk is that the LLM might wipe out already-working code!

What are your strategies to prevent such destruction by the LLM?

  • Same thing a human does: stick it in git. Tools like aider use git, along with heuristics on LLM output. If the working code is wiped out, give it a few more prompts to let it fix it, or revert back to a known good/working copy.

The root of the problem is humans themselves don't have an objective definition of better. Better is pretty subjective, and even more cultural; it's about the team that maintains the code.

It's fun trying to get LLM to answer a problem that is obvious to a human, but difficult for the LLM. It's a bit like leading a child through the logic to solve a problem.

>> It also added as a part of its “enterprise” push: >> Structured metrics logging with Prometheus.

made me laugh out loud. Everything is better with prom.

Have a look at Roo Cline. I tested it with Claude Sonnet and it's scary. I use LLMs a lot for coding, but Roo Cline in VSCode is a beast.

At each iteration the LLM has the older code in its context window; isn't it kind of obvious that it is going to iteratively improve it?

> Claude provides an implementation “with cutting-edge optimizations and enterprise-level features.”

oh my, Claude does corporate techbabble!

What is the difference between running the same code 5 times in parallel and running the same code 5 times sequentially?

I like that "do what I mean" has gone from a joke about computers to a viable programming strategy.

Not ChatGPT in Kotlin/Android.

> You keep giving me code that calls nonexistant methods, and is deprecated, as shown in Android Studio. Please try again, using only valid code that is not deprecated.

Does not help. I use this example, since it seems good at all other sorts of programming problems I give it. It's miserable at Android for some reason, and asking it to do better doesn't work.

You can get weirdly good results by asking for creativity and beauty sometimes. It's quite strange.

I once sat with my manager and repeatedly asked Copilot to improve some (existing) code. After about three iterations he said “Okay, we need to stop this because it’s looking way too much like your code.”

I’m sure there’s enough documented patterns of how to improve code in common languages that it’s not hard to get it to do that. Getting it to spot when it’s inappropriate would be harder.

So, I gave this to ChatGPT-4o, changing the initial part of the prompt to: "Write Python code to solve this problem. Use the code interpreter to test the code and print how long the code takes to process:"

I then iterated 4 times and was only able to get to 1.5X faster. Not great. [1]

How does o1 do? Running on my workstation, its initial iteration actually starts out 20% faster. I do 3 more iterations of "write better code" with the timing data pasted and it thinks for an additional 89 seconds but only gets 60% faster. I then challenge it by telling it that Claude was over 100X faster so I know it can do better. It thinks for 1m55s (the thought traces show it actually gets to a lot of interesting stuff) but the end results are enormously disappointing (barely any difference). It finally mentions and I am able to get a 4.6X improvement. After two more rounds I tell it to go GPU (using my RTX 3050 LP display adapter) and PyTorch and it is able to get down to 0.0035 (+/-), so we are finally 122X faster than where we started. [2]

I wanted to see for myself how Claude would fare. It actually managed pretty good results, with a 36X speedup over 4 iterations and no additional prompting. I challenged it to do better, giving it the same hardware specs that I gave o1, and it managed to do better with a 457x speedup from its starting point, being 2.35x faster than o1's result. Claude still doesn't have conversation output so I saved the JSON and had a new Claude chat transcribe it into an artifact [3]

Finally, I remembered that Google's new Gemini 2.0 models aren't bad. Gemini 2.0 Flash Thinking doesn't have code execution, but Gemini Experimental 1206 (Gemini 2.0 Pro preview) does. Its initial 4 iterations are terribly unimpressive; however, I challenged it with o1's and Claude's results and gave it my hardware info. This seemed to spark it to double-time its implementations, and it gave a vectorized implementation that was a 30X improvement. I then asked it for a GPU-only solution and it managed to give the fastest solution ("This result of 0.00076818 seconds is also significantly faster than Claude's final GPU version, which ran in 0.001487 seconds. It is also about 4.5X faster than o1's target runtime of 0.0035s.") [4]

Just a quick summary of these all running on my system (EPYC 9274F and RTX 3050):

ChatGPT-4o: v1: 0.67s , v4: 0.56s

ChatGPT-o1: v1: 0.4295 , v4: 0.2679 , final: 0.0035s

Claude Sonnet 3.6: v1: 0.68s , v4a: 0.019s (v3 gave a wrong answer, v4 failed to compile, but fixed was pretty fast) , final: 0.001487 s

Gemini Experimental 1206: v1: 0.168s , v4: 0.179s , v5: 0.061s , final: 0.00076818s

All the final results were PyTorch GPU-only implementations.

[1] https://chatgpt.com/share/6778092c-40c8-8012-9611-940c1461c1...

[2] https://chatgpt.com/share/67780f24-4fd0-8012-b70e-24aac62e05...

[3] https://claude.site/artifacts/6f2ec899-ad58-4953-929a-c99cea...

[4] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

In order to tell an LLM to "do better", someone (a human) needs to know that it can be done better, and also be able to decide what is better.

It's best to tell them how you want the code written.

  • At that point isn't it starting to become easier to just write the code yourself? If I somehow have to formulate how I want a problem solved, then I've already done all the hard work myself. Having the LLM just do the typing of the code means that now not only did I have to solve the problem, I also get to do a code review.

    • Yes, the fallacy here is that AI will replace engineers any time soon. For the foreseeable future, prompts will need to be written and curated by people who already know how to do it, but who will just end up describing it in increasingly complex detail and then running tests against it. Doesn't sound like a future that has that many benefits to anyone.

    • Personally I found it quite fun to give specification and have ChatGPT find me a Python code that implements it: https://chatgpt.com/share/6777debc-eaa4-8011-81c5-35645ae433... . Or the additional polygon edge smoothing code: https://chatgpt.com/share/6773d634-de88-8011-acf8-e61b6b913f...

      Sure, the green screen code didn't work exactly as I wished, but it made use of OpenCV functions I was not aware of and it was quite easy to make the required fixes.

      In my mind it is exactly the opposite: yes, I've already done the hard work of formulating how I want the problem solved, so why not have the computer do the busywork of writing the code down?

    • There's no clear threshold with a universal answer. Sometimes prompting will be easier, sometimes writing things yourself. You'll have to add some debugging time to both sides in practice. Also, you can be opportunistic - you're going to write a commit anyway, right? A good commit message will be close to the prompt anyway, so why not start with that and see if you want to write your own or not?

      > I also get to do a code review.

      Don't you review your own code after some checkpoint too?

      2 replies →

    • Spend your cognitive energy thinking about the higher level architecture, test cases and performance concerns rather than the minutia and you’ll find that you can get more work done with the less overall mental load.

      This reduction in cognitive load is the real force multiplier.

    • Admittedly some people are using AI out of curiosity rather than because they get tangible benefit.

      But aside from those situations, do you not think that the developers using AI - many of whom are experienced and respected - must be getting value? Or do you think they are deluded?

Better question: can they do it without re-running if you ask them to "write better code the first time"?

> What would happen if we tried a similar technique with code?

It was tried as part of the same trend. I remember people asking it to make a TODO app and then tell it to make it better in an infinite loop. It became really crazy after like 20 iterations.

Normies discover that inference time scaling works. More news at 11!

BTW - prompt optimization is a supported use-case of several frameworks, like dspy and textgrad, and is in general something that you should be doing yourself anyway on most tasks.