As an experienced LLM user, I don't use generative LLMs often

(minimaxir.com)

There's a thru-line to commentary from experienced programmers on working with LLMs, and it's confusing to me:

Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions, which requires documentation deep dives to confirm, which became annoying.

The post does later touch on coding agents (Max doesn't use them because "they're distracting", which, as a person who can't even stand autocomplete, is a position I'm sympathetic to), but still: coding agents solve the core problem he just described. "Raw" LLMs set loose on coding tasks throwing code onto a blank page hallucinate stuff. But agenty LLM configurations aren't just the LLM; they're also code that structures the LLM interactions. When the LLM behind a coding agent hallucinates a function, the program doesn't compile, the agent notices it, and the LLM iterates. You don't even notice it's happening unless you're watching very carefully.
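
To make the mechanism concrete, here is a minimal sketch of what that structuring code does, not any particular agent's implementation; `ask_llm` is a hypothetical stand-in for whatever model API the agent wraps:

  import subprocess, sys, tempfile

  def check(code: str) -> str:
      """Return compile errors for candidate Python code, or '' if it is clean."""
      with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
          f.write(code)
          path = f.name
      # py_compile only catches syntax errors; a real agent also runs linters and tests
      result = subprocess.run([sys.executable, "-m", "py_compile", path],
                              capture_output=True, text=True)
      return result.stderr

  def agent_loop(task: str, ask_llm, max_iters: int = 5) -> str:
      """ask_llm(prompt) -> code string; a stand-in for any model call."""
      code = ask_llm(task)
      for _ in range(max_iters):
          errors = check(code)
          if not errors:
              return code  # clean: hand it back to the user
          # feed the tool output back and let the model revise
          code = ask_llm(f"{task}\n\nYour previous attempt failed:\n{errors}\nFix it.")
      return code  # out of retries; this is the "spinning its wheels" case

The errors a raw chat workflow would surface to the human are consumed by the loop instead; you only see the final result.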

  • So in my interactions with gpt, o3 and o4 mini, I am the organic middle man that copies and pastes code into the REPL and reports the output back to gpt if anything turns out to be a problem. And for me, past a certain point, even if you continually report back problems, it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process. Maybe the llms you are using are better than the ones I tried this with?

    Specifically I was researching a lesser-known kafka-mqtt connector: https://docs.lenses.io/latest/connectors/kafka-connectors/si..., and o1 was hallucinating the configuration needed to support dynamic topics. The docs said one thing, and I even mentioned to o1 that the docs contradicted it. But it would stick to its guns. If I mentioned that the code wouldn't compile, it would start suggesting very implausible scenarios -- did you spell this correctly? Responses like that indicate you've reached a dead end. I'm curious how/if the "structured LLM interactions" you mention overcome this.

    • > And for me, past a certain point, even if you continually report back problems it doesn't get any better in its new suggestions. It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

      It sucks, but the trick is to always restart the conversation/chat with a new message. I never go beyond one reply, and I also copy-paste a bunch. I got tired of copy-pasting, so I wrote something like a prompting manager (https://github.com/victorb/prompta) to make it easier and to avoid having to neatly format code blocks and so on.

      Basically, make one message; if the model gets the reply wrong, iterate on the prompt itself and start fresh, always. Don't try to correct it by adding another message; update the initial prompt to make it clearer/steer more.

      But I've noticed that every model degrades really quickly past the initial reply, no matter the length of each individual message. The companies seem to keep increasing the theoretical and practical context limits, but the quality degrades a lot faster even within those limits, and they don't seem to try to address that (nor have a way of measuring it).

      16 replies →

    • I refuse to stop being a middle man, because I can often catch a really bad implementation early and course correct, e.g. a function which solves a problem with a series of nested loops when it could be done several orders of magnitude faster using vectorised operations offered by common packages like numpy (see the sketch at the end of this comment).

      Even with all the coding agent magik people harp on about, I've never seen something that can write clean, good-quality code reliably. I'd prefer to tell an LLM what a function's purpose is, what kind of information and data structures it can expect and what it should output, see what it produces, provide feedback, and get a rather workable, often perfect, function in return.

      If I get it to write the whole thing in one go, I cannot imagine the pain of having to find out where the fuckery is that slows everything down, without diving deep with profilers etc., all for a problem I could have solved by just playing middle man, keeping a close eye on how things are building up, and staying in charge of ensuring the overarching vision is achieved as required.
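
      To illustrate the nested-loops-versus-vectorisation point (a made-up example, not the commenter's actual code):

        import numpy as np

        # the kind of implementation worth catching early: O(n*m) Python-level loops
        def pairwise_abs_diff_slow(a, b):
            out = np.empty((len(a), len(b)))
            for i, x in enumerate(a):
                for j, y in enumerate(b):
                    out[i, j] = abs(x - y)
            return out

        # the vectorised equivalent via broadcasting, typically orders of magnitude faster
        def pairwise_abs_diff_fast(a, b):
            return np.abs(a[:, None] - b[None, :])

        a, b = np.random.rand(2000), np.random.rand(2000)
        assert np.allclose(pairwise_abs_diff_slow(a, b), pairwise_abs_diff_fast(a, b))

      The broadcasting version avoids the Python-level loops entirely, which is where the orders-of-magnitude difference comes from.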

    • > If I mentioned that the code wouldn't compile it would start suggesting very implausible scenarios

      I have to chuckle at that because it reminds me of a typical response on technical forums long before LLMs were invented.

      Maybe the LLM has actually learned from those responses and is imitating them.

      9 replies →

    • Seriously Cursor (using Claude 3.5) does this all the time. It ends up with a pile of junk because it will introduce errors while fixing something, then go in a loop trying to fix the errors it created and slap more garbage on those.

      Because it’s directly editing code in the IDE instead of me transferring sections of code from a chat window, the large amount of bad code it writes is much more apparent.

    • > It will just spin its wheels. So for that reason I'm a little skeptical about the value of automating this process.

      The question is whether you'd rather find out it got stuck in a loop after 3 minutes with a coding agent or after 40 minutes of copy-pasting. It can also get out of loops more often by being able to use tools to look up definitions with grep, ctags or language server tools; you can copy-paste commands for that too, but it will be much slower.

  • For several moments in the article I had to struggle to continue. He is literally saying "as an experienced LLM user I have no experience with the latest tools". He gives a rationale as to why he hasn't used the latest tools which is basically that he doesn't believe they will help and doesn't want to pay the cost to find out.

    I think if you are going to claim you have an opinion based on experience you should probably, at the least, experience the thing you are trying to state your opinion on. It's probably not enough to imagine the experience you would have and then go with that.

  • He does partially address this elsewhere in the blog post. It seems that he's mostly concerned about surprise costs:

    > On paper, coding agents should be able to address my complaints with LLM-generated code reliability since it inherently double-checks itself and it’s able to incorporate the context of an entire code project. However, I have also heard the horror stories of people spending hundreds of dollars by accident and not get anything that solves their coding problems. There’s a fine line between experimenting with code generation and gambling with code generation.

  • > But agenty LLM configurations aren't just the LLM; they're also code that structures the LLM interactions. When the LLM behind a coding agent hallucinates a function, the program doesn't compile, the agent notices it, and the LLM iterates.

    This describes the simplest and most benign case of code assistants messing up. This isn't the problem.

    The problem is when the code does compile, but contains logical errors, security f_ckups, performance dragdowns, or missed functionality. Because none of those will be caught by something as obvious as a compiler error.

    And no, "let the AI write tests" won't catch them either, because that's not a solution, that's just kicking the can down the road... because if we cannot trust the AI to write correct code, why would we assume that it can write correct tests for that code?

    What will ultimately catch those is the poor sod in the data center who, at 03:00 AM, has to ring the on-call engineer out of bed because the production server went SNAFU.

    And when the on-call engineer then has to rely on "AI" to fix the mess, because he didn't actually write the code himself and really doesn't even know the codebase any more (or, even worse, doesn't understand the libraries and language used at all, because he is completely reliant on the LLM doing that for him), companies, and their customers, will be in real trouble. It will be the IT equivalent of attorneys showing up in court with papers containing case references that were hallucinated by some LLM.

  • Have you tried it? In my experience they just go off on a hallucination loop, or blow up the code base with terrible re-implementations.

    Similarly, Claude 3.5 was stuck on TensorRT 8, and not even pointing it at the documentation for the updated TensorRT 10 APIs via RAG could get it to correctly use the new APIs (not that they were very complex: bind tensors, execute, retrieve results). The whole concept of the self-reinforcing agent loop is more of a fantasy. I think someone else likened it to a lawnmower that will rampage over your flower bed at the first hiccup.

    • Yes, they're part of my daily toolset. And yes, they can spin out. I just hit the "reject" button when they do, and revise my prompt. Or, sometimes, I just take over and fill in some of the structure of the problem I'm trying to solve myself.

      I don't know about "self-reinforcing". I'm just saying: coding agents compile and lint the code they're running, and when they hallucinate interfaces, they notice. The same way any developer who has ever used ChatGPT knows that you can paste most errors into the web page and it will often (maybe even usually) come up with an apposite fix. I don't understand how anybody expects to convince LLM users this doesn't work; it obviously does work.

      26 replies →

    • > I think someone else likened it to a lawnmower that will run rampage over your flower bed at the first hiccup

      This reminds me of a scene from the recent animated movie "Wallace and Gromit: Vengeance Most Fowl", where Wallace actually uses a robot (Norbot) to do gardening tasks, and it rampages over Gromit's flower bed.

      https://youtu.be/_Ha3fyDIXnc

    • I mean, I have. I use them every day. You often see them literally saying "Oh there is a linter error, let me go fix it" and then a new code generation pass happens. In the worst case, it does exactly what you are saying, gets stuck in a loop. It eventually gets to the point where it says "let me try just once more" and then gives up.

      And when that happens I review the code and if it is bad then I "git revert". And if it is 90% of the way there I fix it up and move on.

      The question shouldn’t be "are they infallible tools of perfection". It should be "do I get value equal to or greater than the time/money I spend". And if you use git appropriately you lose at most five minutes on an agent looping. And that happens a couple of times a week.

      And be honest with yourself: is getting stuck in a loop fighting a compiler, type-checker or linter something you have ever experienced in your pre-LLM days?

    • Have you tried it? More than once?

      I’m getting massive productivity gains with Cursor and Gemini 2.5 or Claude 3.7.

      One-shotting whole features into my rust codebase.

      3 replies →

  • There’s an argument that library authors should consider implementing those hallucinated functions, not because it’ll be easier for LLMs but because the hallucination is a statement about what an average user might expect to be there.

    I really dislike libraries that have their own bespoke ways of doing things for no especially good reason. Don’t try to be cute. I don’t want to remember your specific API, I want an intuitive API so I spend less time looking up syntax and more time solving the actual problem.

    • There's also an argument that developers of new software, including libraries, should consider making an earnest attempt to do The Right Thing instead of re-implementing old, flawed designs and APIs for familiarity's sake. We have enough regression to the mean already.

      The more LLMs are entrenched and required, the less we're able to do The Right Thing in the future. Time will be frozen, and we'll be stuck with the current mean forever. LLMs are notoriously bad at understanding anything that isn't mappable in some way to pre-existing constructs.

      > for no especially good reason

      That's a major qualifier.

      1 reply →

  • That sort of "REPL" system is why I really liked when they integrated a Python VM into ChatGPT - it wasn't perfect, but it could at least catch itself when the code didn't execute properly.

    • Sure. But it's 2025 and however you want to get this feature, be it as something integrated into VSCode (Cursor, Windsurf, Copilot), or a command line Python thing (aider), or a command line Node thing (OpenAI codex and Claude Code), with a specific frontier coding model or with an abstracted multi-model thingy, even as an Emacs library, it's available now.

      I see people getting LLMs to generate code in isolation and like pasting it into a text editor and trying it, and then getting frustrated, and it's like, that's not how you're supposed to be doing it anymore. That's 2024 praxis.

      4 replies →

  • This has been my experience with any LLM I use as a code assistant. Currently I mostly use Claude 3.5, although I sometimes use Deepseek or Gemini.

    The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tend to be. The more you deviate from mainstream paths, the more you will hit such problems.

    Which is why I find them most useful to help me build things when I am very familiar with the subject matter, because at that point I can quickly spot misconceptions, errors, bugs, etc.

    That's when it hits the sweet spot of being a productivity tool, really improving the speed with which I write code (and sometimes the quality of what I write, by incorporating good practices I was unaware of).

    • > The more prominent and widely used a language/library/framework, and the more "common" what you are attempting, the more accurate LLMs tend to be. The more you deviate from mainstream paths, the more you will hit such problems.

      One very interesting variant of this: I've been experimenting with LLMs in a react-router based project. There's an interesting development history where there's another project called Remix, and later versions of react-router effectively ate it; that is, as of December of last year, react-router 7 is effectively also Remix v3: https://remix.run/blog/merging-remix-and-react-router

      Sometimes, the LLM will be like "oh, I didn't realize you were using remix" and start importing from it, when I in fact want the same imports, but from react-router.

      All of this happened so recently, it doesn't surprise me that it's a bit wonky at this, but it's also kind of amusing.

      1 reply →

    • In addition to choosing languages, patterns and frameworks that the LLM is likely to be well trained in, I also just ask it how it wants to do things.

      For example, I don't like ORMs. There are reasons which aren't super important but I tend to prefer SQL directly or a simple query builder pattern. But I did a chain of messages with LLMs asking which would be better for LLM based development. The LLM made a compelling case as to why an ORM with a schema that generated a typed client would be better if I expected LLM coding agents to write a significant amount of the business logic that accessed the DB.

      My dislike of ORMs is something I hold lightly. If I was writing 100% of the code myself then I would have breezed past that decision. But with the agentic code assistants as my partners, I can make decisions that make their job easier from their point of view.

  >Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions, which requires documentation deep dives to confirm, which became annoying.

    Funnily enough I was trying to deal with some lesser used parts of pandas via LLM and it kept sending me back through a deprecated function for everything. It was quite frustrating.

    • This is because the training data for pandas code is not great. It is a lot of non programmers banging keys until it works or a bunch of newbie focused blog posts that endorse bad practices.

      1 reply →

  • > the program doesn't compile

    How does this even make sense when the "agent" is generating Python? There are several ways it can generate code that runs, and even does the thing, and still has severe issues.

  • Are you implying that you can actually let agents run loose to autonomously fix things without just creating a mess? Because that's not a thing that you can really do in real life, at least not for anything but the most trivial tasks.

  • > the program doesn't compile

    The issue you are addressing refers specifically to Python, which is not compiled... Are you referring to this workflow in another language, or by "compile" do you mean something else, such as using static checkers or tests?

    Also, what tooling do you use to implement this workflow? Cursor, aider, something else?

    • Python is, in fact, compiled (to bytecode, not native code); while this is mostly invisible, syntax errors will cause it to fail to compile, but the circumstances described (hallucinating a function) will not, because function calls are resolved by runtime lookup, not at compile time.
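
      A quick way to see the distinction (an illustrative snippet, not from the thread):

        import py_compile, tempfile

        hallucinated = "import math\nmath.fast_sqrt(2)\n"  # no such function, but it compiles
        syntax_error = "def f(:\n    pass\n"                # rejected at compile time

        for label, src in [("hallucinated call", hallucinated), ("syntax error", syntax_error)]:
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(src)
            try:
                py_compile.compile(f.name, doraise=True)
                print(label, "-> compiles fine; fails only when run")
            except py_compile.PyCompileError:
                print(label, "-> caught at compile time")

      An agent that also runs a linter or executes the code will still catch the hallucinated call, just later than a compile step would.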

      4 replies →

Re vibe coding, I agree with your comments, but where I've used it is when I needed to mock up a UI or a website. I have no front-end experience, so making an 80% (probably 20%) but live demo is still a valuable thing to show to others to get the point across, obviously not to deploy. It's a replacement for drawing a picture of what I think the UI should look like. I feel like this is an under-appreciated use. LLM coding is not remotely ready for real products but it's great for mock-ups that further internal discussions.

  • Same. As somebody who doesn't really enjoy frontend work at all, they are surprisingly good at being able to spit out something that is relatively visually appealing - even if I'll end up rewriting the vast majority of react spaghetti code in Svelte.

    • I love front-end work, and I'm really good at it, but I now let the AI do CSS coding for me. It seems to make nice looking buttons and other design choices that are good enough for development. My designer has their own opinions, so they always will change the CSS when they get their hands on it, but at least I'm not wasting my time creating really ugly styles that always get replaced anymore. The rest of the coding is better if I do it, but sometimes the AI surprises me - though most often it gets it completely wrong, and then I'm wasting time letting it try and that just feels counterproductive. It's like a really stupid intern that almost never pays attention to what the goal is.

    • In the settings for Claude, I tell it to use Svelte and TypeScript whenever possible because I got tired of telling it I don't use React.

    • They're pretty good at following direction. For example you can say:

      'Use React, typescript, materialUi, prefer functions over const, don't use unnecessary semicolons, 4 spaces for tabs, build me a UI that looks like this sketch'

      And it'll do all that.

    • If I need a quick mockup I’ll do it in react

      But if I have time I’ll ask for it to be built using best practices from the programming language it’ll be built in as the final product whether that’s svelte, static astro, or even php

    • I’m not even seeing this React spaghetti code you’re referring to. What I get is pretty impressive front ends with what looks like concise, well-organized code.

  • Tried it recently and it seems rather hit or miss. Claude 3.7 was able to generate the right business logic but the UI didn't look right at all and I couldn't fix it with further iterations. I'm no frontend dev at all and didn't know how to fix it properly, but at some point I noticed that it tried to use flexbox even though I asked for a grid. I pasted the thing into Gemini, said "rewrite it using CSS grid" and it came out correct. Maybe Gemini could have one-shotted it, from my limited trials it seems much better than Claude.

  • I find Figma and the like quite annoying when I have to manually manage state.

    I'd mostly rather adjust stuff in dev tools, take a screenshot of the existing system, then adjust anything else in MS Paint.

    I think LLMs will be useful for getting mockups aligned with an existing system more quickly.

  • I think it would be faster/easier to use a website builder or various templating libraries to build a quick UI rather than having to babysit an LLM with prompts over and over again.

> However, for more complex code questions particularly around less popular libraries which have fewer code examples scraped from Stack Overflow and GitHub, I am more cautious of the LLM’s outputs.

That's changed for me in the past couple of months. I've been using the ChatGPT interface to o3 and o4-mini for a bunch of code questions against more recent libraries and finding that they're surprisingly good at using their search tool to look up new details. Best version of that so far:

"This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it."

This actually worked! https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

The other trick I've been using a lot is pasting the documentation or even the entire codebase of a new library directly into a long context model as part of my prompt. This works great for any library under about 50,000 tokens total - more than that and you usually have to manually select the most relevant pieces, though Gemini 2.5 Pro can crunch through hundreds of thousands of tokens pretty well without getting distracted.

Here's an example of that from yesterday: https://simonwillison.net/2025/May/5/llm-video-frames/#how-i...

  • I think they might have made a change to Cursor recently as well. A few times I've caught it using an old API of popular libraries that have since updated. Shout out to all the library developers who log deprecations and known incorrect usages (sketched at the end of this comment); that has been a huge win with LLMs. In most cases I can paste the deprecation warning back into the agent and it will say "Oh, looks like that API changed in vX.Y.Z, we should be doing <other thing>, let me fix that ..."

    So it is capable of integrating new API usage, it just isn't a part of the default "memory" of the LLM. Given how quickly JS libraries tend to change (even on the API side) that isn't ideal. And given that the typical JS server project has dozens of libs, including the most recent documentation for each is not really feasible. So for now, I am just looking out for runtime deprecation errors.

    But I give the LLM some slack here, because even if I were programming myself using a library I've used in the past, I'd be likely to make the same mistake.
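
    The kind of library-side warning referred to above looks roughly like this (a generic sketch, not any particular library's code; `fetch` and `timeout_ms` are made up):

      import warnings

      def fetch(url, *, timeout=None, timeout_ms=None):
          # hypothetical deprecated parameter; the warning text is what gets pasted back to the agent
          if timeout_ms is not None:
              warnings.warn(
                  "timeout_ms is deprecated since v2.0 and will be removed; use timeout (seconds) instead",
                  DeprecationWarning,
                  stacklevel=2,
              )
              timeout = timeout_ms / 1000
          ...  # actual request logic elided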

  • I’ve had similar experiences in recent weeks, enabling search in ChatGPT with o4-mini-high to fix previously insurmountable hurdles I’d run into wrt breaking changes in libraries that have occurred outside the model cutoff dates, and for which the errors and fixes are non-obvious. It’s worked far better than I’d expected for what feels like such a simple UI toggle.

> Over the years I’ve utilized all the tricks to get the best results out of LLMs.

The poster tips their hand early in the article and I can see there won't be much substance here. I work on writing prompts for production solutions that use LLMs to QA various text inputs that would be very hard to do using traditional NLP techniques. Good prompt engineering has very little to do with thinking up ridiculous scenarios to "trick" the LLM into being better. Those are actually counterproductive because their efficacy can vary widely across model versions.

I like that the author included the chat logs. I know there are a lot of times when people can't share them because they'd expose too much info, but I really think it's important, when people make big claims about what they've gotten an LLM to do, that they back it up.

> I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary.

Yes, I also often use the "studio" of each LLM for better results, because in my experience OpenAI "nerfs" models in the ChatGPT UI: models keep forgetting things (probably a limited context length set by OpenAI to reduce costs), the model is generally less chatty (again, probably to reduce their costs), etc. But I've noticed Gemini 2.5 Pro is the same in the studio and the Gemini app.

> Any modern LLM interface that does not let you explicitly set a system prompt is most likely using their own system prompt which you can’t control: for example, when ChatGPT.com had an issue where...

ChatGPT does have system prompts but Claude doesn't (one of its many, many UI shortcomings which Anthropic never addressed).

That said, I've found system prompts less and less useful with newer models. I can simply preface my own prompt with the instructions and the model follows them very well.

> Specifying specific constraints for the generated text such as “keep it to no more than 30 words” or “never use the word ‘delve’” tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com.

I get that LLMs have a vague idea of how long 30 words is, but they never do a good job on these tasks for me.
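
For reference, this is what putting such a constraint into an explicit system prompt looks like through an API rather than a chat UI (whether or not the model actually respects the word count); a minimal sketch using the Anthropic Python SDK, with the model id and wording as placeholders:

  import anthropic

  client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

  msg = client.messages.create(
      model="claude-3-5-sonnet-latest",  # placeholder model id
      max_tokens=200,
      # the constraint lives in the system prompt, separate from the user message
      system="Keep every response to no more than 30 words. Never use the word 'delve'.",
      messages=[{"role": "user", "content": "Summarize why explicit system prompts are useful."}],
  )
  print(msg.content[0].text)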

>Ridiculous headline implying the existence of non-generative LLMs

>Baited into clicking

>Article about generative LLMs

>It's a buzzfeed employee

  • The reason I added "generative" to the headline is the post's mention of the important use of text embedding models, which are indeed non-generative LLMs; I did not want to start a semantics war by not explicitly specifying "generative LLMs" (the headline would flow better without it).

    • Oooh, that makes more sense. To be honest, I only started reading the article with the sole mission of answering the question: what the heck is a non-generative LLM? I had never heard of that. I thought it would be answered explicitly in the first few sentences. As sentence after sentence passed and it didn't even seem to acknowledge the tension of that elephant in the room, my frustration grew beyond what I could concentrate on, and I found it difficult to continue reading, assuming I'd been clickbaited by slop or something. But now that I have that context it makes it a lot more readable.

> I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality

Hey Max, do you use a custom wrapper to interface with the API or is there some already established client you like to use?

If anyone else has a suggestion please let me know too.

  • I'm going to plug my own LLM CLI project here: I use it on a daily basis now for coding tasks like this one:

    llm -m o4-mini -f github:simonw/llm-hacker-news -s 'write a new plugin called llm_video_frames.py which takes video:path-to-video.mp4 and creates a temporary directory which it then populates with one frame per second of that video using ffmpeg - then it returns a list of [llm.Attachment(path="path-to-frame1.jpg"), ...] - it should also support passing video:video.mp4?fps=2 to increase to two frames per second, and if you pass ?timestamps=1 or &timestamps=1 then it should add a text timestamp to the bottom right conner of each image with the mm:ss timestamp of that frame (or hh:mm:ss if more than one hour in) and the filename of the video without the path as well.' -o reasoning_effort high

    Any time I use it like that the prompt and response are logged to a local SQLite database.

    More on that example here: https://simonwillison.net/2025/May/5/llm-video-frames/#how-i...

    • Seconded, everyone should be using CLIs.

      Simon: while that example is impressive, it is also complicated and hard to read in an HN comment.

  • I was developing an open-source library for interfacing with LLMs agnostically (https://github.com/minimaxir/simpleaichat) and although it still works, I haven't had the time to maintain it unfortunately.

    Nowadays for writing code to interface with LLMs, I don't use client SDKs unless required, instead just hitting HTTP endpoints with libraries such as requests and httpx. It's also easier to upgrade to async if needed.
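
    As a minimal sketch of that style (assuming an OpenAI-compatible endpoint, an API key in the environment, and a placeholder model name):

      import os
      import httpx

      resp = httpx.post(
          "https://api.openai.com/v1/chat/completions",
          headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
          json={
              "model": "gpt-4o-mini",
              "messages": [
                  {"role": "system", "content": "You are a terse assistant."},
                  {"role": "user", "content": "What is a text embedding?"},
              ],
              "temperature": 0.0,
          },
          timeout=60.0,
      )
      resp.raise_for_status()
      print(resp.json()["choices"][0]["message"]["content"])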

  • Most services have a "studio mode" for the models they serve.

    As an alternative you could always use OpenWebUI

  • I built an open source CLI coding agent for this purpose[1]. It combines Claude/Gemini/OpenAI models in a single agent, using the best/most cost effective model for different steps in the workflow and different context sizes. You might find it interesting.

    It uses OpenRouter for the API layer to simplify use of APIs from multiple providers, though I'm also working on direct integration of model provider API keys—should release it this week.

    1 - https://github.com/plandex-ai/plandex

I've tried it out a ton, but the only thing I end up using it for these days is teaching me new things (which I largely implement myself; it can rarely one-shot it anyway), or occasionally making short throwaway scripts for things like file handling or ffmpeg.

This was an interesting quote from the blog post: "There is one silly technique I discovered to allow a LLM to improve my writing without having it do my writing: feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post."

  • I do a good deal of my blog posts while walking my husky and just dictating using speech-to-text on my phone. The problem is that it's an unformed blob of clay and really needs to be shaped on the wheel.

    I then feed this into an LLM with the following prompt:

      You are a professional editor. You will be provided paragraphs of text that may 
      contain spelling errors, grammatical issues, continuity errors, structural 
      problems, word repetition, etc. You will correct any of these issues while 
      still preserving the original writing style. Do not sanitize the user. If they 
      use profanities in their text, they are used for emphasis and you should not 
      omit them. 
    
      Do NOT try to introduce your own style to their text. Preserve their writing 
      style to the absolute best of your ability. You are absolutely forbidden from 
      adding new sentences.
    
    

    It's basically Grammarly on steroids and works very well.

  • I do something similar. But I make sure the LLM doesn't know I wrote the post. That way the LLM is not sycophantic.

  • What roleplayed feedback providers have people had best and worst luck with? I can imagine asking for the personality could help the LLM come up with different kinds of criticisms...

JSON responses don’t always work as expected unless you have only a few items to return. In Max’s example it’s classification.

For anyone trying to return consistent JSON, check out structured outputs, where you define a JSON schema with required fields; that will return the same structure every time (sketched below).

I have tested it with great success using GPT-4o-mini.
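
A minimal sketch of what that looks like with the OpenAI Python SDK; the schema fields here are just an example classification payload:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  schema = {
      "name": "classification",
      "strict": True,
      "schema": {
          "type": "object",
          "properties": {
              "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
              "confidence": {"type": "number"},
          },
          "required": ["label", "confidence"],
          "additionalProperties": False,
      },
  }

  resp = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user", "content": "Classify: 'slow service, but the food was great'"}],
      response_format={"type": "json_schema", "json_schema": schema},
  )
  print(resp.choices[0].message.content)  # conforms to the schema when strict mode is honored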

As a data scientist, this mirrors my experience. Prompt engineering is surprisingly important for getting the expected output - and LLM POCs have quick turnaround times.

"To that end, I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary."

How do you do this? Do you have to be on a paid plan for this?

It's a shame about the title of this HN post, as the article itself is a really interesting exploration of the use of machine learning in news media.

I nearly skipped it as it sounded like the "I'm too good for LLMs" engagement-bait that plagues LinkedIn at the moment.

Emmm... why has Claude 'improved' the code by setting SQLite to be threadsafe and then adding locks on every db operation? (You can argue that maybe the callbacks are invoked from multiple threads, but they are not thread safe themselves).

  • Interns don't understand concurrency either.

    • Listen, I can fail to understand your questions for half the price you're paying the LLM.

    • But if you teach them the right way to do it today and have them fix it, they won't go and do it the wrong way again tomorrow and the next day and every day for the rest of the summer.

      1 reply →

"ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post"

Now tell me again AI isn't coming for our jobs.

Your experience mirrors my use cases. There are certainly a ton of great uses for LLMs. Vibe coding takes way less vibe and way more structure to be successful.

>> I’ve asked LLMs to help me write regular expressions

Maybe if we had a more sane regex syntax, LLMs wouldn't be needed to construct them.

This article reads like "I'm not like other LLM users" tech writing. There are good points about when LLMs are actually useful vs. overhyped, but the contrarian framing undermines what could have been straightforward practical advice. The whole "I'm more discerning than everyone else" positioning gets tiresome in tech discussions, especially when the actual content is useful.

  • I was not explicitly intending to be contrarian, but unfortunately the contrarian framing is inevitable when the practical advice is counterintuitive and against modern norms. I was second-guessing publishing this article at all because "I don't use ChatGPT.com" and "I don't see a use for Agents/MCP/Vibe coding" are both statements that are potentially damaging to my career as an engineer, but there's no point in writing if I can't be honest.

    Part of the reason I've been blogging about LLMs for so long is that a lot of it is counterintuitive (which I find interesting!) and there's a lot of misinformation and many suboptimal workflows that result from it.

    • > "I don't use ChatGPT.com" and "I don't see a use for Agents/MCP/Vibe coding" are both statements that are potentially damaging to my career as an engineer

      This is unfortunate, though I don't blame you. Tech shouldn't be about blind faith in any particular orthodoxy.

    • Tone and word choice is actually the problem here. :)

      One example: “normal-person frontends” immediately makes the statement a judgement about people. You could have said regular, typical, or normal instead of “normal-person”.

      Saying your coworkers often come to you to fix problems and your solutions almost always work can come off as saying you’re more intelligent than your coworkers.

      The only context your readers have are the words you write. This makes communication a damned nuisance because nobody knows who you are and they only know about you from what they read.

      1 reply →

    • Your defense of the contrarian framing feels like it's missing the point. What you're describing as "counterintuitive" is actually pretty standard for anyone who's been working deeply with LLMs for a while.

      Most experienced LLM users already know about temperature controls and API access - that's not some secret knowledge. Many use both the public vanilla frontends and specialized interfaces (various HF workflows, custom setups, sillytavern, oobabooga (̵r̵i̵p̵)̵, ollama, lmstudio, etc) depending on the task.

      Your dismissal of LLMs for writing comes across as someone who scratched the surface and gave up. There's an entire ecosystem of techniques for effectively using LLMs to assist writing without replacing it - from ideation to restructuring to getting unstuck on specific sections.

      Throughout the article, you seem to dismiss tools and approaches after only minimal exploration. The depth and nuance that would be evident to anyone who's been integrating these tools into their workflow for the past couple years is missing.

      Being honest about your experiences is valuable, but framing basic observations as contrarian insights isn't counterintuitive - it's just incomplete.

      4 replies →

If you are using LLMs for anything more than generating boilerplate, you are usually wasting your time.

  • Food for thought, a snippet from a highly specialized project I created two months ago:

    https://gist.github.com/eugene-yaroslavtsev/c9ce9ba66a7141c5...

    I spent several hours searching online for existing solutions - couldn't find anything (even when exploring the idea of stitching together multiple different tools, each in a different programming language).

    This took me ~3-4 hours end-to-end. I haven't seen any other OSS code that is able to handle converting unstructured JSON into normalized, structured JSON with a schema, while also using a statistical sampling sliding-window method to handle all of these:

    - speculative SIMD prediction of end of current JSON entry
    - distinguishing whether two "similar" looking objects represent the same model or not
    - normalizing entities based on how often they're referenced
    - ~5-6 GB/s throughput on a Macbook M4 Max 24GB
    - arbitrary horizontal scaling (though shared entity/normalization resource contention may eventually become an issue)

    I didn't write this code. I didn't even come up with all of these ideas in this implementation. I initially just thought "2NF"/"BNF" probably good, right? Not for multi-TB files.

    This was spec'd out by chatting with Sonnet for ~1.5 hours. It was the one that suggested statistical normalization. It suggested using several approaches for determining whether two objects are the same schema (that + normalization were where most of the complexity decided to live).

    I did this all on my phone. With my voice.

    I hope more folks realize this is possible. I strongly encourage you and others reconsider this assumption!

    • The snippet you shared is consistent with the kind of output I have also been seeing out of LLMs: it looks correct overall, but contains mistakes and code quality problems, both of which would need human intervention to fix.

      For example, why is the root object's entityType being passed to the recursive mergeEntities call, instead of extracting the field type from the propSchema?

      Several uses of `as` (as well as repeated `result[key] === null` tests) could be eliminated by assigning `result[key]` to a named variable.

      Yes, it's amazing that LLMs have reached the level where they can produce almost-correct, almost-clean code. The question remains of whether making it correct and clean takes longer than writing it by hand.

... but when I do, I let it write regex, SQL commands, simple/complex if else stuff, apply tailwind classes, feed it my console log errors, propose frontend designs ... and other little stuff. Saves brain power for the complex problems.

> "feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post."

It feels weird to write something positive here...given the context...but this is a great idea. ;)

  • This is the kind of task that, before LLMs, I just wouldn't have done. Maybe if it was something really important I'd circulate it to a couple of friends to get rough feedback, but mostly I'd just let it fly. I think it's pretty revolutionary to be able to get some useful feedback in seconds, with a similar knock-on effect in the pull request review space.

    The other thing I find LLMs most useful for is work that is simply unbearably tedious. Literature reviews are the perfect example of this - Sure, I could go read 30-50 journal articles, some of which are relevant, and form an opinion. But my confidence level in letting the AI do it in 90 seconds is reasonable-ish (~60%+) and 60% confidence in 90 seconds is infinitely better than 0% confidence because I just didn't bother.

    A lot of the other highly hyped uses for LLMs I personally don't find that compelling - my favorite uses are mostly like a notebook that actually talks back, like the Young Lady's Illustrated Primer from Diamond Age.

    • > But my confidence level in letting the AI do it in 90 seconds is reasonable-ish (~60%+) and 60% confidence in 90 seconds is infinitely better than 0% confidence because I just didn't bother.

      So you got the 30 to 50 articles summarized by the LLM; now how do you know which 60% you can trust and what’s hallucinated without reading them? It’s hardly usable at all unless you already know what is real and what is not.

      1 reply →

While I think the title is misleading/clickbaity (no surprise given the BuzzFeed connection), I'll say that the substance of the article might be one of the most honest takes on LLMs I've seen from someone who actually works in the field. The author describes exactly how I use LLMs - strategically, for specific tasks where they add value, not as a replacement for actual thinking.

What resonated most was the distinction between knowing when to force the square peg through the round hole vs. when precision matters. I've found LLMs incredibly useful for generating regex (who hasn't?) and solving specific coding problems with unusual constraints, but nearly useless for my data visualization work.

The part about using Claude to generate simulated HN criticism of drafts is brilliant - getting perspective without the usual "this is amazing!" LLM nonsense. That's the kind of creative tool use that actually leverages what these models are good at.

I'm skeptical about the author's optimism regarding open-source models though. While Qwen3 and DeepSeek are impressive, the infrastructure costs for running these at scale remain prohibitive for most use cases. The economics still don't work.

What's refreshing is how the author avoids both the "AGI will replace us all" hysteria and the "LLMs are useless toys" dismissiveness. They're just tools - sometimes useful, sometimes not, always imperfect.

  • Just on the point about the prohibitive infrastructure costs at scale: why does it need to be at scale?

    Over a few years, we went from literally impossible to being able to run a 72B model locally on a laptop. Give it 5-10 years and we might not need any infrastructure at all; everything could be served locally with switchable (and differently sized) open-source models.

    • Exactly, open source doesn't need scale in terms of user base. You can easily run inference with hundreds-of-billions-of-parameter models on pay-as-you-go infra for a few dollars, or just build a commodity rig for a few thousand. That is affordable for SMEs or even devoted hobbyists. The most important part about open source is the democratization of LLM access.

  • > While Qwen3 and DeepSeek are impressive, the infrastructure costs for running these at scale remain prohibitive for most use cases. The economics still don't work

      dedicated LLM hosting providers like Cerebras and Groq who can actually make money on each user inference query
    

    Cerebras (wafer-scale) and Groq (TPU+) both have inference-optimized custom hardware.

Side topic: I haven’t seen a serious article about prompt engineering for senior software development pop up on HN, yet a lot of users here have their own techniques that they haven’t shared with others.

>Discourse about LLMs and their role in society has become bifurcated enough such that making the extremely neutral statement that LLMs have some uses is enough to justify a barrage of harassment.

Honestly true and I’m sick of it.

A very vocal group of people are convinced AI is a scheme by the evil capitalists to make you train your own replacement. The discussion gets very emotional very quickly because they feel personally threatened by the possibility that AI is actually useful.

  • Should they not feel threatened? I'm somewhat sympathetic to the view that even the current state of the art is threatening to people's livelihood.

    And of course it will only become more powerful. It's a dangerous game.

  • I don't know, this whole thread seems more like people explaining how they spend more time trying to dance around with prompts and ask the LLMs to do a task than actually doing something themselves faster and better.

  • > people are convinced AI is a scheme by the evil capitalists to make you train your own replacement. The discussion gets very emotional very quickly because they feel personally threatened by the possibility that AI is actually useful.

    These are not mutually exclusive. LLMs will train people's replacements while the same people pay for the privilege of training those replacements. LLMs also allow me to auto-complete a huge volume of boilerplate, which would otherwise take me several hours. They also help people step out of writer's block and generate a first draft of a prototype/MVP/POC quickly without wasting long hours bikeshedding. They also helped my previously super-confident cousin -- who blamed me for killing his dreams of the next Airbnb for dogs, Uber for groceries, and Instagram for cats by selfishly hoarding my privileges and knowledge -- to finally build those ideas, kill his own dreams, and he is definitely ignoring/avoiding me these days.

    LLMs are the same as knives: crimes will happen with them, but they are also necessary in the kitchen and in industry.

  • > A very vocal group of people are convinced AI is a scheme by the evil capitalists to make you train your own replacement. The discussion gets very emotional very quickly because they feel personally threatened by the possibility that AI is actually useful

    I read this as you framing it as though it were irrational. However, history is littered with examples of capitalists replacing labour with automation and using any productivity gains of the new technology to push salaries lower.

    Of course people who see this playing out again feel personally threatened. If you aren't feeling personally threatened, you are either part of the wealthy class or you somehow think this time will be different.

    You may be thinking "Even if I lose my job to automation, there will be other work to do like piloting the LLMs", but you should know that the goal is to eventually pay LLM operators peanuts in comparison to what you currently make in whatever role you do

    • Setting up automation as the enemy is an odd thing for programmers to be doing. I mean, if you’re a programmer and you’re not automating away tasks, both for yourself and other people, what are you even doing?

      Also, “this time it’s different” depends on the framing. A cynical programmer who has seen new programming tools hyped too many times would make a different argument: at the beginning of the dot-com era, you could get a job writing HTML pages. That’s been automated away, so you need more skill now. It hasn’t resulted in fewer software-engineering jobs so far.

      But that’s not entirely convincing either. Predicting the future is difficult. Sometimes the future is different. Making someone else’s scenario sound foolish won’t actually rule anything out.

      1 reply →

    • >history is littered with examples of capitalists replacing labour with automation and using any productivity gains of new technology to push salaries lower

      Nonsense. We make far, far more than people did in the past entirely because of the productivity gains from automation.

      The industrial revolution led to the biggest increase in quality of life in history, not in spite of but because it automated 90% of jobs. Without it we'd all still be subsistence farmers.

      5 replies →