Comment by anxoo
6 days ago
name 5 tasks which you think current AIs can't do. then go and spend 30 minutes seeing how well current AIs do on them. write it on a sticky note and put it somewhere that you'll see it.
otherwise, yes, you'll continue to be irritated by AI hype, maybe up until the point where our civilization starts going off the rails
Well, I'll try to write that sticky note here:
- they aren't aware of the latest changes in the frameworks I use, so they force me to use older, sometimes less efficient features
- they fail at clean DRY practices even though they can supposedly skim through the codebase much faster than me
- they bait me into nonexistent APIs, or hallucinate solutions or issues
- they cannot properly pick the context and the files to read in a mid-size app
- they suggest downloading random packages, sometimes low-quality or unmaintained ones
"they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient"
That's mostly solved by the most recent ones that can run searches. I've had great results from o4-mini for this, since it can search for the latest updates - example here: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.
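In practice that's just "concatenate the source files and prepend your question". Here's a minimal sketch of the idea - assuming the google-genai Python SDK, a GEMINI_API_KEY environment variable, and a placeholder path for whatever library you're dumping in:

    # Minimal sketch: dump a library's source into one long-context prompt.
    # The SDK, model name, and path below are assumptions/placeholders.
    import os
    from pathlib import Path

    from google import genai

    def gather_source(root, suffixes=(".py", ".md")):
        """Concatenate every matching file under root into one labelled blob."""
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in suffixes:
                parts.append("### %s\n%s" % (path, path.read_text(errors="ignore")))
        return "\n\n".join(parts)

    library_dump = gather_source("path/to/library/src")  # placeholder path
    question = "Using only the APIs shown above, show me how to do X."

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-pro",  # long-context model; exact name may differ
        contents=library_dump + "\n\n" + question,
    )
    print(response.text)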
"they fail at doing clean DRY practices" - tell them to DRY in your prompt.
"they bait me into inexisting apis, or hallucinate solutions or issues" - really not an issue if you're actually testing your code! I wrote about that one here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/ - and if you're using one of the systems that runs your code for you (as promoted in tptacek's post) it will spot and fix these without you even needing to intervene.
"they cannot properly pick the context and the files to read in a mid-size app" - try Claude Code. It has a whole mechanism dedicated to doing just that, I reverse-engineered it this morning: https://simonwillison.net/2025/Jun/2/claude-trace/
"they suggest to download some random packages, sometimes low quality ones, or unmaintained ones" - yes, they absolutely do that. You need to maintain editorial control over what dependencies you add.
Thanks for the links. You mentioned two models in your posts; how should I proceed? I can't possibly pay for two subscriptions. Do you have a suggestion for which one is better to use?
> Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.
See, as someone who is actually receptive to the argument you are making, sometimes you tip your hand and say things that I know are not true. I work with Gemini 2.5 a lot, and while yeah, it theoretically has a large context window, it falls over pretty fast once you get past 2-3 pages of real-world context.
> "they fail at doing clean DRY practices" - tell them to DRY in your prompt.
Likewise here. Simply telling a model to be concise has some effect, to be sure, but it's not a panacea. I tell the latest models do do all sorts of obvious things, only to have them turn around and ignore me completely.
In short, you're exaggerating. I'm not sure why.
Those aren't tasks.
> they aren't aware of the latest changes in the frameworks I use, so they force me to use older, sometimes less efficient features
This is where collaboration comes into play. If you solely rely on the LLM to “vibe code” everything, then you’re right, you get whatever it thinks is best at the time of generation. That could be wrong or outdated.
My workflow is to first provide clear requirements, generally one objective at a time. Sometimes I use an LLM to format the requirements for the LLM to generate code from. It then writes some code, and I review it. If I notice something is outdated I give it a link to the docs and tell it to update it using X. A few seconds later it’s made the change. I did this just yesterday when building out an integration with an API. Claude wrote the code using a batch endpoint because the streaming endpoint was just released and I don’t think it was aware of it.
My role in this collaboration is to be aware of what’s possible and how I want it to work (e.g., being aware of the latest features and updates of the frameworks and libraries). Then it’s just about prompting and directing the LLM until it works the way I want. When it’s really not working, then I jump in.
they aren't aware of the latest changes in the frameworks I use, so they force me to use older, sometimes less efficient features
of course they can, teach them / feed them the latest changes or whatever you need (much like another developer unaware of the same thing)
they fail at clean DRY practices even though they can supposedly skim through the codebase much faster than me
tell them it is not DRY until they make it DRY. for some projects (several I’ve been involved with) DRY is generally an anti-pattern when taken to extremes (abstraction gone awry etc…). instruct it on what you expect and watch it deliver (much like you would another developer…)
they bait me into nonexistent APIs, or hallucinate solutions or issues
tell it when it hallucinates, it’ll correct itself
they cannot properly pick the context and the files to read in a mid-size app
provide it with context (you should always do this anyways)
they suggest downloading random packages, sometimes low-quality or unmaintained ones
tell it about it, it will correct itself
Anecdotally, ChatGPT still struggles with its own API. It keeps switching between different versions of its API and hallucinates API parameters, even when I force-feed the official documentation into the context (to be fair, the documentation is straight-up awful). Sometimes it totally refuses to change its basic assumptions, so I have to blow up the context just to make it use the up-to-date API correctly.
LLMs are stupid - nothing magic, nothing great. They’re just tools. The problem with the recent LLM craze is that people make too many statements that are obviously only partially true.
> tell it when it hallucinates, it’ll correct itself
no it doesn't. Are you serious?
> - they bait me into nonexistent APIs, or hallucinate solutions or issues
yes. this happens to me almost every time i use it. I feel like a crazy person reading all the AI hype.
I have definitely noticed these as well. Have you ever tried prompting these issues away? I'm thinking this might be a good list to add to every coding prompt
They also can’t hold copyright on their creations.
The problem with AI hype is not really about whether a particular model can - in the abstract - solve a particular programming problem. The problem with AI hype is that it is selling a future where all software development companies become entirely dependent on closed systems.
All of the state-of-the-art models are online models - you have no choice, you have to pay for a black box subscription service controlled by one of a handful of third-party gatekeepers. What used to be a cost center that was inside your company is now a cost center outside your company, and thus it is a risk to become dependent on it. Perhaps the risk is worthwhile, perhaps not, but the hype is saying that real soon now it will be impossible to not become dependent on these closed systems and still exist as a viable company.
> name 5 tasks which you think current AIs can't do.
For coding, it seems to back itself into a corner and never recover from it until I "reset" it.
AI can't write software without an expert guiding it. I cannot open a non-trivial PR to postgres tonight using AI.
"AI can't write software without an expert guiding it. I cannot open a non-trivial PR to postgres tonight using AI."
100% true, but is that really what it would take for this to be useful today?
1. create a working (moderately complex) Ghidra script without hallucinating.
Granted, I was trying to do this 6 months ago, but maybe a miracle has happened. But in the past I had very bad experiences using LLMs for niche things (i.e. things that were never mentioned on Stack Overflow)
I've never heard of Ghidra before but, in case you're interested, I ran that prompt through OpenAI's o3 and Anthropic's Claude Opus 4 for you just now (both of them the latest/greatest models from those vendors and new as of less than six months ago) - results here: https://chatgpt.com/share/683e3e38-cfd0-8006-9e49-2aa799dac4... and https://claude.ai/share/7a076ca1-0dee-4b32-9c82-8a5fd3beb967
I have no way of evaluating these myself so they might just be garbage slop.
The first one doesn't seem to actually give me the script, so I can't test it.
The second one didn't work for me without some code modification (specifically, the "count code blocks" didn't work), but the results were... not impressive.
It starts by ignoring every function that begins with "FUN_" on the basis that it's "# Skip compiler-generated functions (optional)". Sorry, but those functions aren't compiler-generated functions, they're functions that lack any symbol names, which in Ghidra terms is pretty damn common if you're reverse engineering unsymbolized code. If anything, it's the opposite of what you would want, because the named functions are the ones I've already looked at and thus give less of a guideline for interesting ones to look into next.
Looking at the results in a project I had open, it's supposed to be skipping external functions, but virtually all the top xrefs are external functions.
Finally, as a "moderately complex" script... it's not a good example. The only thing that approaches that complexity is trying to count basic blocks in a function--something that actually engages with the code model of Ghidra--but that part is broken, and I don't know Ghidra well enough to fix it. Something that would be more along the lines of "moderately complex" to me would be (to use a use case I actually have right now) turning a constant into a reference to that offset in the assumed data segment. Or finding all the switch statements that Ghidra failed to decompile!
Everyone keeps thinking AI improvement is linear. I don't know if this is correct, but my basic impression is that the current AI boost came from moving past the limits of the CPU and its throughput and tapping the massive amount of computing power in graphics cards.
But for each nine of reliability you want out of LLMs, everyone's assuming the effort grows linearly. I don't think it does. I think it's polynomial at least.
As for your tasks - and maybe it's just because I'm using ChatGPT - I asked it to port sed, something with full open source code availability, tons of examples/test cases, and a fully documented user interface, and I wanted it moved to Java as a library.
And it failed pretty spectacularly. Yeah it got the very very very basic functionality of sed.
Of course it didn't port sed like that. It doesn't matter that it's open source with tons of examples/test cases. It's not going to go read all the code and translate it to a different language. It can pick out what sed's purpose is and build a version of it for you in the language you asked for.
If AI can do anything, why can't I just prompt "Here is sudo access to my laptop, please do all my work for me, respond to emails, manage my household budget, and manage my meetings"?
I've tried everything. I have four AI agents. They still have an accuracy rate of about 50%.
Make me a million dollars
Tell me about this specific person who isn't famous
Create a facebook clone
Recreate Windows including drivers
Create a way to transport matter like in Star Trek.
I'll see you in 6 months.