Comment by fire_lake
2 days ago
Which also fits with how it performs at software engineering (in my experience). Great at boilerplate code, tests, simple tutorials, common puzzles but bad at novel and complex things.
This is also why I buy the apocalyptic headlines about AI replacing white collar labor - most white collar employment is creating the same things (a CRUD app, a landing page, a business plan) with a few custom changes.
Not a lot of labor is actually engaged in creating novel things.
The marketing plan for your small business is going to be the same as the marketing plan for every other small business with some changes based on your current situation. There’s no “novel” element in 95% of cases.
I don’t know that most software engineers build toy CRUD apps all day? I have found the state-of-the-art models to be almost completely useless in a real, large codebase. I tried the latest Claude and Gemini, since the company provides them, but they couldn’t even write tests that pass after over a day of trying.
Our current architectures are complex, mostly because of DRY and a natural human tendency to abstract things. But that's a decision, not a fundamental property of code. At its core, most web stuff is "take it out of the database, put it on the screen. Accept it from the user, put it in the database."
If everything were written PHP3-style (add_item.php, delete_item.php, etc.), with minimal includes, a chatbot might be rather good at managing each individual page.
I'm saying code architected to take advantage of human skills and code architected to take advantage of chatbot skills might be very different.
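A minimal sketch of what that one-file-per-action layout could look like in a modern stack (Express and better-sqlite3 picked arbitrarily; the route and table names are made up):

```typescript
// add_item.ts - one self-contained action, PHP3 style: accept it from the user,
// put it in the database. No shared layers, so a chatbot only ever needs this one file.
import express from "express";
import Database from "better-sqlite3";

const app = express();
app.use(express.json());

const db = new Database("shop.db");
db.exec("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)");

app.post("/add_item", (req, res) => {
  const { name, price } = req.body;
  db.prepare("INSERT INTO items (name, price) VALUES (?, ?)").run(name, price);
  res.json({ ok: true });
});

// delete_item.ts, list_items.ts, etc. would each be their own equally self-contained file.
app.listen(3000);
```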
Agreed in general: the models are getting pretty good at dumping out new code, but maintaining or augmenting existing code produces pretty bad results, except for short local autocomplete.
BUT it's noteworthy that how much context the models get makes a huge difference. Feeding a lot of the existing code into the input improves the results significantly.
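As a rough sketch of that pattern (the file paths and the instruction are invented; the actual model call is whatever client you already use):

```typescript
// Concatenate the relevant existing source files into the prompt before asking for a change.
import { readFileSync } from "node:fs";

const files = ["src/parser.ts", "src/ast.ts", "test/parser.test.ts"]; // hypothetical paths

const context = files
  .map((path) => `// FILE: ${path}\n${readFileSync(path, "utf8")}`)
  .join("\n\n");

const prompt = `${context}\n\nAdd a test covering nested comments in the parser.`;
// ...then send `prompt` to whichever model API you use.
```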
Same. Claude Code, for example, will write some tests, but what they are testing is often incorrect.
Most senior SWEs, no. But most technical people in software do a lot of what the parent commenter describes in my experience. At my last company there was a team of about 5 people whose job was just to make small design changes (HTML/CSS) to the website. Many technical people I've worked with over the years were focused on managing and configuring things in CMSs and CRMs which often require a bit of technical and coding experience. At the place I currently work we have a team of people writing simple Python and Node scripts for client integrations.
There's a lot of variety in technical work, with many modern technical jobs involving a bit of code, but not at the same complexity and novelty as the types of problems a senior SWE might be working on. HN is full of very senior SWEs. It's really no surprise people here still find LLMs to be lacking. Outside of HN I find people are far more impressed/worried by the amount of their job an LLM can do.
I agree, but the reason it won’t be an apocalypse is the same reason economists get most things wrong: it’s not an efficient market.
Relatively speaking, we live in a bubble: there are still broad swaths of the economy that operate with pen and paper, and another broad swath only migrated off 1980s-era AS/400 systems in the last few years. Even if we had ASI available literally today (and we don’t), I’d give it 20-30 years until the guy who operates your corner market or the local auto repair shop has any use in the world for it.
I had predicted the same about websites, social media presence, Google Maps presence, etc. back 10-15 years ago, but lo and behold, even the small hole-in-the-wall burger place in rural eastern Europe is now on Google Maps with reviews, and even replies from the owner, plus a Facebook page with info on changes to opening hours. I'd have said there was no way that fat 60-year-old guy would get up to date with online stuff.
But gradually they were forced to.
If there are enough auto repair shops that can just diagnose and process n times more cars in a day, it will absolutely force people to adopt it as well, whether they like the aesthetics or not, whether they feel like learning new things or not. Suddenly they will be super interested in how to use it, regardless of how they were boasting about being old school and hands-on beforehand.
If a technology gives enough of a boost to productivity, there's simply no way for inertia to hold it back, outside of the most strictly regulated fields, such as medicine, which I do expect to lag behind by some years, but which will have to catch up once the benefits are clear in lower-stakes industries and there's such immense demand for it that politicians will be forced to crush the doctors' cartel's grip on things.
This doesn't apply to literal ASI, mostly because copy-pasteable intelligence is an absolute gamechanger, particularly if the physical interaction problems that prevent exponential growth (think autonomous robot factory) are solved (which I'd assume a full ASI could do).
People keep comparing to other tools, but a real ASI would be an agent, so the right metaphor is not the effect of the industrial revolution on workers, but the effect of the internal combustion engine on the horse.
I wonder what the impact will be when replicating the same thing becomes machine readable with near 100% accuracy.
Definitely matches my experience as well. I've been working away on a very quirky, non-idiomatic 3D codebase, and LLMs are a mixed bag there. Y is down, there's no perspective distortion or Z buffer, there are no meshes, it's a weird place.
It's still useful for saving me from writing 12 variations of x1 = sin(r2) - cos(r1) while implementing some geometric formula, but it's absolutely awful at understanding how those fit into a deeply atypical environment. I also have to put blinders on it: giving it too much context just throws it back into that typical 3D rut and has it trying to slip in perspective distortion again.
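A toy example of the sort of repetitive trig that's worth offloading (not the actual codebase; names are invented, Y is treated as down and projection is parallel, matching the quirks above):

```typescript
type Pt3 = { x: number; y: number; z: number };

// Rotate around the Z axis; with Y pointing down, a positive angle turns clockwise on screen.
function rotateZ(p: Pt3, a: number): Pt3 {
  return {
    x: p.x * Math.cos(a) - p.y * Math.sin(a),
    y: p.x * Math.sin(a) + p.y * Math.cos(a),
    z: p.z,
  };
}

// Parallel projection: just drop Z. No perspective divide, no Z buffer.
function project(p: Pt3): { x: number; y: number } {
  return { x: p.x, y: p.y };
}
```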
Yeah I have the same experience. I’ve done some work on novel realtime text collaboration algorithms. For optimisation, I use some somewhat bespoke data structures. (Eg I’m using an order-statistic tree storing substring lengths with internal run-length encoding in the leaf nodes).
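Loosely, that kind of structure might look something like this (a sketch with invented field names, not the actual implementation):

```typescript
// Leaves hold run-length-encoded spans; internal nodes cache their subtree's total length,
// so finding "the leaf containing character N" is O(log n), as in an order-statistic tree.
type Run = { length: number; visible: boolean };

type Leaf = { kind: "leaf"; runs: Run[] };
type Internal = { kind: "internal"; children: TreeNode[]; totalLength: number };
type TreeNode = Leaf | Internal;

function totalLength(node: TreeNode): number {
  if (node.kind === "internal") return node.totalLength;
  return node.runs.reduce((sum, r) => sum + r.length, 0);
}

// Descend by comparing the target offset against each child's cached total.
function findLeaf(node: TreeNode, offset: number): { leaf: Leaf; offset: number } {
  if (node.kind === "leaf") return { leaf: node, offset };
  for (const child of node.children) {
    const len = totalLength(child);
    if (offset < len) return findLeaf(child, offset);
    offset -= len;
  }
  throw new Error("offset out of range");
}
```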
ChatGPT is pretty useless with this kind of code. I got it to help translate a run-length-encoded B-tree from Rust to TypeScript. Even with a reference, it still introduced a bunch of new bugs. Some were very subtle.
It’s just not there yet, but I think it will handle translation-type tasks quite capably in the next 12 months, especially if asked to translate a single file, or a selection in a file, line by line. Right now it’s quite bad, which I find surprising. I have less confidence we’ll see whole-codebase or even module-level understanding of novel topics in the next 24 months.
There’s also a question of the quality of the source data. At least in TypeScript/JavaScript land, the vast majority of code appears to be low quality, buggy, or ignores important edge cases, so even when working on “boilerplate” it can produce code that appears to work but will fall over in production for 20% of users (for example, string handling code that tears Unicode graphemes like emoji).
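For example, this is what that grapheme-tearing failure looks like, next to a grapheme-aware version (a small illustration; the string is arbitrary and Intl.Segmenter needs a reasonably modern runtime):

```typescript
const s = "hi 👋🏽"; // "👋🏽" is a single grapheme: waving hand + skin-tone modifier

// Naive truncation by UTF-16 code units, the kind of code models happily emit:
s.slice(0, 4); // "hi " plus a lone surrogate half - an invalid string
s.slice(0, 5); // "hi 👋" - valid, but the grapheme is torn and the modifier is lost

// Grapheme-aware truncation using the built-in Intl.Segmenter:
function truncateGraphemes(str: string, n: number): string {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(str)].slice(0, n).map((g) => g.segment).join("");
}

truncateGraphemes(s, 4); // "hi 👋🏽" - the emoji stays intact as one unit
```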
I gotta ask: what are you actually doing? Because it sure sounds funky.
Working on extending the [Zdog](https://zzz.dog) library, adding some new types and tooling, patching bugs I run into on the way.
All the quirks inherit from it being based on (and rendering to) SVG. SVG is Y-down, Zdog only adds Z-forward. SVG only has layering, so Zdog only z-sorts shapes as wholes. Perspective distortion needs more than dead-simple affine transforms to properly render beziers, so Zdog doesn't bother.
The thing that really throws LLMs is the rendering. Parallel projection allows for optical 2D treachery, and Zdog makes heavy use of it. Spheres are rendered as simple 2D circles, a torus can be replicated with a stroked ellipse, a cylinder is just two ellipses and a line with a stroke width of $radius. LLMs struggle to even make small tweaks to existing objects/renderers.
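For a flavour of that 2D treachery (TypeScript-flavoured, based on Zdog's documented API, so treat the details as approximate):

```typescript
import Zdog from "zdog"; // assumes type declarations for the zdog package are available

const illo = new Zdog.Illustration({ element: ".zdog-canvas" });

// A "sphere": a single point rendered with a fat stroke, i.e. just a 2D circle.
new Zdog.Shape({ addTo: illo, stroke: 80, color: "#636" });

// A "torus": a stroked ellipse, with the stroke width standing in for the tube.
new Zdog.Ellipse({ addTo: illo, diameter: 120, stroke: 20, color: "#ea0", translate: { x: 150 } });

function animate() {
  illo.rotate.y += 0.02; // parallel projection keeps the illusion intact while everything spins
  illo.updateRenderGraph();
  requestAnimationFrame(animate);
}
animate();
```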
Yep. But wonderful at aggregating details from twelve different man pages to write a shell script I didn't even know was possible to write using the system utils.
I use it for this a lot.
Is it 'only' "aggregating details from twelve different man pages", or has it 'studied' (scraped) all (accessible) code on GitHub/GitLab/Stack Exchange/etc. and any other publicly available coding repositories on the web (and, in MS's case, the GitHub it owns)? Together with descriptions of what is right and what is wrong.
I use it for code, and I only do fine-tuning. When I want something that has clearly never been done before, I 'talk' to it and train it on which method to use, giving suggestions/instructions that are clearly obvious to a human brain (use an Integer and not a Double, or use Color, not Weight). So I do 'teach' it as well when I use it.
Now, I imagine that when 1 million people use LLMs to write code and fine-tune it (the code), we are inherently training the LLMs on how to write even better code.
So it's not just "..different man pages.." but "the finest coding brains (excluding mine) to tweak and train it".
How often are we truly writing actual novel programs that are complex in a way AI does not excel at?
There are many types of complexity, and many things that are complex for a human coder are trivial for an AI and its skill set.
Depends on the field of development you do.
CRUD backend app for a business in a common sector? It's mostly just connecting stuff together (though I would argue that an experienced dev with a good stack takes less time to write it directly than to painstakingly explain it to an LLM in an inexact human language).
Some R&D stuff, or even debugging any kind of code? It's almost useless, as it would require deep reasoning, where these models absolutely break down.
Have you tried debugging using the new "reasoning" models yet?
I have been extremely impressed with o1, o3, o4-mini and Gemini 2.5 as debugging aids. The combination of long context input and their chain-of-thought means they can frequently help me figure out bugs that span several different layers of code.
I wrote about an early experiment with that here: https://simonwillison.net/2024/Sep/25/o1-preview-llm/
Here's a Gemini 2.5 Pro transcript from this afternoon where I'm trying to figure out a very tricky bug: https://gist.github.com/simonw/4e208ab9edb5e6a814d3d23d7570d...
Wait, I’ve found it very good at debugging. It iteratively states a hypothesis, tries things, and reacts to what it sees.
It thinks of things that I don’t think of right away. It tries weird approaches that are frequently wrong but almost always yield some information and are sometimes spot on.
And sometimes there’s some annoying thing where having Claude bang its head against it for $1.25 in API calls is slower than I would be, but I can spend my time and emotional bandwidth elsewhere.
I agree with this. I do mostly DevOps stuff for work and it’s great at telling me about errors with different applications/build processes. Just today I used it to help me scrape data from some webpages and it worked very well.
But when I try to do more complicated math it falls short. I do have to say that Gemini Pro 2.5 is starting to get better in this area though.
> novel and complex things
a) What's an example?
b) Is 90% (or more) of programming mundane, and not really novel?
If you'd like a creative waste of time, make it implement any novel algorithm that mixes the idea of X with Y. It will fail miserably, double down on the failure and hard troll you, run out of context and leave you questioning why you even pay for this thing. And it is not something that can be fixed with more specific training.
Can you give an example? Have you tried it recently with the higher-end models?