Comment by mynameisjody
5 days ago
Every time I see an article like this, it's always missing the key question: but is it any good, is it correct? They always show you the part that is impressive - "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."
Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear..."I got 14 pages of words". But is it a good paper, that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.
My current preference is Codex 5.1 (Sonnet 4.5 as a close second, though it got really dumb today for "some reason"). It's been good to the point where I shipped multiple projects with it without a problem (e.g. https://pine.town, which I made without writing any code myself).
I feel it sometimes tries to be overly correct. Like using BigInts when working with offsets in big files in JavaScript. My files are big, but not 53-bits-of-mantissa big, and no file APIs work with BigInts anyway. This was from Gemini 3 thinking, btw.
I just whack-a-mole these things in AGENTS.md for a while until it codes more like me.
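For anyone puzzled by the 53-bit remark: JavaScript numbers are IEEE-754 doubles, so integers stay exact only up to Number.MAX_SAFE_INTEGER (2^53 - 1), which as a byte offset is roughly 9 petabytes. A rough sketch of why plain numbers suffice for realistic file sizes (the 4 TiB offset is just an illustrative value):

```javascript
// Integers are exact in a double up to 2^53 - 1.
const MAX_SAFE = Number.MAX_SAFE_INTEGER; // 9007199254740991, ~9 PB as a byte offset

// A plain number handles any realistic file offset without precision loss:
const offset = 4 * 1024 ** 4; // 4 TiB, far below 2^53
console.log(Number.isSafeInteger(offset)); // true

// BigInt only becomes necessary past 2^53 - 1, and since number-based
// file APIs expect plain numbers, using it forces conversions everywhere:
const big = 2n ** 60n;
console.log(big > BigInt(MAX_SAFE)); // true
console.log(Number.isSafeInteger(Number(big))); // false - precision is lost here
```

So unless you are actually seeking past ~9 PB, the BigInt plumbing is pure overhead.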
2 replies →
> https://pine.town
how many prompts did it take you to make this?
how did you make sure that each new prompt didn't break some previous functionality?
did you have a precise vision for it when you started or did you just go with whatever was being given to you?
Judging by the site, they don't have insightful answers to these questions. It's broken with weird artifacts, errors, and amateurish console printing in PROD.
https://i.ibb.co/xSCtRnFJ/Screenshot-2025-11-25-084709.png
https://i.ibb.co/7NTF7YPD/Screenshot-2025-11-25-084944.png
3 replies →
> how many prompts did it take you to make this?
Probably hundreds, I'd say.
> how did you make sure that each new prompt didn't break some previous functionality?
For the backend, I reviewed the code and steered it to better solutions a few times (fewer than I thought I'd need to!). For the frontend, I only tested and steered, because I don't know much about React at all.
This was impossible with previous models, I was really surprised that Codex didn't seem to completely break down after a few iterations!
> did you have a precise vision
I had a fairly precise vision, but the LLM made some good contributions. The UI aesthetic is mostly the LLM, as I'm not very good at that. The UX and functionality is almost entirely me.
5 replies →
It's not really any different in my experience.
Stochastic parrot? Autocomplete on steroids? Fancy autocorrect? Bullshit generator? AI snake oil? Statistical mimicry?
You don't hear that anymore.
Feels like a whole generation of skeptics evaporated.
26 replies →
Have you tried Gemini 3 yet? I haven't done any coding with it, but on other tasks I've been impressed compared to GPT-5 and Sonnet 4.5.
It's very good, but it feels kind of off-the-rails compared to Sonnet 4.5 - at least with Cursor it does strange things like putting its reasoning in comments about 15 lines long, deleting 90% of a file for no real reason (especially when context is reaching capacity), and making the same error that I just told it not to make.
The computer science field is going to be an absolute shitshow within 5 years (it already kinda is). On one side you'll have ADHD dog attention span zoomers trying out all these nth party model apis and tools every 5 seconds (switching them like socks, insisting the latest one is better, but ultimately producing the same slop) and on the other side you'll have all these applied math gurus squeezing out the last bits of usable AI compute on the planet... and nothing else.
We used to joke that "the internet was a mistake," making fun of the bad parts... but LLMs take the fucking cake. No intelligent beings, no sentient robots, just unlimited amounts of slop.
The tech basically stopped evolving right around the point of being good enough for spam and slop, but not going any further; there are no cures, no new laws of physics or math, nothing else being discovered by these things. All AI use in science I can see is based on finding patterns in data, not intelligent thought (as in novel ideas). What a bust.
21 replies →
Only a tiny bit, but I should. When you say GPT-5, do you mean 5.1? Codex or regular?
2 replies →
imo don't waste your time coding with Gemini 3. Perhaps worth it if it's something Claude's not helping with, as Gemini 3's reasoning is supposedly very good.
Maybe the wtfs per line are decreasing because these models aren't saying anything interesting or original.
No, it's because they write correct code. Why would I want interesting code?
2 replies →
I guess you have a couple of options.
You could trust the expert analysis of people in that field. You can hit personal ideologies or outliers, but asking several people seems to find a degree of consensus.
You could try varying tasks that perform complex things that result in easy to test things.
When I started trying chatbots for coding, one of my test prompts was
That was about the level where some models would succeed and some would fail.
Recently I found
Produced a nice demo with slider for parameters, a few refinements (hierarchical scaling version) and I got it to produce the same interface as a module that I had written myself and it worked as a drop in replacement.
These things are fairly easy to check because if it is performant and visually correct then it's about good enough to go.
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.
> You could trust the expert analysis of people in that field
That’s the problem - the experts all promise stuff that can’t be easily replicated. The promises the experts make don’t match the model. The same request might succeed or might fail, and might fail in such a way that subsequent prompts might recover or might not.
The experts I am talking about trusting here are the ones doing the replication, not the ones making the claims.
That's how working with junior team members or open source project contributors goes too. Perhaps that's the big disconnect. Reviewing and integrating LLM contributions slotted right into my existing workflow on my open source projects. Not all of them work. They often need fixing, stylistic adjustments, or tweaking to fit a larger architectural goal. That is the norm for all contributions in my experience. So the LLM is just a very fast, very responsive contributor to me. I don't expect it to get things right the first time.
But it seems lots of folks do.
Nevertheless, style, tweaks, and adjustments are a lot less work than banging out a thousand lines of code by hand. And whether an LLM or a person on the other side of the world did it, I'd still have to review it. So I'm happy to take increasingly common and increasingly sophisticated wins.
3 replies →
> Things I don't understand must be great?
Couple it with the tendency to please the user by all means, and it ends up lying to you, but you won’t ever realise unless you double-check.
> Couple it with the tendency to please the user by all means
Why aren't foundational model companies training separate enterprise and consumer models from the get go?
I think they get to that a couple of paragraphs later:
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
Well, that's why people still have jobs, but I appreciate the post's point that the neat demo used to be a coherent paragraph or a silly poem. The silly poems were all kind of similar and not very funny, and the paragraphs were a good start, but I wouldn't use them for anything important.
Now the tightrope is a whole application or a 14-page paper, and the short pieces of code and prose are professional quality more often than not. That's some serious progress.
The author goes into the strengths and weaknesses of the paper later in the article.
I keep trying out different models. Gemini 3 is pretty good. It’s not quite as good at one shotting answers as Grok but overall it’s very solid.
Definitely planning to use it more at work. The integrations across Google Workspace are excellent.
The author actually discusses the results of the paper. He's not some rando but a Wharton Professor and when he is comparing the results to a grad student, it is with some authority.
"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."
I think the point is we’re getting there. These models are growing up real fast. Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
> Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but ok.
Education is not just a funding issue. Policy choices, like making it impossible for students to fail, which means they have no incentive to learn anything, can be more impactful.
18 replies →
It's not just investing in education, it's using tools proven to work. WA spends a ton of money on education, and yet on reading, Mississippi, the worst state on almost every metric, has beaten them. The difference? Mississippi went hard on supporting students and on phonics, which is proven to work. WA still uses the hippie theory of guessing words from pictures (https://en.wikipedia.org/wiki/Whole_language) for learning how to read.
Investing in education is a trap because no matter how much money is pumped into the current model, it’s not making a difference.
We need different models and then to invest in the successes, over and over again…forever.
11 replies →
Education funding is highest in places that have the worst results. Try again.
10 replies →
In theory yeah, but in practice 54% will also vote against funding education. Catch-22.
19 replies →
You don't need an educated workforce if you have machines that can do the work reliably. The more important question is: who will buy your crap if your population is too poor due to a lack of well-paying jobs? A look towards England or Germany offers the answer.
3 replies →
Unfortunately, people are born with a certain intellectual capacity and can't be improved beyond that with any amount of training or education. We're largely hitting peoples' capacities already.
We can't educate someone with 80 IQ to be you; we can't educate you (or me) into being Einstein. The same way we can't just train anyone to be an amazing basketball player.
4 replies →
A question for the not-too-distant future:
What use is an LLM in an illiterate society?
Automatic speech recognition and speech to text models are also growing up real fast.
3 replies →
> What use is an LLM in an illiterate society?
The ability to feign literacy, such that critical thought and the ability to express it are not prerequisites.
Absurd question. The correct one is "what use is an illiterate in an LLM society".
> But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
You don't use it that way. You use it to help you build and run experiments, and help you discuss your findings, and in the end helps you write your discoveries. You provide the content, and actual experiments provide the signal.
Like clockwork. Each time someone criticizes any aspect of any LLM there's always someone to tell that person they're using the LLM wrong. Perhaps it's time to stop blaming the user?
If someone says that they can't get a camera to work, you tell them how to fix it, right? I can't think of what other response is appropriate.
2 replies →
You wouldn't use a screwdriver to hammer a nail. Understanding how to use a tool is part of using the tool. It's early days and how to make the best use of these tools is still being discovered. Fortunately a lot of people are experimenting on what works best, so it only takes a little bit of reading to get more consistent results.
1 reply →
You can recognise that the technology has a poor user interface and is wrought with subtleties without denying its underlying capabilities. People misuse good technology all the time. It's kind of what users do. I would not expect a radically new form of computing which is under five years old to be intuitive to most people.
> It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It’s like the Gell-Mann amnesia effect applied to AI. :)
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
This is a variation of the Gell-Mann amnesia effect: https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
One could say, the GeLLMann amnesia effect. ( ͡° ͜ʖ ͡°)
Thanks for introducing me to this article.
Loads of AI chatter is the Murray Gell-Mann Amnesia Effect on steroids
For what it's worth I have been using Gemini 2.5/3 extensively for my masters thesis and it has been a tremendous help. It's done a lot of math for me that I couldn't have done on my own (without days of research), suggested many good approaches to problems that weren't on my mind and helped me explore ideas quickly. When I ask it to generate entire chapters they're never up to my standard but that's mostly an issue of style. It seems to me that LLMs are good when you don't know exactly what you want or you don't care too much about the details. Asking it to generate a presentation is an utter crap shoot, even if you merely ask for bullet points without formatting.
> It's done a lot of math for me that I couldn't have done on my own (without days of research),
Isn't the point of doing the master's thesis that you do the math and research, so that you learn and understand the math and research?
I bet they were talking about how people didn't do long division when the calculator first came out too. Is using matlab and excel ok but AI not? Where do we draw the line with tools?
1 reply →
Apparently not. This is the most perfect example I've seen in a while of "I can recite it, but I don't understand it, so I don't know if it's really right or not."
1 reply →
Truth is, you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.
Without knowing how to use this “PROBABILISTIC” slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need hype or the bubble will burst, taking the whole market with it, so shush me.