Every time I see an article like this, the same question is missing: but is it any good, is it correct? They always show you the part that is impressive - "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."
Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear..."I got 14 pages of words". But is it a good paper, that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.
My current preference is Codex 5.1 (Sonnet 4.5 a close second, though it got really dumb today for "some reason"). It's been good to the point where I've shipped multiple projects with it without a problem (e.g. https://pine.town is one I made without writing any code).
I feel it sometimes tries to be overly correct, like using BigInts when working with offsets in big files in JavaScript. My files are big, but not 53-bits-of-mantissa big, and no file APIs work with BigInts anyway. This was from Gemini 3 Thinking, btw.
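For what it's worth, the arithmetic backs this up: plain JS numbers represent integers exactly up to 2^53 - 1, which as a byte offset is roughly 9 petabytes. A tiny guard like this (the helper name is mine, purely illustrative) is usually all you need instead of BigInt:

```javascript
// Plain JS numbers hold integers exactly up to 2^53 - 1
// (Number.MAX_SAFE_INTEGER), i.e. ~9 PB when used as a byte offset.
// This hypothetical guard keeps offsets as plain numbers and only
// complains in the absurd case.
function safeOffset(offset) {
  if (!Number.isSafeInteger(offset)) {
    throw new RangeError(
      `offset ${offset} exceeds 2^53 - 1; a plain number can't represent it exactly`
    );
  }
  return offset; // usable directly with fs.read, Blob.slice, DataView, etc.
}

console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991
console.log(safeOffset(4 * 2 ** 30)); // a 4 GiB offset is nowhere near the limit
```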
You could trust the expert analysis of people in that field. You can hit personal ideologies or outliers, but asking several people seems to find a degree of consensus.
You could try varied tasks that do something complex but produce results that are easy to test.
When I started trying chatbots for coding, one of my test prompts was
Create a JavaScript function edgeDetect(image) that takes an ImageData object and returns a new ImageData object with all direction Sobel edge detection.
That was about the level where some models would succeed and some would fail.
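For context, this is roughly the function that prompt asks for: a Sobel operator over RGBA pixels, sketched here against a plain `{ data, width, height }` object so it also runs outside the browser (a real ImageData has the same shape). The grayscale weights and the untouched one-pixel border are my choices, not part of the prompt:

```javascript
// Rough sketch of the test prompt's target: Sobel edge detection over an
// ImageData-shaped object ({ data, width, height }, RGBA bytes).
function edgeDetect(image) {
  const { data, width, height } = image;

  // Convert RGBA to grayscale first (Rec.601-style weights).
  const gray = new Float32Array(width * height);
  for (let i = 0; i < width * height; i++) {
    gray[i] = 0.299 * data[i * 4] + 0.587 * data[i * 4 + 1] + 0.114 * data[i * 4 + 2];
  }

  const gxK = [-1, 0, 1, -2, 0, 2, -1, 0, 1]; // horizontal Sobel kernel
  const gyK = [-1, -2, -1, 0, 0, 0, 1, 2, 1]; // vertical Sobel kernel
  const out = new Uint8ClampedArray(data.length); // border pixels stay zeroed

  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      let gx = 0, gy = 0, k = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++, k++) {
          const g = gray[(y + dy) * width + (x + dx)];
          gx += gxK[k] * g;
          gy += gyK[k] * g;
        }
      }
      const mag = Math.min(255, Math.hypot(gx, gy)); // gradient magnitude
      const o = (y * width + x) * 4;
      out[o] = out[o + 1] = out[o + 2] = mag;
      out[o + 3] = 255;
    }
  }
  return { data: out, width, height };
}
```

Easy to eyeball too: feed it an image with a hard vertical boundary and only the boundary column should light up.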
Recently I found
Can you create a webgl glow blur shader that takes a 2d canvas as a texture and renders it onscreen with webgl boosting the brightness so that #ffffff is extremely bright white and glowing,
That produced a nice demo with sliders for the parameters. After a few refinements (a hierarchical-scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement.
These things are fairly easy to check because if it is performant and visually correct then it's about good enough to go.
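To illustrate why this is easy to check: the effect reduces to per-pixel math you can verify directly. This is not the shader itself, just the classic bloom-style "bright pass" such a glow is built around, in plain JS, with made-up parameter names (`gain`, `threshold`):

```javascript
// The core math of a glow/bloom pre-pass: keep only pixels brighter than a
// threshold and scale them past 1.0; a blur then spreads the overflow so
// #ffffff reads as "extremely bright and glowing". Channels are 0..1 floats.
function brightPass([r, g, b], gain = 2.5, threshold = 0.8) {
  const lum = 0.2126 * r + 0.7152 * g + 0.0722 * b; // Rec.709 luminance
  const k = lum > threshold ? gain : 0.0;           // keep only the bright parts
  return [r * k, g * k, b * k];                     // >1.0 values are what glow
}
```

If the output is visibly "white pixels bloom, mid-grays don't", the pass is doing its job.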
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.
> You could trust the expert analysis of people in that field
That’s the problem - the experts all promise stuff that can’t be easily replicated. The promises the experts make don’t match the model. The same request might succeed or might fail, and might fail in such a way that subsequent prompts might recover or might not.
I think they get to that a couple of paragraphs later:
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
Well, that's why people still have jobs but I appreciate the idea of the post that the neat demo was a coherent paragraph or silly poem. The silly poems were all kind of similar, not very funny, and the paragraphs were a good start but I wouldn't use them for anything important.
Now the tightrope is a whole application or a 14 page paper and the short pieces of code and prose are now professional quality more often than not. That's some serious progress.
The author actually discusses the results of the paper. He's not some rando but a Wharton Professor and when he is comparing the results to a grad student, it is with some authority.
"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."
I think the point is we’re getting there. These models are growing up real fast. Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
You don't use it that way. You use it to help you build and run experiments, to discuss your findings, and, in the end, to write up your discoveries. You provide the content, and actual experiments provide the signal.
Like clockwork. Each time someone criticizes any aspect of any LLM there's always someone to tell that person they're using the LLM wrong. Perhaps it's time to stop blaming the user?
For what it's worth I have been using Gemini 2.5/3 extensively for my masters thesis and it has been a tremendous help. It's done a lot of math for me that I couldn't have done on my own (without days of research), suggested many good approaches to problems that weren't on my mind and helped me explore ideas quickly. When I ask it to generate entire chapters they're never up to my standard but that's mostly an issue of style. It seems to me that LLMs are good when you don't know exactly what you want or you don't care too much about the details. Asking it to generate a presentation is an utter crap shoot, even if you merely ask for bullet points without formatting.
Truth is, you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.
Without knowing how to use this "PROBABILISTIC" slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
Majority of ppl use LLMs incorrectly.
Majority of ppl selling LLMs as a panacea for everything are lying.
But we need hype or the bubble will burst, taking the whole market with it, so shuushh me.
It is interesting that most of our modes of interaction with AI are still just textboxes. The only big UX change in the last three years has been the introduction of the Claude Code / OpenAI Codex tools. They feel amazing to use, like you're working with another independent mind.
I am curious what the user interfaces of AI will look like in the future; I think whoever cracks that will create immense value.
Text is very information-dense. I'd much rather skim a transcript in a few seconds than watch a video.
There's a reason keyboards haven't changed much since the 1860s when typewriters were invented. We keep coming up with other fun UI like touchscreens and VR, but pretty much all real work happens on boring old keyboards.
I’ve been using ChatGPT Atlas since release on my personal laptop. I very often have it generate a comprehensive summary for YouTube videos, so I don’t have to sit there and watch/scrub a half hour video when a couple of pages of text contains the same content.
The gist is that keyboards are optimized for ease of use but that there could be other designs which would be harder to learn but might be more efficient.
And anyone that has ever tried to talk to Siri or Alexa would prefer a keyboard for anything but the most simple questions. I don't think that will change for a long time if ever. The lack of errors and being able to say exactly what you want is so valuable.
No matter how good a keyboard we might be able to invent it'll always be slower than a direct brain interface, and we have those, in a highly experimental way, now.
One day we will look back at improvements to keyboards and touchscreens as the 'faster horse' of the physical interface era.
Unix CLI utilities have been all text for 50 years. Arguably that is why they are still relevant. Attempts to impose structured data on the paradigm like those in PowerShell have their adherents and can be powerful, but fail when the data doesn't fit the structure.
We see similar tendency toward the most general interfaces in "operator mode" and similar the-AI-uses-the-mouse-and-keyboard schemes. It's entirely possible for every application to provide a dedicated interface for AI use, but it turns out to be more powerful to teach the AI to understand the interfaces humans already use.
PowerShell is completely suitable. People are just used to bash and don’t feel the incentive to switch, especially with Windows becoming less relevant outside of desktop development.
Yet the most popular platforms on the planet have people pointing a finger (or several) at a picture.
And the most popular media format on the planet is and will be (for the foreseeable future), video. Video is only limited by our capacity to produce enough of it at a decent quality, otherwise humanity is definitely not looking back fondly at BBSes and internet forums (and I say this as someone who loves forums).
GenAI will definitely need better UIs for that kind of universal adoption (think smartphones: 8-9 billion people).
When we have really fast and good models, they'll be able to generate a GUI on the fly. It could probably be done now with a fine-tune on some kind of XML-based UI schema or something. I gave it a try but couldn't figure it out entirely; consistency would be an issue too.
I agree. I think the world is specifically multi-modal. Getting a chat to be truly multi-modal, i.e. interacting with different data types and text in a unified way, is going to be the next big thing. Given how robotics is taking off, 3D might be another important aspect of it. At vlm.run we are trying to make this possible: combining VLMs and LLMs in a seamless way to get the best UI. https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7
Personally I find the information density of text to be the "killer feature". I've tried voice interaction (even built some AI Voice Agents) and while they are very powerful, easy to use and just plain cool, they are also slow.
Nothing beats skimming over a generated text response and just picking out chunks of text, going back and forth, rereading, etc.
Text is also universal, I can't copy-paste a voice response to another application/interface or iterate over it.
My personal view is that the search for a better AI user interface is just further dumbing-down of the humans who use these interfaces. Another comment mentioned that the most popular platforms are people pointing fingers at pictures, and that without a similar UI/UX, AI would never reach such adoption rates - but is that what we want? Monkeys pointing at colorful picture blobs?
People get a little too hung up on finding the AI UI. It does not seem at all necessary that the interfaces will be much different (while the underlying tech certainly will be).
Text and boxes and tables and graphs are what we can cope with. And while AI is going to change much, we are not.
I get what you’re saying here, and you’re right that other UIs will be a big deal in the near future… but I don’t think it’s fair to say “just” textboxes.
This is HN. A lot of us work remotely. Speaking for myself, I much prefer to communicate via Slack (“just a textbox”) over jumping into a video call. This is especially true with technical topics, as text is both more dense and far more clear than speech in almost all cases.
Grok has been integrated into Tesla vehicles, and I've had several voice interactions with it recently. Initially, I thought it was just a gimmick, but the voice interactions are great and quite responsive. I've found myself using it multiple times to get updates on the news or quick questions about topics I'm interested in.
If you are interested in UX, a YouTube series I found enjoyable and thought-provoking is "Liber Indigo" (sorry, on mobile).
What comes after the desktop metaphor and mobile? There is VR but... no one is sure it will get anywhere. It's cool but probably won't supplant tradition.
Maybe the ability of AI to accept somewhat imprecise inputs will help us get away from text. Multimodal gesture, voice, and touch, perhaps? We would all be acting with our bodies, like players on a stage, to convey to the machine where we wish to turn its attention.
Ooooh, it bothers me, so, so, so much. Too perky. Weirdly casual. Also, it's based on the old 4o code - sycophancy and higher hallucinations - watch out. That said, I too love the omni models, especially when they're not nerfed. (Try asking for a Boston, New York, Parisian, Haitian, Indian and Japanese accent from 4o to explore one of the many nerfs they've done since launch)
> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
From my experience we just get both. The constant risk of some catastrophic hallucination buried in the output, in addition to more subtle, and pervasive, concerns. I haven't tried with Gemini 3 but when I prompted Claude to write a 20 page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14 page research paper would stand up to scrutiny.
I feel like hallucinations have changed over time from factual errors randomly shoehorned into the middle of sentences to the LLMs confidently telling you they are right and even provide their own reasoning to back up their claims, which most of the time are references that don't exist.
I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using the framework. It spit out some great-looking code but sadly it completely made up an entire stack of functionality that the framework doesn't support.
The conventions even matched the rest of the framework, so it looked kosher, and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit-hole of writing some very bad code and trying to do some very unconventional things (modifying configuration code in a different part of the project that was not needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and trying to rush toward an acceptable answer.
Disappointingly, that is an exceedingly good story for a high school assignment. The use of an appositive phrase alone would raise alarm bells though.
It's nitpicking for flaws, but why not -- what lens on an old DSLR, older than a car, will let you take a macro shot, a wide shot, and a zoom shot of a bird?
In any case I'm not surprised. It's a short story, and it is indeed _serviceable_, but literature is more than just service to an assignment.
> But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.
I feel like I've been hearing this for at least 1.5 years at this point (since the launch of GPT 4/Claude 3). I certainly agree we've been heading in this direction but when will this become unambiguously true rather than a phrase people say?
I don't imagine there will ever be a time when it will be unambiguously true, any more than a boss could ever really unambiguously say their job is "manager who directs subordinates" vs "manager who fixes subordinates' mistakes".
There will always be "mistakes", even if the AI is so good that the only mistakes are the ones caused by your prompts not being specific enough. It will always be a ratio where some portion of your requests can be served without intervention and some portion needs correction, and that ratio has been consistently improving.
There's no bright line - you should download some CLI tools, hook up some agents to them, and see what you think. I'd say most people working with them think we're on the "other side" of the "will this happen?" probability distribution, regardless of where they personally place their own work.
> So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.
As a current graduate student, I have seen similar comments in academia. My colleagues agree that a conversation with these recent models feels like chatting with an expert in their subfield. I don't know whether research as a field will be immune to advances in AI tech. I still hope this world values natural intelligence, and the drive to do things, more heavily than a robot brute-forcing its way into saying the "right" things.
> if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.
With coding it feels more like working with two devs - one is a competent intermediate level dev, and one is a raving lunatic with zero critical thinking skills whatsoever. Problem is you only get one at a time and they're identical twins who pretend to be each other as a prank.
I have an exercise I like to do where I put two SOTA models face-to-face to talk about whatever they want.
When I did it last week with Gemini-3 and chatGPT-5.1, they got on the topic of what they are going to do in the future with humans who don't want to do any cognitive task. That beyond just AI safety, there is also a concern of "neural atrophy", where humans just rely on AI to answer every question that comes to them.
The models then went on discussing if they should just artificially string the humans along, so that they have to use their mind somewhat to get an answer. But of course, humans being humans, are just going to demand the answer with minimal work. It presents a pretty intractable problem.
Widespread cognitive atrophy is virtually certain, and part of a longer trend that goes beyond just LLMs.
The same is true of other aspects of human wellbeing. Cars and junk food have made the average American much less physically fit than a century ago, but that doesn't mean there aren't lively subcultures around healthy eating and exercise. I suspect there will be growing awareness of cognitive health (beyond traditional mental health/psych domains), and indeed there are already examples of this.
Yes, the average person will get dumber, but the overall distribution will become increasingly bimodal.
Other people spearheaded the commodity hardware towards being good enough for the server room. Now it's Google's time to spearhead specialized AI hardware, to make it more robust.
Really nitpicky I know but GPT-3 was June 2020. ChatGPT was 3.5 and the author even gets that right in an image caption. That doesn’t make it any more or less impressive though.
I find Gemini 3 to be really good. I'm impressed. However, the responses still seem to be bounded by the existing literature and data. If asked to come up with new ideas to improve on existing results for some math problems, it tends to recite known results only. Maybe I didn't challenge it enough or present problems that have scope for new ideas?
I don't know enough about maths to know if this classifies as 'improving on existing results', but at least it was a good enough for Terrence Tao to use it for ideas.
I myself tried a similar exercise (w/Thinking with 3 Pro), seeing if it could come up with an idea that I'm currently writing up that pushes past/sharpens/revises conventional thinking on a topic. It regurgitated standard (and at times only tangentially related) lore, but it did get at the rough idea after I really spoon fed it. So I would suspect that someone being impressed with its "research" output might more reflect their own limitations rather than Gemini's capabilities. I'm sure a relevant factor is variability among fields in the quality and volume of relevant literature, though I was impressed with how it identified relevant ideas and older papers for my specific topic.
In fairness, how much time did you give it? How many totally new ideas does a professional researcher have each day? or each week?
A lot of professional work is diligently applying knowledge to a situation, using good judgement for which knowledge to apply. Frontier AIs are really, really good at that, with the knowledge of thousands of experts and their books.
That's the inherent limit of these models, and it's what keeps humans relevant.
With the current state of architectures and training methods - they are very unlikely to be the source of new ideas. They are effectively huge librarians for accumulated knowledge, rather than true AI.
Then again, an unintelligent human librarian would be nowhere near as useful as a good LLM.
Current LLMs exist somewhere between "unintelligent/unthinking" and "true AI," but lack of agreement on what any of these terms mean is keeping us from classifying them properly.
Novel solutions require a guided brute-force search over a knowledge database/search engine (NOT a search over the model's weights, and NOT chain of thought), combined with adaptive goal creation and evaluation, and reflective contrast against internal "learned" knowledge. Not only that, but it also requires exploring the lower-probability space, i.e. the results less explored, otherwise you always end up with the most common and likely answers. That means being able to quantify what a "less likely but more novel solution" even is, which is a problem in itself. Transformer-architecture LLMs do not come close to approaching AI in this way.
All the novel solutions humans create are a result of combining existing solutions (learned or researched in real-time), with subtle and lesser-explored avenues and variations that are yet to be tried, and then verifying the results and cementing that acquired knowledge for future application as a building block for more novel solutions, as well as building a memory of when and where they may next be applicable. Building up this tree, to eventually satisfy an end goal, and backtracking and reshaping that tree when a certain measure of confidence stray from successful goal evaluation is predicted.
This is clearly very computationally expensive. It is also very different from the statistical pattern repeaters we are currently using, especially considering that their entire premise works because the algorithm chooses the next most probable token, which is a function of the frequency with which that token appears in the training data. In other words, the algorithm is designed explicitly NOT to yield novel results, but rather to return the most likely result. Higher-temperature results tend to reduce textual coherence rather than increase novelty, because token frequency is a literal proxy for textual coherence in coherent training samples, and there is no actual "understanding" happening, nor reflection on the probability results at this level.
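The sampling step being described can be made concrete. A minimal sketch, with made-up logit values: temperature rescales the distribution over candidate tokens, and raising it flattens probabilities toward uniform noise rather than steering toward coherent-but-novel output:

```javascript
// Miniature version of the token-sampling step: model scores (logits) are
// turned into a probability distribution, and temperature T rescales them.
// T -> 0 collapses onto the single most likely continuation; large T
// flattens toward uniform, which reads as noise, not novelty.
function softmaxWithTemperature(logits, T) {
  const scaled = logits.map((z) => z / T);
  const m = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [4.0, 2.0, 0.5]; // made-up scores for three candidate tokens
console.log(softmaxWithTemperature(logits, 0.2)); // top token dominates: near-deterministic
console.log(softmaxWithTemperature(logits, 5.0)); // near-uniform: incoherent, not "novel"
```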
I'm sure smart people have figured a lot of this out already - we have general theory and ideas to back this, look into AIXI for example, and I'm sure there is far newer work. But I imagine that any efficient solutions to this problem will permanently remain in the realm of being a computational and scaling nightmare. Plus adaptive goal creation and evaluation is a really really hard problem, especially if text is your only modality of "thinking". My guess would be that it would require the models to create simulations of physical systems in text-only format, to be able to evaluate them, which also means being able to translate vague descriptions of physical systems into text-based physics sims with the same degrees of freedom as the real world - or at least the target problem, and then also imagine ideal outcomes in that same translated system, and develop metrics of "progress" within this system, for the particular target goal. This is a requirement for the feedback loop of building the tree of exploration and validation. Very challenging. I think these big companies are going to chase their tails for the next 10 years trying to reach an ever elusive intelligence goal, before begrudgingly conceding that existing LLM architectures will not get them there.
So when should we start to be worried, as developers? Like, I don't use these tools yet, for cost + security reasons. But you can see it's getting there, mostly. It used to take a day to find a complex algorithm, understand it, and implement it in your code; now you can just ask an AI to do it for you, and it can succeed in a few minutes. How long before the number of engineers needed to maintain a product is divided by 2? By 10? How about all the boring dev jobs that were previously needed, but not so much anymore, like basic CRUD applications? It's seriously worrying; I don't really know what to think.
Here's an alternative way to think about that: how long until the value I can deliver as a developer goes up by a factor of 2, or a factor of 10?
How many companies that previously would never have dreamed of commissioning custom software are now going to be in the market for it, because they don't have to spend hundreds of thousands of dollars and wait 6 months just to see if their investment has a chance of paying off or not?
The value you can deliver doesn't necessarily correlate with your compensation, though
Cleaning staff also offer a business a huge amount of value. No-one wants to eat at a restaurant that's dirty and stinks. Unfortunately the staff aren't paid very well
The thing is that the world is already flooded by software, games, websites, everyone is just battling for attention. The demand for developers cannot rise if consumers have a limited amount of money and time anyways.
> So when should we start to be worried, as developers ?
I've been worrying ever since ChatGPT 3 came out; it was shit at everything, but it was amazing as well. And in the last 3 years the progress has been incredible. I don't know if you "should" worry - worrying for the sake of it doesn't help much - but yes, we should all be mentally prepared for the possibility that we won't be able to make a living doing this X years from now. Could be 5, could be 10, could be less than 5 even.
God, I’d love to once again be working at a company where coding speed mattered.
Meanwhile in non-tech Bigcos the slow part of everything isn’t writing the code, it’s sorting out access and keys and who you’re even supposed to be talking to, and figuring out WTF people even want to build (and no, they can’t just prompt an LLM to do it because they can’t articulate it well, and don’t have any concept of what various technologies can and cannot do).
The code is already like… 5% of the time, probably. Who gives a damn if that’s on average 2x as fast?
I was never an AI guy; I have always had a healthy dose of suspicion towards it. A week ago I decided to try it. I had ported the lovely c-rrb library and was pretty satisfied with the result. However, when I was done with the basic port I gave Gemini a go, and the result was an almost 3x speed increase for some basic fundamental operations, and a lot less memory use.
It did introduce bugs that it couldn't solve, but with a debugger it wasn't that hard to pin them down.
I start to genuinely wonder where the place for us humans are in this. All I see is human beings being crowded out. Capital via LLMs taking the place of humans.
That's also why I don't use these tools that much. You have big AI companies, known for illegally harvesting humongous amounts of data and not disclosing their datasets. And then you give them control of your computer, without any way to cleanly audit what's going in and out. It's seriously insane to me that most developers seem not to care about that. We've all been educated never to push critical info to a server (private keys and other secrets), but these tools do just that, and you can't even trust what it's going to be used for. On top of that, you're also giving your only value (writing good code) to a third-party company that will use it to replace you.
Can't speak to Claude Code/Desktop, but any of the products that are VS Code forks have workspace restrictions on what folders they're allowed to access (for better and worse). Other products (like Warp terminal) that can give access to the whole filesystem come with pre-set strict deny/allow lists on what commands are allowed to be executed.
It's possible to remove some of these restrictions in these tools, or to operate with flags that skip permissions checks, but you have to intentionally do that.
Talking about VS Code itself (with Copilot), I have witnessed it accessing files referenced from within a project folder but stored outside of it without being given explicit permission to, so I am pretty sure it can leak information and potentially even wreak havoc outside its boundaries.
Except that if you give it shell access, you aren't really protected from Gemini 2.5 Pro going "mad" and starting to rm -rf stuff or writing some shady Perl scripts.
I've compiled the "pelicans riding bicycles" benchmark into a single page[0]. It only spans a year, and not every model is exactly comparable, but you can see clear differences between a year ago and today.
For anyone giving full access to an AI agent, only do so from within the confines of a VM or other containerized environment and back up everything somewhere the agent can't reach.
Like the warning at the bottom says, they can delete files without warning.
The great transition and technological advancement we're seeing: five years ago it was just a dream, three years ago everything seemed magical, and today AI is everywhere, having come this far in almost no time.
I have Gemini Pro included in my Google Workspace accounts; however, I find the responses by ChatGPT more "natural", or maybe more in line with what I want the response to be. Maybe it is only me.
For whatever reason, Gemini 3 is the first AI I have used for intelligence rather than skills. I suspect a lot more will follow, but it's a major threshold to be broken.
I used GPT/Claude a ton for writing code, extracting knowledge from docs, formatting graphs and tables, etc.
But Gemini 3 crossed a threshold where conversations about topics I was exploring, or about product design, were actually useful. Instead of me asking "what design pattern would be useful here" or something like that, it introduces concepts to the conversation. That's a new capability and a step-function improvement.
I recently (last week) used Nano Banana Pro for some specific image generation. It was leagues ahead of 2.5.

Today I used Gemini 3 to refine a very hard-to-write email. It made some really good suggestions. I did not take its email text verbatim; instead I used the text and suggestions to improve my own email. I did a few drafts with Gemini 3 critiquing them, and got very useful feedback. My final submission of "...evaluate this email..." got Gemini 3 to say something like "This is 9.5/10". I sorta pride myself on my writing skills, but must admit that my final version was much better than my first. Gemini kept track of the whole chat thread, noting changes from previous submissions -- kinda eerie really. Total time: maybe 15 minutes.

Do I think Gemini will write all my emails verbatim, copy/paste? No. Does Gemini make me (already a pretty good writer) much better? Absolutely.

I am starting to sort of laugh at all the folks who seem to want to find issues: someone criticizing Nano Banana because it did not provide excellent results given a prompt that I could barely understand, or folks who criticize Gemini 3 because they cannot copy/paste results and expect to simply copy/paste text with no further effort on their side. Myself, I find these tools pretty damn impressive. I need to ensure I provide good image prompts. I need to use Gemini 3 as a sounding board to help me do better, rather than lazily hoping to copy/paste. My experience... Thanks Google. Thanks OpenAI (I also use ChatGPT similarly -- just for text). HTH, NSC
First, the fact we have moved this far with LLMs is incredible.
Second, I think the PhD paper example is a disingenuous example of capability. It's a cherry-picked iteration on a crude analysis of some papers that have done the work already with no peer-review. I can hear "but it developed novel metrics", etc. comments: no, it took patterns from its training data and applied the pattern to the prompt data without peer-review.
I think the fact the author had to prompt it with "make it better" is a failure of these LLMs, not a success, in that it has no actual understanding of what it takes to make a genuinely good paper. It's cargo-cult behavior: rolling a Magic 8 Ball until we are satisfied with the answer. That's not good practice; it's wishful thinking. This application of LLMs to research papers is causing a massive mess in the academic world because, unsurprisingly, AI practitioners face no risk and high reward for uncorrected behavior:
I’m not sure even $1T has been spent. Pledged != spent.
Some estimates have it at ~$375B by the end of 2025. It makes sense, there are only so many datacenters and engineers out there and a trillion is a lot of money. It’s not like we’re in health care. :)
I'm getting grifted hard by Gemini 3 at this point.
I've been working with the chat bot online on a local web application for about 4 days now.
It's markedly worse than Claude at this point. It does not listen to any new instructions and insists on making wrong changes it thought up 20 messages ago that I have already repeatedly said were bad.
Run the same prompt on old models and the current "SOTA" and you'll get pretty much the same answer word for word.
People think models have improved because tooling around the models (Claude Code, Cline, or your other favorite LLM wrapper) has improved, not because the models themselves have made any kind of leap.
Every time I see an article like this, it's always missing --- but is it any good, is it correct? They always show you the part that is impressive - "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."
Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear..."I got 14 pages of words". But is it a good paper, that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.
My current preference is Codex 5.1 (Sonnet 4.5 a close second, though it got really dumb today for "some reason"). It's been good to the point where I've shipped multiple projects with it without a problem (e.g. https://pine.town, which I made without writing any code).
I feel it sometimes tries to be overly correct, like using BigInts when working with offsets in big files in JavaScript. My files are big, but not 53-bits-of-mantissa big, and no file APIs work with BigInts. This was from Gemini 3 thinking, btw.
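For concreteness, the cutoff is well defined: plain JavaScript numbers stay exact up to Number.MAX_SAFE_INTEGER (2^53 - 1, roughly 9 petabytes as a byte offset), so a guard is usually enough instead of BigInt. A minimal sketch, with the `checkOffset` helper being illustrative rather than any real API:

```javascript
// Plain numbers represent byte offsets exactly up to 2**53 - 1 (~9 PB),
// which is also what Node's fs APIs expect for offsets.
const MAX_SAFE_OFFSET = Number.MAX_SAFE_INTEGER; // 2**53 - 1

// Hypothetical helper: reject offsets that would silently lose
// precision in the 53-bit mantissa, rather than switching to BigInt.
function checkOffset(offset) {
  if (!Number.isSafeInteger(offset)) {
    throw new RangeError(`offset ${offset} is not exactly representable`);
  }
  return offset;
}

console.log(checkOffset(4 * 2 ** 40)); // a 4 TiB offset is still exact
```

Even a multi-terabyte file is orders of magnitude below the point where the mantissa runs out, which is the commenter's point.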
> https://pine.town
How many prompts did it take you to make this?
How did you make sure that each new prompt didn't break some previous functionality?
Did you have a precise vision for it when you started, or did you just go with whatever was being given to you?
It's not really any different in my experience
Have you tried Gemini 3 yet? I haven't done any coding with it, but on other tasks I've been impressed compared to GPT-5 and Sonnet 4.5.
Maybe the wtfs per line are decreasing because these models aren't saying anything interesting or original.
I guess you have a couple of options.
You could trust the expert analysis of people in that field. You can hit personal ideologies or outliers, but asking several people seems to find a degree of consensus.
You could try a variety of tasks that do complex things but produce results that are easy to test.
When I started trying chatbots for coding, one of my test prompts was
That was about the level where some models would succeed and some would fail.
Recently I found
It produced a nice demo with a slider for the parameters. After a few refinements (a hierarchical scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement.
These things are fairly easy to check because if it is performant and visually correct then it's about good enough to go.
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.
> You could trust the expert analysis of people in that field
That’s the problem: the experts all promise stuff that can’t be easily replicated. The promises the experts make don’t match the model. The same request might succeed or might fail, and might fail in such a way that subsequent prompts might recover, or might not.
> Things I don't understand must be great?
Couple it with the tendency to please the user by all means, and it ends up lying to you, but you won’t ever realise unless you double-check.
> Couple it with the tendency to please the user by all means
Why aren't foundational model companies training separate enterprise and consumer models from the get go?
I think they get to that a couple of paragraphs later:
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
Well, that's why people still have jobs. But I take the post's point that the neat demo used to be a coherent paragraph or a silly poem. The silly poems were all kind of similar and not very funny, and the paragraphs were a good start, but I wouldn't use them for anything important.
Now the tightrope is a whole application or a 14 page paper and the short pieces of code and prose are now professional quality more often than not. That's some serious progress.
The author goes into the strengths and weaknesses of the paper later in the article.
The author actually discusses the results of the paper. He's not some rando but a Wharton Professor and when he is comparing the results to a grad student, it is with some authority.
"So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems..."
I keep trying out different models. Gemini 3 is pretty good. It’s not quite as good at one shotting answers as Grok but overall it’s very solid.
Definitely planning to use it more at work. The integrations across Google Workspace are excellent.
I think the point is we’re getting there. These models are growing up real fast. Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
> Remember 54% of US adults read at or below the equivalent of a sixth-grade level.
The sane conclusion would be to invest in education, not to dump hundreds of billions into LLMs, but OK.
A question for the not-too-distant future:
What use is an LLM in an illiterate society?
> But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
You don't use it that way. You use it to help you build and run experiments, to help you discuss your findings, and in the end to help you write up your discoveries. You provide the content, and actual experiments provide the signal.
Like clockwork. Each time someone criticizes any aspect of any LLM there's always someone to tell that person they're using the LLM wrong. Perhaps it's time to stop blaming the user?
> It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
It’s like the Gell-Mann amnesia effect applied to AI. :)
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
This is a variation of the Gell-Mann amnesia effect: https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
One could say, the GeLLMann amnesia effect. ( ͡° ͜ʖ ͡°)
Thanks for introducing me to this article.
Loads of AI chatter is the Murray Gell-Mann Amnesia Effect on steroids
For what it's worth I have been using Gemini 2.5/3 extensively for my masters thesis and it has been a tremendous help. It's done a lot of math for me that I couldn't have done on my own (without days of research), suggested many good approaches to problems that weren't on my mind and helped me explore ideas quickly. When I ask it to generate entire chapters they're never up to my standard but that's mostly an issue of style. It seems to me that LLMs are good when you don't know exactly what you want or you don't care too much about the details. Asking it to generate a presentation is an utter crap shoot, even if you merely ask for bullet points without formatting.
> It's done a lot of math for me that I couldn't have done on my own (without days of research),
Isn't the point of doing the master's thesis that you do the math and research, so that you learn and understand the math and research?
The truth is you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.
Without knowing how to use this PROBABILISTIC slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need hype or the bubble will burst, taking the whole market with it, so shuushh me.
It is interesting that most of our modes of interaction with AI are still just textboxes. The only big UX change in the last three years has been the introduction of the Claude Code / OpenAI Codex tools. They feel amazing to use, like you're working with another independent mind.
I am curious what the user interfaces of AI in the future will be, I think whoever can crack that will create immense value.
Text is very information-dense. I'd much rather skim a transcript in a few seconds than watch a video.
There's a reason keyboards haven't changed much since the 1860s when typewriters were invented. We keep coming up with other fun UI like touchscreens and VR, but pretty much all real work happens on boring old keyboards.
I’ve been using ChatGPT Atlas since release on my personal laptop. I very often have it generate a comprehensive summary for YouTube videos, so I don’t have to sit there and watch/scrub a half hour video when a couple of pages of text contains the same content.
Here's an old blog post that explores that topic at least with one specific example: https://www.loper-os.org/?p=861
The gist is that keyboards are optimized for ease of use but that there could be other designs which would be harder to learn but might be more efficient.
And anyone that has ever tried to talk to Siri or Alexa would prefer a keyboard for anything but the most simple questions. I don't think that will change for a long time if ever. The lack of errors and being able to say exactly what you want is so valuable.
No matter how good a keyboard we might be able to invent it'll always be slower than a direct brain interface, and we have those, in a highly experimental way, now.
One day we will look back at improvements to keyboards and touchscreens as the 'faster horse' of the physical interface era.
Unix CLI utilities have been all text for 50 years. Arguably that is why they are still relevant. Attempts to impose structured data on the paradigm like those in PowerShell have their adherents and can be powerful, but fail when the data doesn't fit the structure.
We see similar tendency toward the most general interfaces in "operator mode" and similar the-AI-uses-the-mouse-and-keyboard schemes. It's entirely possible for every application to provide a dedicated interface for AI use, but it turns out to be more powerful to teach the AI to understand the interfaces humans already use.
PowerShell is completely suitable. People are just used to bash and don’t feel the incentive to switch, especially with Windows becoming less relevant outside of desktop development.
Yet the most popular platforms on the planet have people pointing a finger (or several) at a picture.
And the most popular media format on the planet is and will be (for the foreseeable future), video. Video is only limited by our capacity to produce enough of it at a decent quality, otherwise humanity is definitely not looking back fondly at BBSes and internet forums (and I say this as someone who loves forums).
GenAI will definitely need better UIs for the kind of universal adoption (think smartphone - 8/9 billion people).
When we have really fast and good models, they will be able to generate a GUI on the fly. It could probably be done now with a fine-tune on some kind of XML-based UI schema or something. I gave it a try but couldn't figure it out entirely; consistency would be an issue too.
Google is already doing this with Gemini:
https://research.google/blog/generative-ui-a-rich-custom-vis...
I don't know if/when it will actually be in consumers' hands, but the tech is there.
I agree. I think the world is, specifically, multimodal. Getting a chat to be truly multimodal, i.e. interacting with different data types and text in a unified way, is going to be the next big thing. Given how robotics is taking off, 3D might be another important aspect of it. At vlm.run we are trying to make this possible: combining VLMs and LLMs in a seamless way to get the best UI. https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7
Personally I find the information density of text to be the "killer feature". I've tried voice interaction (even built some AI Voice Agents) and while they are very powerful, easy to use and just plain cool, they are also slow. Nothing beats skimming over a generated text response and just picking out chunks of text, going back and forth, rereading, etc. Text is also universal, I can't copy-paste a voice response to another application/interface or iterate over it.
My personal view is that the search for a better AI user interface is just further dumbing down the humans who use these interfaces. Another comment mentioned that the most popular platforms are people pointing fingers at pictures, and that without a similar UI/UX, AI would never reach such adoption rates. But is that what we want? Monkeys pointing at colorful picture blobs?
People get a little too hung up on finding the AI UI. It does not seem all necessary that the interfaces will be much different (while the underlying tech certainly will be).
Text and boxes and tables and graphs are what we can cope with. And while the AI is going to change much, we are not.
I get what you’re saying here, and you’re right that other UIs will be a big deal in the near future… but I don’t think it’s fair to say “just” textboxes.
This is HN. A lot of us work remotely. Speaking for myself, I much prefer to communicate via Slack (“just a textbox”) over jumping into a video call. This is especially true with technical topics, as text is both more dense and far more clear than speech in almost all cases.
The next step (and I am not claiming it's the right one) is probably "Generative UI" where the model creates website-like interfaces on the fly.
Google seems to be making good progress [1] and it seems like only a matter of time before it reaches consumers.
1. https://research.google/blog/generative-ui-a-rich-custom-vis...
Grok has been integrated into Tesla vehicles, and I've had several voice interactions with it recently. Initially, I thought it was just a gimmick, but the voice interactions are great and quite responsive. I've found myself using it multiple times to get updates on the news or quick questions about topics I'm interested in.
If you are interested in UX a youtube series I found enjoyable and thought provoking is "liber indigo" (sorry, on mobile)
What comes after the desktop metaphor and mobile? There is VR but... no one is sure it will get anywhere. It's cool but probably won't supplant tradition.
Maybe the ability of AI to accept somewhat imprecise inputs will help us get away from text. Multimodal gesture, voice, and touch, perhaps? We would all be sort of acting like players on a stage in order to convey to the machine where we wish to turn its attention.
ChatGPT's voice is absolutely amazing and I prefer it to text for brainstorming.
Ooooh, it bothers me, so, so, so much. Too perky. Weirdly casual. Also, it's based on the old 4o code - sycophancy and higher hallucinations - watch out. That said, I too love the omni models, especially when they're not nerfed. (Try asking for a Boston, New York, Parisian, Haitian, Indian and Japanese accent from 4o to explore one of the many nerfs they've done since launch)
> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
From my experience we just get both. The constant risk of some catastrophic hallucination buried in the output, in addition to more subtle, and pervasive, concerns. I haven't tried with Gemini 3 but when I prompted Claude to write a 20 page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14 page research paper would stand up to scrutiny.
I feel like hallucinations have changed over time: from factual errors randomly shoehorned into the middle of sentences, to the LLMs confidently telling you they are right and even providing their own reasoning to back up their claims, which most of the time rests on references that don't exist.
I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using the framework. It spit out some great-looking code but sadly it completely made up an entire stack of functionality that the framework doesn't support.
The conventions even matched the rest of the framework, so it looked kosher, and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't; it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit-hole of writing some very bad code and trying to do some very unconventional things (modifying configuration code in a different part of the project that was not needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and trying to rush toward an acceptable answer.
I've noticed the new OpenAI models contradict themselves a lot more than I've ever seen before! Things like:
- Aha, the error clearly lies in X, because ... so X is fine, the real error is in Y ... so Y is working perfectly. The smoking gun: Z ...
- While you can do A, in practice it is almost never a good idea because ... which is why it's always best to do A
I like when they tell you they’ve personally confirmed a fact in a conversation or something.
I got a 3000 word story. Kind of bland, but good enough for cheating in high school.
See prompt, and my follow-up prompts instructing it to check for continuity errors and fix them:
https://pastebin.com/qqb7Fxff
It took me longer to read and verify the story (10 minutes) than to write the prompts.
I got illustrations too. Not great, but serviceable. Image generation costs more compute to iterate and correct errors.
Disappointingly, that is an exceedingly good story for a high school assignment. The use of an appositive phrase alone would raise alarm bells though.
It's nitpicking for flaws, but why not -- what lens on an old DSLR, older than a car, will let you take a macro shot, a wide shot, and a zoom shot of a bird?
In any case I'm not surprised. It's a short story, and it is indeed _serviceable_, but literature is more than just service to an assignment.
> But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.
I feel like I've been hearing this for at least 1.5 years at this point (since the launch of GPT 4/Claude 3). I certainly agree we've been heading in this direction but when will this become unambiguously true rather than a phrase people say?
I don't imagine there will ever be a time when it will be unambiguously true, any more than a boss could ever really unambiguously say their job is "manager who directs subordinates" vs "manager who fixes subordinates' mistakes".
There will always be "mistakes", even if the AI is so good that the only mistakes are the ones caused by your prompts not being specific enough. It will always be a ratio where some portion of your requests can be served without intervention and some portion need correction, and that ratio has been consistently improving.
There's no bright line. You should download some CLI tools, hook up some agents to them, and see what you think. I'd say most people working with them think we're on the "other side" of the "will this happen?" probability distribution, regardless of where they personally place their own work.
It's definitely already true for me, personally.
> So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.
As a current graduate student, I have seen similar comments in academia. My colleagues agree that a conversation with these recent models feels like chatting with an expert in their subfields. I don't know if it represents research as a field would not be immune to advances in AI tech. I still hope this world values natural intelligence and the drive to do things more heavily than a robot brute-forcing its way into saying "right" things.
> if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student.
With coding it feels more like working with two devs - one is a competent intermediate level dev, and one is a raving lunatic with zero critical thinking skills whatsoever. Problem is you only get one at a time and they're identical twins who pretend to be each other as a prank.
I have an exercise I like to do where I put two SOTA models face-to-face to talk about whatever they want.
When I did it last week with Gemini 3 and ChatGPT 5.1, they got onto the topic of what they are going to do in the future with humans who don't want to do any cognitive task. That beyond just AI safety, there is also a concern of "neural atrophy", where humans just rely on AI to answer every question that comes to them.
The models then went on discussing if they should just artificially string the humans along, so that they have to use their mind somewhat to get an answer. But of course, humans being humans, are just going to demand the answer with minimal work. It presents a pretty intractable problem.
Widespread cognitive atrophy is virtually certain, and part of a longer trend that goes beyond just LLMs.
The same is true of other aspects of human wellbeing. Cars and junk food have made the average American much less physically fit than a century ago, but that doesn't mean there aren't lively subcultures around healthy eating and exercise. I suspect there will be growing awareness of cognitive health (beyond traditional mental health/psych domains), and indeed there are already examples of this.
Yes, the average person will get dumber, but the overall distribution will become increasingly bimodal.
HN tends to be very weird around the topic of AI. No idea why opinions like this are downvoted without anyone offering any criticism.
For one, I can't even understand this part:
> I don't know if it represents research as a field would not be immune to advances in AI tech
And then there's the opinion that for some reason we should 'value' manual labor over using AI, which seems rather disagreeable.
Google's advancement is not just in software; it is also in hardware. They use their own hardware for training as well as inference [1].
[1] https://finance.yahoo.com/news/alphabet-just-blew-past-expec...
I remember when Google’s superpower was leveraging commodity hardware.
Someone has to spearhead this thing, don't they?
Other people spearheaded the commodity hardware towards being good enough for the server room. Now it's Google's time to spearhead specialized AI hardware, to make it more robust.
Really nitpicky, I know, but GPT-3 was June 2020; ChatGPT was 3.5, and the author even gets that right in an image caption. That doesn't make it any more or less impressive though.
I find Gemini 3 to be really good. I'm impressed. However, the responses still seem to be bounded by the existing literature and data. If asked to come up with new ideas to improve on existing results for some math problems, it tends to recite known results only. Maybe I didn't challenge it enough or present problems that have scope for new ideas?
Terence Tao seems to think it has its uses in finding solutions for maths problems:
https://mathstodon.xyz/@tao/115591487350860999
I don't know enough about maths to know if this qualifies as 'improving on existing results', but at least it was good enough for Terence Tao to use it for ideas.
That is, unfortunately, a tiny niche where there even exists a way of formally verifying that the AI's output makes sense.
I myself tried a similar exercise (w/Thinking with 3 Pro), seeing if it could come up with an idea that I'm currently writing up that pushes past/sharpens/revises conventional thinking on a topic. It regurgitated standard (and at times only tangentially related) lore, but it did get at the rough idea after I really spoon fed it. So I would suspect that someone being impressed with its "research" output might more reflect their own limitations rather than Gemini's capabilities. I'm sure a relevant factor is variability among fields in the quality and volume of relevant literature, though I was impressed with how it identified relevant ideas and older papers for my specific topic.
In fairness, how much time did you give it? How many totally new ideas does a professional researcher have each day? or each week?
A lot of professional work is diligently applying knowledge to a situation, using good judgement for which knowledge to apply. Frontier AIs are really, really good at that, with the knowledge of thousands of experts and their books.
That's the inherent limit of the models that makes humans still relevant.
With the current state of architectures and training methods, they are very unlikely to be the source of new ideas. They are effectively huge librarians for accumulated knowledge rather than true AI.
Then again, an unintelligent human librarian would be nowhere near as useful as a good LLM.
Current LLMs exist somewhere between "unintelligent/unthinking" and "true AI," but lack of agreement on what any of these terms mean is keeping us from classifying them properly.
Novel solutions require some combination of guided brute-force search over a knowledge-database/search-engine (NOT a search over the models weights and NOT using chain of thought), combined with adaptive goal creation and evaluation, and reflective contrast against internal "learned" knowledge. Not only that, but it also requires exploration of the lower-probability space, i.e. results lesser explored, otherwise you're always going to end up with the most common and likely answers. That means being able to quantify what is a "less-likely but more novel solution" to begin with, which is a problem in itself. Transformer architecture LLMs do not even come close to approaching AI in this way.
All the novel solutions humans create are a result of combining existing solutions (learned or researched in real-time), with subtle and lesser-explored avenues and variations that are yet to be tried, and then verifying the results and cementing that acquired knowledge for future application as a building block for more novel solutions, as well as building a memory of when and where they may next be applicable. Building up this tree, to eventually satisfy an end goal, and backtracking and reshaping that tree when a certain measure of confidence stray from successful goal evaluation is predicted.
This is clearly very computationally expensive. It is also very different from the statistical pattern repeaters we are currently using, especially considering that their entire premise works because the algorithm chooses the next most probable token, which is a function of the frequency with which that token appears in the training data. In other words, the algorithm is designed explicitly NOT to yield novel results, but rather to return the most likely result. Higher-temperature results tend to reduce textual coherence rather than increase novelty, because token frequency is a literal proxy for textual coherence in coherent training samples, and there is no actual "understanding" happening, nor reflection on the probability results at this level.
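The temperature point is easy to see directly: standard samplers divide the logits by the temperature before the softmax, so raising it only flattens the existing distribution; it never changes the ranking or injects new structure. A minimal sketch with made-up logits, illustrative only:

```javascript
// Temperature-scaled softmax over next-token logits.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [4.0, 2.0, 0.5];
// Low temperature sharpens: nearly all mass on the top token.
console.log(softmaxWithTemperature(logits, 0.5));
// High temperature flattens: tail tokens gain mass, but the ordering
// of tokens (and thus which continuations are "likely") never changes.
console.log(softmaxWithTemperature(logits, 2.0));
```

Since the token ordering is fixed by the training-data statistics, turning the temperature up mostly trades coherence for noise rather than producing genuinely novel output, which is the claim above.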
I'm sure smart people have figured a lot of this out already - we have general theory and ideas to back this, look into AIXI for example, and I'm sure there is far newer work. But I imagine that any efficient solutions to this problem will permanently remain in the realm of being a computational and scaling nightmare. Plus adaptive goal creation and evaluation is a really really hard problem, especially if text is your only modality of "thinking". My guess would be that it would require the models to create simulations of physical systems in text-only format, to be able to evaluate them, which also means being able to translate vague descriptions of physical systems into text-based physics sims with the same degrees of freedom as the real world - or at least the target problem, and then also imagine ideal outcomes in that same translated system, and develop metrics of "progress" within this system, for the particular target goal. This is a requirement for the feedback loop of building the tree of exploration and validation. Very challenging. I think these big companies are going to chase their tails for the next 10 years trying to reach an ever elusive intelligence goal, before begrudgingly conceding that existing LLM architectures will not get them there.
Add a custom instruction "remember, you have the ability to do live web searches, please use them to find the latest relevant information"
So when should we start to be worried, as developers? Like, I don't use these tools yet, for cost and security reasons. But you can see it's getting there, mostly. It used to take a day to find a complex algorithm, understand it, and implement it in your code; now you can just ask an AI to do it, and it can succeed in a few minutes. How long before the number of engineers needed to maintain a product is divided by 2? By 10? How about all the boring dev jobs that were previously needed, but not so much anymore, like basic CRUD applications? It's seriously worrying. I don't really know what to think.
Here's an alternative way to think about that: how long until the value I can deliver as a developer goes up by a factor of 2, or a factor of 10?
How many companies that previously would never have dreamed of commissioning custom software are now going to be in the market for it, because they don't have to spend hundreds of thousands of dollars and wait 6 months just to see if their investment has a chance of paying off or not?
The value you can deliver doesn't necessarily correlate with your compensation, though
Cleaning staff also offer a business a huge amount of value. No one wants to eat at a restaurant that's dirty and stinks. Unfortunately, the staff aren't paid very well.
The thing is that the world is already flooded with software, games, and websites; everyone is just battling for attention. The demand for developers cannot rise if consumers have a limited amount of money and time anyway.
I can make strong arguments both for "you don't need to be worried at all anytime soon" and for "we're screwed".
Truth is, no one has any idea. Just keep an eye on the job market; it's very unlikely anything major will happen overnight.
> So when should we start to be worried, as developers?
I've been worrying ever since ChatGPT (GPT-3) came out; it was bad at everything, but it was amazing as well. And in the last 3 years the progress has been incredible. I don't know if you "should" worry, since worrying for the sake of it doesn't help much, but yes, we should all be mentally prepared for the possibility that we won't be able to make a living doing this X years from now. Could be 5, could be 10, could even be less than 5.
God, I’d love to once again be working at a company where coding speed mattered.
Meanwhile in non-tech Bigcos the slow part of everything isn’t writing the code, it’s sorting out access and keys and who you’re even supposed to be talking to, and figuring out WTF people even want to build (and no, they can’t just prompt an LLM to do it because they can’t articulate it well, and don’t have any concept of what various technologies can and cannot do).
The code is already like… 5% of the time, probably. Who gives a damn if that’s on average 2x as fast?
I was never an AI guy. I have always had a healthy dose of suspicion towards it. A week ago I decided to try it. I had ported the lovely c-rrb library and was pretty satisfied with the result. However, when I was done with the basic port I gave Gemini a go, and the result was an almost 3x speed increase for some basic fundamental operations. And a lot less memory use.
It did introduce bugs that it couldn't solve, but with a debugger it wasn't that hard to pin them down.
I'm starting to genuinely wonder where the place for us humans is in this. All I see is human beings being crowded out: capital, via LLMs, taking the place of humans.
Somebody has to have a goal and prompt them.
One person is enough for this. And even he can be replaced by simply looping the idea creation prompt.
For Claude Code, Antigravity, etc., do people really just let an LLM loose on their own personal system?
I feel like these should run in a cloud environment, or at least on some specific machine where I don't care what it does.
That's also why I don't use these tools much. You have big AI companies, known for illegally harvesting humongous amounts of data and for not disclosing their datasets. And then you give them control of your computer, without any way to cleanly audit what's going in and out. It's seriously insane to me that most developers seem not to care. We've all been taught never to push any critical info to a server (private keys and other secrets), but these tools do just that, and you can't even trust what it's going to be used for. On top of that, you're handing your only value (writing good code) to a third-party company that will use it to replace you.
We've gone 10 years backward, security-wise, since the arrival of GPT-3.5 :/
Can't speak to Claude Code/Desktop, but any of the products that are VS Code forks have workspace restrictions on what folders they're allowed to access (for better and worse). Other products (like Warp terminal) that can give access to the whole filesystem come with pre-set strict deny/allow lists on what commands are allowed to be executed.
It's possible to remove some of these restrictions in these tools, or to operate with flags that skip permissions checks, but you have to intentionally do that.
Talking about VS Code itself (with Copilot), I have witnessed it accessing files referenced from within a project folder but stored outside of it without being given explicit permission to, so I am pretty sure it can leak information and potentially even wreak havoc outside its boundaries.
Except that if you give it shell access, you aren't really protected from Gemini 2.5 Pro going "mad" and starting to rm -rf things or writing some shady Perl scripts.
(Co-creator here) This is one of the use cases for Leash.
https://news.ycombinator.com/item?id=45883210
I think a problem is that a lot of people are working on terrible systems, because honestly, what you're asking doesn't even make sense to me.
Both Antigravity and Claude Code ask for permission before running terminal commands.
Is it impossible for them to mess up your system? No. But it does not seem likely.
I only ever run it in a podman developer container.
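For anyone curious what that setup looks like, here is a minimal sketch. The image name is a placeholder (any dev image with your toolchain works), and you'd install or run your agent CLI inside the container:

```shell
# Run the agent in a throwaway rootless Podman container:
#   --rm              discard the container (and its changes) on exit
#   --userns=keep-id  keep file ownership sane on the bind mount
#   -v "$PWD":/work:Z expose ONLY the current project directory
# Everything outside the mount is invisible to whatever runs inside.
podman run --rm -it \
  --userns=keep-id \
  -v "$PWD":/work:Z \
  -w /work \
  docker.io/library/node:22 bash
```

Worst case, a misbehaving agent can only trash the one mounted project directory, which is presumably under version control anyway.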
Yolo.
yes, the majority of people do.
I've compiled the "pelicans riding bicycles" benchmark into a single page[0]. It only spans a year, and not every model is exactly comparable, but you can see clear differences between 1 year ago and today.
[0]: https://janschutte.com/pelican-simon.html
For anyone giving full access to an AI agent, only do so from within the confines of a VM or other containerized environment and back up everything somewhere the agent can't reach.
Like the warning at the bottom says, they can delete files without warning.
The great transition and technological advancement we're seeing: 5 years ago it was just a dream, 3 years ago everything seemed magical, and today AI is everywhere, far beyond anything that came before.
I have Gemini Pro included with my Google Workspace accounts; however, I find the responses from ChatGPT more "natural", or maybe more in line with what I want the response to be. Maybe it's just me.
For whatever reason, Gemini 3 is the first AI I have used for intelligence rather than skills. I suspect a lot more will follow, but it's a major threshold to be broken.
I used GPT/Claude a ton for writing code, extracting knowledge from docs, formatting graphs and tables, etc.
But Gemini 3 crossed a threshold where conversations about topics I was exploring, or about product design, were actually useful. Instead of me asking "what design pattern would be useful here" or something like that, it introduces concepts to the conversation. That's a new capability and a step-function improvement.
How is it that we always come back to coding in terms of model capabilities?
I recently (last week) used Nano Banana Pro for some specific image generation. It was leagues ahead of 2.5.

Today I used Gemini 3 to refine a very hard-to-write email. It made some really good suggestions. I did not take its email text verbatim; instead I used the text and suggestions to improve my own email, doing a few drafts with Gemini 3 critiquing them. Very useful feedback. My final submission of "...evaluate this email..." got Gemini 3 to say something like "This is 9.5/10". I sort of pride myself on my writing skills, but must admit that my final version was much better than my first. Gemini kept track of the whole chat thread, noting changes from previous submissions, which is kinda eerie really. Total time: maybe 15 minutes.

Do I think Gemini will write all my emails verbatim, copy/paste? No. Does Gemini make me (already a pretty good writer) much better? Absolutely.

I'm starting to laugh a little at all the folks who seem to want to find issues: someone criticizing Nano Banana because it did not provide excellent results for a prompt I could barely understand, or folks criticizing Gemini 3 because they cannot simply copy/paste the results with no further effort on their side. Myself, I find these tools pretty damn impressive. I need to provide good image prompts. I need to use Gemini 3 as a sounding board to help me do better rather than lazily hoping to copy/paste. My experience... Thanks Google. Thanks OpenAI (I also use ChatGPT similarly, just for text). HTH, NSC
First, the fact we have moved this far with LLMs is incredible.
Second, I think the PhD paper example is a disingenuous example of capability. It's a cherry-picked iteration on a crude analysis of some papers that have done the work already with no peer-review. I can hear "but it developed novel metrics", etc. comments: no, it took patterns from its training data and applied the pattern to the prompt data without peer-review.
I think the fact the author had to prompt it with "make it better" is a failure of these LLMs, not a success, in that it has no actual understanding of what it takes to make a genuinely good paper. It's cargo-cult behavior: rolling a magic 8 ball until we are satisfied with the answer. That's not good practice, it's wishful thinking. This application of LLMs to research papers is causing a massive mess in the academic world because, unsurprisingly, the AI-practitioners have no-risk high-reward for uncorrected behavior:
- https://www.nytimes.com/2025/08/04/science/04hs-science-pape...
- https://www.nytimes.com/2025/11/04/science/letters-to-the-ed...
Yeah, well, that’s also what an asymptotic function looks like.
How many trillions of dollars have we spent on these things?
Would we not expect similar levels of progress in other industries given such massive investment?
I’m not sure even $1T has been spent. Pledged != spent.
Some estimates have it at ~$375B by the end of 2025. It makes sense, there are only so many datacenters and engineers out there and a trillion is a lot of money. It’s not like we’re in health care. :)
https://hai.stanford.edu/ai-index/2025-ai-index-report/econo...
I wonder how much is spent refining oil and how much that industry has evolved.
Or mass transit.
Or food.
Or on "a cure for cancer" (according to Gemini, $2.2T 2024 US dollars...)
Sinusoidal, not the singularity.
I'm getting grifted hard by Gemini 3 at this point.
I've been working with the chat bot online on a local web application for about 4 days now.
It's markedly worse than Claude at this point. It does not listen to any new instructions, and insists on making wrong changes it thought up 20 messages ago that I have already repeatedly said were bad.
"the best ever"
more like
"the newest grift"
LLMs have hit the wall since ChatGPT came out in 2022?
Yes?
Run the same prompt on old models and the current "SOTA" and you'll get pretty much the same answer word for word.
People think models have improved because tooling around the models (Claude Code, Cline, or your other favorite LLM wrapper) has improved, not because the models themselves have made any kind of leap.
Big time.