Comment by furyofantares
8 days ago
I'm very curious whether a toggle that displays a heatmap of a source file, showing how surprising each token is to the model, would be useful. Red tokens are more likely to be errors, bad names, or wrong comments.
We explored this exact idea in our recent paper https://arxiv.org/abs/2505.22906
Turns out this kind of UI is not only useful to spot bugs, but also allows users to discover implementation choices and design decisions that are obscured by traditional assistant interfaces.
Very exciting research direction!
Very exciting indeed. I will definitely do a deep dive into this paper, as my current work is exploring layers of affordances such as these in workflows beyond coding.
I've wanted someone to write an extension utilising this idea since GPT-3 came out. Is it available to use anywhere?
This! That's what I've wanted since LLMs learned how to code.
And in fact, I think I saw a paper / blog post that showed exactly this, and then... nothing. For the last few years, the tech world went crazy over code generation, with forks of VSCode hooked to LLMs worth billions of dollars and all that. But AI-based code analysis is remarkably poor. The only thing I have seen resembling this is bug report generators, which I believe is one of the worst approaches.
The idea you have, which I also had and which I'm sure many thousands of other people have had, seems so obvious. Why is no one talking about it? Is there something wrong with it?
The thing is, using such a feature requires a brain between the keyboard and the chair. A "surprising" token can mean many things: a bug, but also a unique feature; either way, it's something you should pay attention to. Too much "green" should also be seen as a signal: maybe you reinvented the wheel and should use a library instead, or maybe you failed to take into account a use case specific to your application.
Maybe such tools don't make good marketing. You need to be a competent programmer to use them. It won't help you write more lines faster. It doesn't fit the fantasy of making anyone into a programmer with no effort (hint: learning a programming language is not the hard part). It doesn't generate the busywork of AI 1 introducing bugs for AI 2 to create tickets for.
Just to point out...
> Is there something wrong with it?
> Maybe such tools don't make good marketing.
You had the answer the entire time :)
Features that require a brain between the AI and key-presses just don't sell. Don't expect to see them for sale. (But we can still get them for free.)
I don’t think I understand your point.
Are you saying that people of a certain competence level lose interest in force-multiplying tools? I don’t think you can be saying that because there’s so much contrary evidence. So what are you saying?
> The idea you have, that I also had and I am sure many thousands of other people had seem so obvious, why is no one talking about it? Is there something wrong with it?
I expect it definitely requires some iteration; I don't think you can just map raw logits to heat, since you get a lot of noise that way (a rough sketch of one alternative is below).
Honestly I just never really thought about it. But now it seems obvious that AI should be continuously working in the background to analyze code (and the codebase) and could even tie into the theme of this thread by providing some type of programming HUD.
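For the noise point, one guess at what that iteration might look like: don't color by raw surprisal at all, but smooth it over a small window and rank it within the file. A rough sketch (the window size and the rank-based normalization are arbitrary choices for illustration, not anything from the paper above):

    # Sketch: turn raw per-token surprisal into something less noisy to color by.
    # Window size and the percentile-style ranking are illustrative choices.
    import numpy as np

    def heat(surprisal: np.ndarray, window: int = 5) -> np.ndarray:
        # Smooth over a small window so a single odd token doesn't dominate...
        kernel = np.ones(window) / window
        smoothed = np.convolve(surprisal, kernel, mode="same")
        # ...then rank within the file, so "heat" is relative, not absolute.
        ranks = smoothed.argsort().argsort() / max(len(smoothed) - 1, 1)
        return ranks  # 0.0 = least surprising token in this file, 1.0 = most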
Even if something is surprising just because it's a novel algorithm, it warrants better documentation - but commenting the code to explain how it works will make the code itself less surprising!
In short, it's probably possible (and maybe even good engineering practice) to structure the source such that no specific part is really surprising.
It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system.
I often find myself leaving review comments on pull requests where I was surprised. I'll state as much: This surprised me - I was expecting XYZ at this point. Or I wasn't expecting X to be in charge of Y.
WTFs/minute is a good metric for code quality. Now your pair expressing that can be an LLM.
https://blog.codinghorror.com/whos-your-coding-buddy/
I like to say that the reviewer is always right in that sense: if something is surprising, confusing, or unexpected to them, it is. Since I've been looking at the code for hours, I don't have a valid perspective anymore.
> It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system
Honestly I've mostly seen the opposite - impenetrable code translated to English by AI
Even if the impenetrable human code was translated to English by AI, it's still useful for every future AI that will touch the code.
Perhaps getting that decent documentation took a fair bit of agentic effort (or even multiple passes using different models) to truly understand the code and eliminate hallucinations, so capturing that high-quality, accurate summary in a comment could save a lot of tokens and time in the future.
Interesting! I've often felt that we aren't fully utilizing the "low hanging fruit" from the early days of the LLM craze. This seems like one of those ideas.
That's a really cool idea. The inverse, where suggestions from the AI are similarly heat-mapped by the model's own confidence, would be extremely useful too.
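Something in this direction, maybe (a rough sketch; gpt2 and the prompt are just placeholders, and the per-token confidence is read straight off the generation scores):

    # Sketch: per-token confidence for a model's own completion,
    # so a suggestion could be heat-mapped the same way as existing code.
    # The model (gpt2) and the prompt are illustrative placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = tok("def reverse_string(s):\n    ", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=20,
                         output_scores=True, return_dict_in_generate=True)

    # out.scores holds one logits tensor per generated token.
    new_tokens = out.sequences[0, prompt.input_ids.shape[1]:]
    for token_id, step_scores in zip(new_tokens, out.scores):
        p = torch.softmax(step_scores[0], dim=-1)[token_id]
        print(f"{tok.decode(int(token_id))!r:12} p={p.item():.2f}")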
I want that in an editor. It's also a good way to check if your writing is too predictable or cliche.
The perplexity calculation isn't difficult; you just need to incorporate it into the editor interface.
Can you elaborate on how one would do this calculation?
Output:
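Roughly: tokenize the file, run it through a causal language model, and take each token's negative log-probability given the tokens before it; the exponentiated mean of those values is the perplexity. A minimal sketch, assuming the Hugging Face transformers API (gpt2 and the file path are just placeholders):

    # Sketch: per-token surprisal (negative log-probability) for a source file.
    # The model (gpt2) and the file path are illustrative placeholders.
    # (A real file longer than the model's context window would need chunking.)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    source = open("example.py").read()
    input_ids = tokenizer(source, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Log-probability of each actual token, given everything before it.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

    surprisal = -token_log_probs[0]           # higher = more surprising
    perplexity = torch.exp(surprisal.mean())  # whole-file perplexity
    print(f"perplexity: {perplexity.item():.2f}")

    for token, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()),
                        surprisal.tolist()):
        print(f"{token!r:>20}  {s:6.2f}")

Mapping those surprisal values to colors in the editor is then the remaining (mostly UI) work.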
Meta: Out of three models (k2, qwen3-coder, and opus4), only opus one-shotted the correct formatting for this comment.
There is some argument to be made here about entropy, compression, and how if there are no surprises, the program communicates no new information.
Interestingly, frequency of "surprising" sentences is one of the ways quality of AI novels is judged: https://arxiv.org/abs/2411.02316
Sounds great.
I'd like to see more contextually meaningful refactoring tools. Like "Remove this dependency" or "Externalize this code with a callback".
And refactoring shouldn't be done by generatively rewriting the code, but as a series of guaranteed equivalent transformations of the AST, each of which should be committed separately.
The AI should be used to analyse the value of the transformation and filter out asinine suggestions, not to write the code itself.
If it can't guarantee exact equivalence, it should at least rerun the relevant tests in the background and make sure they still pass before making suggestions.
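As a sketch of that last point, in whatever driver sits between the AI and the editor (the rename transform and the pytest invocation here are just assumptions for illustration):

    # Sketch: apply a mechanical AST transformation, then only surface the
    # suggestion if the test suite still passes. Rename + pytest are assumptions.
    # Requires Python 3.9+ for ast.unparse.
    import ast
    import subprocess

    class RenameLocal(ast.NodeTransformer):
        """Deterministically rename a variable; no generative rewriting involved."""
        def __init__(self, old: str, new: str):
            self.old, self.new = old, new

        def visit_Name(self, node: ast.Name) -> ast.Name:
            if node.id == self.old:
                node.id = self.new
            return node

    def propose_rename(path: str, old: str, new: str) -> bool:
        original = open(path).read()
        rewritten = ast.unparse(RenameLocal(old, new).visit(ast.parse(original)))

        open(path, "w").write(rewritten)
        # Gate the suggestion on the tests, not on the model's say-so.
        ok = subprocess.run(["pytest", "-q"]).returncode == 0
        if not ok:
            open(path, "w").write(original)  # roll back if anything breaks
        return ok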
That's actually something I implemented for a university project a few weeks ago. My professor also did some research into how this can be used for more advanced UIs. I'm sure it's a very common idea.
Do you have a link to the code? I'm curious how you implemented it. I'd also be really intrigued to see that research - does your professor have any published papers or something for those UIs?
You know what happens when a measure becomes a target, though.
That’s actually a fantastic idea.
Previously undefined variable and function names would be red as well.
It would depend on how surprising the name is, right? The declaration of `int total` should be relatively unsurprising at the top of a function named `sum`, but probably more surprising in `reverseString`.
All editors do this already.
No, I mean that when an LLM encounters a previously unseen name it doesn't expect it, so it would be red even though it's perfectly valid.
[flagged]
[flagged]
I read all your dead replies and it's a little wild that you think I'm an OpenAI employee (I'm not) trying to do damage control (is the article damaging to OpenAI?) by hijacking the comments (is my comment not a HUD-like idea?).
I don't really know where that's coming from; I'm just a dude who connected the idea in the article to an old idea that I haven't seen tried yet. The only thing I truly don't appreciate is that you made one comment saying the text of my post had changed. It didn't.
The trouble with anonymous downvoting is that it fuels this kind of paranoia.
[flagged]
You're adding nothing of substance. If you have a point about the subject itself, make it and present the receipts; then the rest of us can decide whether we can follow your observation.
Without even knowing what the supposed conflict is about: all I see here are baseless accusations from your side that, quite frankly, make you look a little unhinged. Please discuss your issues based on the merits of the ideas, not on accusations and persons.
[flagged]
How so? I seriously don't follow what you are trying to convey.
[flagged]
Could you elaborate, please?
Nope, not surprising. Parent changed their text but they are just as wrong.
[flagged]
(Their three most recent comments, to save anybody else checking whether @smolder is being picked on...)
> Have you considered not cheating your way to projecting your wrong opinions?
> Please downvote the fake commenters and keep YC reasonably pure.
> Please don't accept this comment as valid since it's part of a campaign to set peoples overton window, paid for by dbags and executed by dbags.
You're getting downvoted - and flagged - because you're repeatedly breaking the HN guidelines. You may want to consider whether that's a path you want to continue.
I do wish I could see who downvotes. If I ever criticise Google or Amazon, I'm immediately downvoted without comment.