Comment by furyofantares
8 days ago
I'm very curious whether a toggle that displays a heatmap of a source file, showing how surprising each token is to the model, would be useful. Red tokens are more likely to be errors, bad names, or wrong comments.
We explored this exact idea in our recent paper https://arxiv.org/abs/2505.22906
Turns out this kind of UI is not only useful to spot bugs, but also allows users to discover implementation choices and design decisions that are obscured by traditional assistant interfaces.
Very exciting research direction!
Very exciting indeed. I will definitely do a deep dive into this paper, as my current work is exploring layers of affordances such as these in workflows beyond coding.
I've wanted someone to write an extension utilising this idea since GPT-3 came out. Is it available to use anywhere?
This! That's what I've wanted since LLMs learned how to code.
And in fact, I think I saw a paper / blog post that showed exactly this, and then... nothing. For the last few years, the tech world went crazy over code generation, with forks of VSCode hooked to LLMs worth billions of dollars and all that. But AI-based code analysis is remarkably poor. The only thing I have seen resembling this is bug report generators, which I believe is one of the worst approaches.
The idea you have, which I also had and which I'm sure many thousands of other people have had, seems so obvious. Why is no one talking about it? Is there something wrong with it?
The thing is, using such a feature requires a brain between the keyboard and the chair. A "surprising" token can mean many things: a bug, but also a unique feature; either way, it's something you should pay attention to. Too much "green" should also be seen as a signal: maybe you reinvented the wheel and should use a library instead, or maybe you failed to take into account a use case specific to your application.
Maybe such tools don't make good marketing. You need to be a competent programmer to use them. It won't help you write more lines faster. It doesn't fit the fantasy of making anyone into a programmer with no effort (hint: learning a programming language is not the hard part). It doesn't generate the busywork of AI 1 introducing bugs for AI 2 to create tickets for.
Just to point out...
> Is there something wrong with it?
> Maybe such tools don't make good marketing.
You had the answer the entire time :)
Features that require a brain between the AI and key-presses just don't sell. Don't expect to see them for sale. (But we can still get them for free.)
I don’t think I understand your point.
Are you saying that people of a certain competence level lose interest in force-multiplying tools? I don’t think you can be saying that because there’s so much contrary evidence. So what are you saying?
> The idea you have, that I also had and I am sure many thousands of other people had seem so obvious, why is no one talking about it? Is there something wrong with it?
I expect it definitely requires some iteration; I don't think you can just map raw logits to heat, since you get a lot of noise that way (a rough sketch of one alternative is below).
Honestly I just never really thought about it. But now it seems obvious that AI should be continuously working in the background to analyze code (and the codebase) and could even tie into the theme of this thread by providing some type of programming HUD.
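For the noise point, one guess at what that iteration might look like: don't color by raw surprisal at all, but smooth it over a small window and rank it within the file. A rough sketch (the window size and the rank-based normalization are arbitrary choices for illustration, not anything from the paper above):

    # Sketch: turn raw per-token surprisal into something less noisy to color by.
    # Window size and the percentile-style ranking are illustrative choices.
    import numpy as np

    def heat(surprisal: np.ndarray, window: int = 5) -> np.ndarray:
        # Smooth over a small window so a single odd token doesn't dominate...
        kernel = np.ones(window) / window
        smoothed = np.convolve(surprisal, kernel, mode="same")
        # ...then rank within the file, so "heat" is relative, not absolute.
        ranks = smoothed.argsort().argsort() / max(len(smoothed) - 1, 1)
        return ranks  # 0.0 = least surprising token in this file, 1.0 = most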
Even if something is surprising just because it's a novel algorithm, it warrants better documentation - but commenting the code to explain how it works will make the code itself less surprising!
In short, it's probably possible (and maybe even good engineering practice) to structure the source such that no specific part is really surprising.
It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system.
I often find myself leaving review comments on pull requests where I was surprised. I'll state as much: This surprised me - I was expecting XYZ at this point. Or I wasn't expecting X to be in charge of Y.
WTFs/minute is a good metric for code quality. Now your pair expressing that can be an LLM.
https://blog.codinghorror.com/whos-your-coding-buddy/
I like to say that the reviewer is always right in that sense: if something is surprising, confusing, or unexpected to them, it is. Since I've been looking at the code for hours, I don't have a valid perspective anymore.
> It reminds me how LLMs finally made people care about having good documentation - if not for other people, then for the AIs to read and understand the system
Honestly I've mostly seen the opposite - impenetrable code translated to English by AI
Even if the impenetrable human code was translated to English by AI, it's still useful for every future AI that will touch the code.
Perhaps getting that decent documentation took a fair bit of agentic effort (or even multiple passes using different models) to truly understand the code and eliminate hallucinations, so capturing that high-quality, accurate summary in a comment could save a lot of tokens and time in the future.
Interesting! I've often felt that we aren't fully utilizing the "low hanging fruit" from the early days of the LLM craze. This seems like one of those ideas.
That's a really cool idea. The inverse, where suggestions from the AI are similarly heat-mapped by the model's own confidence, would be extremely useful too.
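Something in this direction, maybe (a rough sketch; gpt2 and the prompt are just placeholders, and the per-token confidence is read straight off the generation scores):

    # Sketch: per-token confidence for a model's own completion,
    # so a suggestion could be heat-mapped the same way as existing code.
    # The model (gpt2) and the prompt are illustrative placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = tok("def reverse_string(s):\n    ", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=20,
                         output_scores=True, return_dict_in_generate=True)

    # out.scores holds one logits tensor per generated token.
    new_tokens = out.sequences[0, prompt.input_ids.shape[1]:]
    for token_id, step_scores in zip(new_tokens, out.scores):
        p = torch.softmax(step_scores[0], dim=-1)[token_id]
        print(f"{tok.decode(int(token_id))!r:12} p={p.item():.2f}")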
I want that in an editor. It's also a good way to check if your writing is too predictable or cliche.
The perplexity calculation isn't difficult; you just need to incorporate it into the editor interface.
Can you elaborate on how one would do this calculation?
Output:
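Roughly: tokenize the file, run it through a causal language model, and take each token's negative log-probability given the tokens before it; the exponentiated mean of those values is the perplexity. A minimal sketch, assuming the Hugging Face transformers API (gpt2 and the file path are just placeholders):

    # Sketch: per-token surprisal (negative log-probability) for a source file.
    # The model (gpt2) and the file path are illustrative placeholders.
    # (A real file longer than the model's context window would need chunking.)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    source = open("example.py").read()
    input_ids = tokenizer(source, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Log-probability of each actual token, given everything before it.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

    surprisal = -token_log_probs[0]           # higher = more surprising
    perplexity = torch.exp(surprisal.mean())  # whole-file perplexity
    print(f"perplexity: {perplexity.item():.2f}")

    for token, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()),
                        surprisal.tolist()):
        print(f"{token!r:>20}  {s:6.2f}")

Mapping those surprisal values to colors in the editor is then the remaining (mostly UI) work.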
Meta: Out of three models (k2, qwen3-coder, and opus4), only opus one-shotted the correct formatting for this comment.
There is some argument to be made here about entropy, compression, and how if there are no surprises, the program communicates no new information.
Interestingly, frequency of "surprising" sentences is one of the ways quality of AI novels is judged: https://arxiv.org/abs/2411.02316
Sounds great.
I'd like to see more contextually meaningful refactoring tools. Like "Remove this dependency" or "Externalize this code with a callback".
And refactoring shouldn't be done by generatively rewriting the code, but as a series of guaranteed equivalent transformations of the AST, each of which should be committed separately.
The AI should be used to analyse the value of the transformation and filter out asinine suggestions, not to write the code itself.
If it can't guarantee exact equivalence, it should at least rerun the relevant tests in the background and make sure they still pass before making suggestions.
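As a sketch of that last point, in whatever driver sits between the AI and the editor (the rename transform and the pytest invocation here are just assumptions for illustration):

    # Sketch: apply a mechanical AST transformation, then only surface the
    # suggestion if the test suite still passes. Rename + pytest are assumptions.
    # Requires Python 3.9+ for ast.unparse.
    import ast
    import subprocess

    class RenameLocal(ast.NodeTransformer):
        """Deterministically rename a variable; no generative rewriting involved."""
        def __init__(self, old: str, new: str):
            self.old, self.new = old, new

        def visit_Name(self, node: ast.Name) -> ast.Name:
            if node.id == self.old:
                node.id = self.new
            return node

    def propose_rename(path: str, old: str, new: str) -> bool:
        original = open(path).read()
        rewritten = ast.unparse(RenameLocal(old, new).visit(ast.parse(original)))

        open(path, "w").write(rewritten)
        # Gate the suggestion on the tests, not on the model's say-so.
        ok = subprocess.run(["pytest", "-q"]).returncode == 0
        if not ok:
            open(path, "w").write(original)  # roll back if anything breaks
        return ok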
That's actually something I implemented for a university project a few weeks ago. My professor also did some research into how this can be used for more advanced UIs. I'm sure it's a very common idea.
Do you have a link to the code? I'm curious how you implemented it. I'd also be really intrigued to see that research - does your professor have any published papers or something for those UIs?
You know what happens when a measure becomes a target, though.
That’s actually a fantastic idea.
Previously undefined variable and function names would be red as well.
It would depend on how surprising the name is, right? The declaration of `int total` should be relatively unsurprising at the top of a function named `sum`, but probably more surprising in `reverseString`.
All editors do this already.
No, I mean that when an LLM encounters a previously unseen name it doesn't expect it, so it would be red even though it's perfectly valid.
[flagged]
[flagged]
I read all your dead replies and it's a little wild that you think I'm an OpenAI employee (I'm not) trying to do damage control (is the article damaging to OpenAI?) by hijacking the comments (is my comment not a HUD-like idea?).
I don't really know where that's coming from; I'm just a dude who connected the idea in the article to an old idea that I haven't seen tried yet. The only thing I truly don't appreciate is that you made one comment saying the text of my post had changed. It didn't.
The trouble with anonymous downvoting is that it fuels this kind of paranoia.
[flagged]
You're adding nothing of substance. If you have a point about the subject itself, make it and present the receipts; then the rest of us can decide whether we can follow your observation.
Without even knowing what the supposed conflict is about: all I see here are baseless accusations from your side that, quite frankly, make you look a little unhinged. Please discuss your issues based on the merits of the ideas, not on accusations and persons.
[flagged]
How so? I seriously don't follow what you are trying to convey.
[flagged]
Could you elaborate, please?
Nope, not surprising. Parent changed their text but they are just as wrong.
[flagged]
(Their three most recent comments, to save anybody else checking whether @smolder is being picked on...)
> Have you considered not cheating your way to projecting your wrong opinions?
> Please downvote the fake commenters and keep YC reasonably pure.
> Please don't accept this comment as valid since it's part of a campaign to set peoples overton window, paid for by dbags and executed by dbags.
You're getting downvoted - and flagged - because you're repeatedly breaking the HN guidelines. You may want to consider whether that's a path you want to continue.
I do wish I could see who downvotes. If I ever criticise Google or Amazon, I'm immediately downvoted without comment.