Comment by kgeist

14 hours ago

>The short answer is that variable names are one of the things that confuses LLMs rather than helps them. Unlike with humans, names undermine a model's efforts to keep track of state over larger scales. Models confuse similarly named variables in different parts of the codebase easily

So I wonder: doesn't this apply to function names too, which the author keeps? I've seen LLMs use the wrong functions/classes as well.

I think a proper harness, an LSP, and tests already solve everything Vera is trying to solve. The post mostly cites research from 2021, before coding harnesses and agentic loops were a thing, back when people were basically trying to one-shot with relatively weak models (by modern standards).

The only way the author could have come up with that rationale is if he doesn't understand what a token is, what attention is, or how coding agents work.

Tokens map runs of characters to single vectors, and attention computes similarity scores between those vectors. Ideally each variable name would be a single distinctive token, so the model can instantly tell that two occurrences refer to the same variable. If everything is numbered, attention will attend every first parameter to every first parameter in every function, which means the numbering scheme would have to be randomized instead of always starting at zero.
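To make that concrete, here's a toy sketch. The vectors and names below are invented for illustration (real models use learned, high-dimensional embeddings, and this is a bare dot product, not real attention), but the failure mode is the same: a distinctive name always maps to the same token, so its occurrences score highly against each other, while a reused numbered name scores highly against unrelated occurrences too.

```python
# Toy illustration of attention similarity between variable-name tokens.
# All embeddings here are made up; real models learn them during training.

def dot(a, b):
    """Unnormalized similarity score between two token vectors."""
    return sum(x * y for x, y in zip(a, b))

# Pretend embeddings: each distinct token maps to one fixed vector.
emb = {
    "user_count":  [0.9, 0.1, 0.0],  # distinctive name, function A
    "retry_limit": [0.1, 0.9, 0.0],  # distinctive name, function B
    "v0":          [0.5, 0.5, 0.1],  # numbered name, reused in *every* function
}

# Two occurrences of the same distinctive name: maximal similarity.
same = dot(emb["user_count"], emb["user_count"])

# Two unrelated distinctive names: low similarity, as desired.
different = dot(emb["user_count"], emb["retry_limit"])

# Numbered scheme: "first parameter" in unrelated functions is the same
# token `v0`, so those occurrences also score maximally against each other,
# even though they refer to different variables.
collision = dot(emb["v0"], emb["v0"])

print(same, different, collision)
```

The point of the sketch: `same > different` is exactly what you want, but `collision` is just as high as `same`, which is why numbering from zero would confuse the attention mechanism across function boundaries.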

Coding agents are now capable of using tools, including text search, so being able to grep for a specific variable name is extremely helpful. By numbering variables, the author of the language has given himself the burden of relying entirely on LSPs rather than on innate model abilities that operate at the text level.
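A quick way to see the text-search problem (the two code snippets and function names below are hypothetical, just for illustration): with descriptive names a plain substring search pinpoints the right function, while with numbered names every function matches.

```python
# Hypothetical two-function codebase, in descriptive and numbered variants.

descriptive = """
def charge(user_count, retry_limit): ...
def render(width, height): ...
"""

numbered = """
def charge(v0, v1): ...
def render(v0, v1): ...
"""

def find_lines(source, name):
    """Return the lines of `source` that mention `name` (plain text search)."""
    return [line for line in source.splitlines() if name in line]

# Descriptive names: one unambiguous hit, straight to the right function.
print(find_lines(descriptive, "user_count"))

# Numbered names: every function matches, so text search alone can't tell
# which `v0` is meant; you now need an LSP to disambiguate.
print(find_lines(numbered, "v0"))
```

This is the burden in miniature: the numbered variant forces every lookup through semantic tooling, where the descriptive variant lets a dumb grep succeed.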

So yeah, on a textual level, the language is designed for an era of LLMs that has been obsolete for a long time.