
Comment by tmzt

18 hours ago

Has anybody thought about encoding AST tokens as LLM tokens, similar to how different words can have different meanings and that's reflected in their embedding?

Language keywords are almost certainly individual tokens already. But I think you mean more than that: replacing identifiers with special tokens as well. It’s worth a shot, but there are some practical problems.

The immediate downside is that mapping variable names to tokens and back would probably require indexing the whole codebase. You’d need a 1:1 mapping for every name in scope, and you’d probably need to be clever about disambiguating names that come in and out of scope.
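To make the bookkeeping concrete, here is a rough Python sketch of what such a mapping might look like. Everything in it is made up for illustration: the `ScopedNameMap` class, its method names, and the `<ID_n>` placeholder format are assumptions, not anything a real tokenizer provides. It only shows the scope-stack idea; it does not parse code or index a codebase.

```python
class ScopedNameMap:
    """Hypothetical two-way map between identifiers and placeholder tokens."""

    def __init__(self):
        self._scopes = [{}]  # stack of name -> token dicts, innermost scope last
        self._names = []     # placeholder index -> original name

    def push_scope(self):
        self._scopes.append({})

    def pop_scope(self):
        if len(self._scopes) == 1:
            raise RuntimeError("cannot pop the outermost scope")
        self._scopes.pop()

    def declare(self, name):
        # Every declaration gets a fresh placeholder, so a shadowing
        # variable never shares a token with the name it shadows.
        token = f"<ID_{len(self._names)}>"
        self._names.append(name)
        self._scopes[-1][name] = token
        return token

    def encode(self, name):
        # Resolve innermost scope first, like ordinary lexical lookup.
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        # Unseen name: treat it as declared in the current scope.
        return self.declare(name)

    def decode(self, token):
        # Assumes the "<ID_n>" format produced by declare().
        return self._names[int(token[4:-1])]


names = ScopedNameMap()
outer = names.declare("count")   # outer `count`
names.push_scope()
inner = names.declare("count")   # shadowing `count` in a nested scope
inner_use = names.encode("count")
names.pop_scope()
outer_use = names.encode("count")
```

Here `inner_use` resolves to the inner placeholder and `outer_use` to the outer one, and both decode back to `count`. The hard part this sketch skips is exactly the one mentioned above: knowing where scopes begin and end across a whole codebase.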