Comment by tmzt

18 hours ago

Has anybody thought about encoding AST tokens as LLM tokens, similar to how different words can have different meanings and that's reflected in their embedding?

1 comment

tmzt

janalsncm 17 hours ago

Language keywords are almost definitely individual tokens. But I think you mean more than that. Basically replacing identifiers with special tokens as well. It’s worth a shot but there’s some practical problems.

Immediate downside is that mapping variable name to token and back would probably require indexing the whole codebase. You’d need a 1:1 mapping for every name that was in scope, and probably need to be clever about disambiguating names that come in and out of scope.