Comment by storystarling
12 days ago
Raw CUDA works for the heavy lifting, but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference, and standard-library abstractions seem pretty important to keep that logic from becoming unmaintainable.
I was thinking mainly about the standard AR loop, but yes, I can see that grammars would make it more complicated, especially once batching is involved.
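To make the "per-sequence state machine" point concrete, here is a minimal sketch of batched grammar-constrained greedy decoding. Everything in it is hypothetical: a toy 4-token vocabulary, the grammar reduced to a single integer of state (paren nesting depth), and `fake_logits` standing in for a real model forward pass.

```python
from typing import List

VOCAB = {0: "(", 1: ")", 2: "x", 3: "<eos>"}
EOS = 3

def allowed_tokens(depth: int) -> List[int]:
    """Grammar as a tiny state machine: the state is the paren nesting
    depth. '(' and 'x' are always legal; ')' needs depth > 0; EOS is
    only legal once the string is balanced (depth == 0)."""
    ok = [0, 2]
    if depth > 0:
        ok.append(1)
    else:
        ok.append(EOS)
    return ok

def fake_logits(step: int, seq_idx: int) -> List[float]:
    """Stand-in for the model's forward pass (hypothetical):
    deterministic pseudo-scores so the example runs without a model."""
    return [((step * 7 + seq_idx * 3 + t * 5) % 11) / 10.0 for t in range(4)]

def constrained_decode(batch_size: int, max_steps: int = 8) -> List[List[int]]:
    depths = [0] * batch_size                    # per-sequence grammar state
    done = [False] * batch_size
    seqs: List[List[int]] = [[] for _ in range(batch_size)]
    for step in range(max_steps):
        for i in range(batch_size):
            if done[i]:
                continue                         # finished sequences sit out
            logits = fake_logits(step, i)
            legal = allowed_tokens(depths[i])
            # Masking: only the highest-scoring *legal* token survives.
            tok = max(legal, key=lambda t: logits[t])
            seqs[i].append(tok)
            if tok == 0:
                depths[i] += 1
            elif tok == 1:
                depths[i] -= 1
            elif tok == EOS:
                done[i] = True
    return seqs

if __name__ == "__main__":
    for s in constrained_decode(3):
        print("".join(VOCAB[t] for t in s))
```

Even in this toy, the batching cost is visible: each sequence in the batch carries and updates its own grammar state, finishes at its own time, and needs its own logit mask every step, which is exactly the bookkeeping that benefits from library abstractions rather than hand-rolled CUDA-side logic.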