Comment by radarsat1

13 days ago

I don't know what the pros are doing but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this, it's just some logic.

Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.

  • I was thinking mainly about the standard AR loop, yes I can see that grammars would make it a bit more complicated especially when considering batching.