Comment by storystarling
12 days ago
Raw CUDA works for the heavy lifting, but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference, and standard-library abstractions seem pretty important to keep that logic from becoming unmaintainable.
I was thinking mainly about the standard AR loop, but yes, I can see that grammars would make it more complicated, especially once batching is involved.
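To make the "per-sequence state machine" point concrete, here is a minimal sketch of batched grammar-constrained greedy decoding. Everything in it is hypothetical: a toy 4-token vocabulary, the grammar reduced to a single integer of state (paren nesting depth), and `fake_logits` standing in for a real model forward pass.

```python
from typing import List

VOCAB = {0: "(", 1: ")", 2: "x", 3: "<eos>"}
EOS = 3

def allowed_tokens(depth: int) -> List[int]:
    """Grammar as a tiny state machine: the state is the paren nesting
    depth. '(' and 'x' are always legal; ')' needs depth > 0; EOS is
    only legal once the string is balanced (depth == 0)."""
    ok = [0, 2]
    if depth > 0:
        ok.append(1)
    else:
        ok.append(EOS)
    return ok

def fake_logits(step: int, seq_idx: int) -> List[float]:
    """Stand-in for the model's forward pass (hypothetical):
    deterministic pseudo-scores so the example runs without a model."""
    return [((step * 7 + seq_idx * 3 + t * 5) % 11) / 10.0 for t in range(4)]

def constrained_decode(batch_size: int, max_steps: int = 8) -> List[List[int]]:
    depths = [0] * batch_size                    # per-sequence grammar state
    done = [False] * batch_size
    seqs: List[List[int]] = [[] for _ in range(batch_size)]
    for step in range(max_steps):
        for i in range(batch_size):
            if done[i]:
                continue                         # finished sequences sit out
            logits = fake_logits(step, i)
            legal = allowed_tokens(depths[i])
            # Masking: only the highest-scoring *legal* token survives.
            tok = max(legal, key=lambda t: logits[t])
            seqs[i].append(tok)
            if tok == 0:
                depths[i] += 1
            elif tok == 1:
                depths[i] -= 1
            elif tok == EOS:
                done[i] = True
    return seqs

if __name__ == "__main__":
    for s in constrained_decode(3):
        print("".join(VOCAB[t] for t in s))
```

Even in this toy, the batching cost is visible: each sequence in the batch carries and updates its own grammar state, finishes at its own time, and needs its own logit mask every step, which is exactly the bookkeeping that benefits from library abstractions rather than hand-rolled CUDA-side logic.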