Comment by radarsat1

1 month ago

I don't know what the pros are doing, but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this; it's just some logic.

storystarling 1 month ago

Raw CUDA works for the heavy lifting, but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference, and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.

radarsat1 1 month ago

I was thinking mainly about the standard AR loop. Yes, I can see that grammars would make it a bit more complicated, especially when considering batching.
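
To make the batching point concrete, here is a minimal sketch of a greedy AR loop with a grammar constraint, where each sequence in the batch carries its own state-machine state. Everything in it is an assumption for illustration: the toy vocabulary, the `GRAMMAR` table, and `toy_logits` (a stand-in for a real model forward pass) are hypothetical and not taken from any library or production system.

```python
import math

# Hypothetical toy vocabulary; token ids are the list indices.
VOCAB = ["[", "]", "1", ",", "<eos>"]

# Hypothetical grammar as a DFA: state -> {allowed token id: next state}.
# It accepts bracketed lists of 1s such as "[]" or "[1,1]".
GRAMMAR = {
    "start": {0: "open"},             # must begin with "["
    "open":  {2: "item", 1: "done"},  # "1" or "]"
    "item":  {3: "comma", 1: "done"}, # "," or "]"
    "comma": {2: "item"},             # "1"
    "done":  {4: "end"},              # "<eos>"
}

def toy_logits(tokens, b):
    """Stand-in for a model forward pass: deterministic fake logits."""
    n = len(tokens)
    return [((n * 7 + i * 3 + b) % 5) - 0.1 * i for i in range(len(VOCAB))]

def constrained_greedy_decode(batch_size, max_steps=16):
    seqs = [[] for _ in range(batch_size)]
    states = ["start"] * batch_size  # one grammar state per sequence
    for _ in range(max_steps):
        if all(s == "end" for s in states):
            break
        for b in range(batch_size):
            if states[b] == "end":
                continue  # finished sequences are skipped
            logits = toy_logits(seqs[b], b)
            allowed = GRAMMAR[states[b]]
            # Mask every token the grammar forbids in this sequence's state.
            masked = [x if i in allowed else -math.inf
                      for i, x in enumerate(logits)]
            tok = max(range(len(VOCAB)), key=masked.__getitem__)
            seqs[b].append(tok)
            states[b] = allowed[tok]  # advance this sequence's state machine
    return seqs, states

if __name__ == "__main__":
    seqs, states = constrained_greedy_decode(batch_size=3)
    for seq, state in zip(seqs, states):
        print("".join(VOCAB[t] for t in seq), state)
```

Without the grammar, the inner loop is just a forward pass plus an argmax. With it, every sequence needs its own mask and state transition on every step, which is the part that would get awkward to batch.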