It most likely will in terms of performance, since it uses 50% less memory (certainly at inference time, which is the most common operation on web services): it can leverage a longer T and D, provided the design is confirmed and the generation quality is comparable to other models.
If this basic assumption is correct, it means significant electricity savings, since the same GPUs can serve more requests.
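To make the "same GPUs, more requests" claim concrete, here is a minimal back-of-the-envelope sketch. All numbers (GPU memory, weight footprint, per-request memory) are hypothetical placeholders, not measurements from the model under discussion; the point is only that halving per-request inference memory roughly doubles the concurrent requests a fixed memory budget can hold.

```python
# Hypothetical serving budget (all values are illustrative assumptions)
GPU_MEM_GB = 80       # total accelerator memory
WEIGHTS_GB = 30       # memory pinned by model weights
MEM_PER_REQ_GB = 2.0  # KV-cache/activation memory per request, baseline

def max_concurrent_requests(mem_per_req_gb: float) -> int:
    """Requests that fit in the memory left over after loading weights."""
    return int((GPU_MEM_GB - WEIGHTS_GB) // mem_per_req_gb)

baseline = max_concurrent_requests(MEM_PER_REQ_GB)        # 25 requests
halved = max_concurrent_requests(MEM_PER_REQ_GB * 0.5)    # 50 requests
print(baseline, halved)
```

Under these assumptions the throughput per GPU doubles, which is where the electricity-savings argument comes from: the same hardware serves twice the traffic.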
By performance, I meant the accuracy of the model, not the runtime/memory characteristics.
Nor that the training from scratch will even work.
Exactly, that is the current objective: to prove that generation for a specific domain is on par with causal-attention models.