Comment by ThouYS

1 day ago

I feel like there is no portable advice for performance. A PyTorch model exported to ONNX is effectively a different model.

That ONNX model run in ONNX Runtime with the CUDA execution provider (EP) is a different model again from the one run with the TensorRT EP.

And even within the same runtime, the model behaves differently depending on the target hardware and the memory available during tuning. It is a humongous mess.

That's interesting, as I was considering GGUF --> ONNX conversions (via Olive), but if the conversion introduces unknown distortions in effectiveness and stability, it might be a dead-end idea.
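
One way to make such distortions less "unknown" is to measure them: feed the same input to both versions of the model and compare the outputs numerically. Below is a rough sketch of that idea in plain Python. The logits are made-up placeholder numbers, the helper names (compare_outputs, argmax) are hypothetical, and the tolerances are arbitrary assumptions; in practice you would plug in real outputs from each runtime and pick thresholds that suit your task.

```python
def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-3):
    """Compare two flat lists of floats produced by two backends for the same input.

    atol/rtol are assumed tolerances, not values recommended by any runtime.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")

    max_abs_diff = 0.0
    mismatches = 0
    for ref, cand in zip(reference, candidate):
        diff = abs(ref - cand)
        max_abs_diff = max(max_abs_diff, diff)
        # An element counts as a mismatch if it exceeds the combined tolerance.
        if diff > atol + rtol * abs(ref):
            mismatches += 1

    return {
        "max_abs_diff": max_abs_diff,
        "mismatches": mismatches,
        "ok": mismatches == 0,
    }


def argmax(values):
    """Index of the largest value, e.g. the predicted class or token."""
    return max(range(len(values)), key=values.__getitem__)


# Placeholder logits standing in for the original model and the converted one.
original_logits = [2.10, -0.35, 0.92, 4.71]
converted_logits = [2.1004, -0.3498, 0.9203, 4.7095]

report = compare_outputs(original_logits, converted_logits)
same_top1 = argmax(original_logits) == argmax(converted_logits)
print(report, same_top1)
```

Running a check like this over a representative set of inputs, per runtime and per target machine, would at least show whether the differences are small numerical noise or something that changes the model's actual predictions.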