treksis 7 hours ago How fast is this compared to the Python-based implementations?
rcarmo 6 hours ago The Python libraries are themselves written in C/C++, so what this does performance-wise is, at best, cutting through some glue. Don't think about this as a performance-driven implementation.
antirez 5 hours ago Very slow currently; I added benchmarks to the README. To go faster it needs faster inference kernels than the current float32-only ones.
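For context, a minimal sketch (assumed for illustration, not taken from the project's code): a "float32-only kernel" typically means a plain scalar loop like the matrix-vector product below, which is correct and portable but leaves SIMD units and lower-precision arithmetic unused.

    /* Illustrative sketch only -- not the project's actual code.
     * A naive float32 matrix-vector product: every weight is read
     * as a full float and accumulated one element at a time. */
    #include <stddef.h>

    void matvec_f32(const float *w, const float *x, float *y,
                    size_t rows, size_t cols) {
        for (size_t i = 0; i < rows; i++) {
            float acc = 0.0f;
            for (size_t j = 0; j < cols; j++)
                acc += w[i * cols + j] * x[j];  /* scalar FMA, no SIMD */
            y[i] = acc;
        }
    }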
throwaway314155 5 hours ago PyTorch MPS is about 10x faster per the README.md.
antirez 4 hours ago I cut the speed difference in half by computing the activations on the GPU. Time to sleep, but I'll continue tomorrow.
Numerlor 1 hour ago Have you tried e.g. Mojo, which can vectorize/do SIMD without having to write intrinsics everywhere?
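A hedged sketch of the same idea in C (illustrative only; this is not Mojo, and none of these names come from the project): GCC/Clang vector extensions are one way to get SIMD in portable C without per-ISA intrinsics, letting the compiler lower the vector type to NEON or AVX for the target.

    /* Illustrative sketch: portable SIMD via GCC/Clang vector
     * extensions, no architecture-specific intrinsics needed. */
    #include <stddef.h>

    typedef float f32x8 __attribute__((vector_size(32)));  /* 8 floats */

    float dot_f32(const float *a, const float *b, size_t n) {
        f32x8 acc = {0};
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            f32x8 va, vb;
            __builtin_memcpy(&va, a + i, sizeof va);  /* unaligned load */
            __builtin_memcpy(&vb, b + i, sizeof vb);
            acc += va * vb;  /* element-wise multiply-accumulate */
        }
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) sum += acc[k];  /* horizontal reduce */
        for (; i < n; i++) sum += a[i] * b[i];      /* scalar tail */
        return sum;
    }

The same dot product written with NEON or AVX intrinsics would need a separate code path per architecture; the vector-extension version is a single source the compiler specializes for each one.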