Comment by fc417fc802
16 hours ago
fast != practical
You can get lots of tokens per second on the CPU if the entire network fits in L1 cache. Unfortunately, the sub-64 KiB model segment isn't looking so hot.
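For a rough sense of scale, here is a back-of-the-envelope sketch of how many weights fit in that budget. It assumes a 64 KiB L1 data cache and that the weights are the only thing in it (no activations, no KV cache, no code), which is optimistic. `max_params` and the precision table are made-up names for illustration only.

```python
# Rough sketch: how many weights fit in a 64 KiB cache at common precisions.
# Assumption: weights are the only resident data, which a real run would not achieve.

L1_BYTES = 64 * 1024  # the "sub-64 KiB" budget mentioned above

# Storage cost per weight, in bytes, for a few common formats.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def max_params(cache_bytes: int, bytes_per_weight: float) -> int:
    """Upper bound on the number of weights that fit in cache_bytes."""
    return int(cache_bytes // bytes_per_weight)


for name, width in BYTES_PER_WEIGHT.items():
    print(f"{name}: {max_params(L1_BYTES, width):,} weights")
```

So even at 4 bits per weight the ceiling is on the order of 130 thousand parameters, a long way below anything usually called a language model.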
But actually ... 3000? Did GP misplace one or two zeros there?
I wondered the same, but the rendering seems right: the output was almost instant. I'll recheck the token counter. Anyway, as you say, fast isn't practical. I actually had to develop my own tiny model, https://huggingface.co/xaskasdf/brandon-tiny-10m-instruct, to fit something "usable", and it's basically a liar or disinformation machine haha