Comment by meatmanek

2 days ago

This model is pretty cool if you don't have a GPU - I was able to get I think 20 or 30 tokens per second on CPU (DDR4 ram) alone. (I don't remember if that was with q4 or q8.)

Otherwise, if you have a GPU with more than like 4GB of VRAM, there are better models. Gemma4 and Qwen3.6 (or Qwen3.5 if you need the smaller dense models that haven't yet been released for 3.6) are a good place to start.

2 comments

meatmanek

aziis98 1 day ago

> I was able to get I think 20 or 30 tokens per second on CPU (DDR4 ram) alone

What are you using for inference? I have a recent intel laptop with 32GB of DDR5 and I am getting at most 25tps with the llama cpp vulkan backend (that is the fastest, I also tried sycl but it is a bit slower)

meatmanek 1 day ago
Ok, I double-checked, and I get 21-22tps with lmstudio-community/LFM2-24B-A2B-Q4_K_M.gguf running under LM Studio on my i5-12400 with 2x32GB sticks of DDR4 3200. This is with small context (just "Write me a poem about a language model named Liquid" in `lms chat`)
Prediction Stats: Stop Reason: eosFound Tokens/Second: 21.10 Time to First Token: 1.827s Prompt Tokens: 42 Predicted Tokens: 187 Total Tokens: 229