Comment by rapatel0

1 month ago

I got qwen3.6:27B running on my 4090 (24GB) with ~128K context by leveraging some of the recent turboquant/rotorquant memory optimizations for activations. I highly suggest going up to that. The q4_xl+rotorquant combo is pretty good.

Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models

4 comments

altruios 1 month ago

What is your experience with performance past 40K tokens? I haven't gone beyond that, as I've heard reports that problems start to arise there. I'd be happy to hear how it has gone for you.

rapatel0 1 month ago

I'm super happy with the performance. I generally run with 2 parallel slots, so I only get about a 128K context window. My experience with all LLMs is that they get more forgetful if you use the full window. (256-512K is the sweet spot for frontier models; 128K works for me with this current qwen.)
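For anyone wondering why parallel slots cut into the window, here is a rough sketch of the arithmetic. It assumes the server splits one total context budget evenly across slots, which is how llama.cpp's server has historically treated its context size and parallel settings; the 256K total below is a hypothetical number picked so that 2 slots come out at 128K each, not a figure from my setup.

```python
def per_slot_context(total_ctx: int, parallel_slots: int) -> int:
    """Tokens available to each slot when a total budget is split evenly."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_ctx // parallel_slots


# Hypothetical total budget (assumption, not a measured value).
total_ctx = 256 * 1024

for slots in (1, 2, 4):
    print(f"{slots} slot(s): {per_slot_context(total_ctx, slots)} tokens each")
```

So under that assumption, every extra slot you want for concurrent requests divides the window each request can see.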

dmichulke 1 month ago

Forgive my ignorance but aren't they already on huggingface?

I assumed turboquant optimizations are already everywhere: in llama.cpp, or in the quantization machinery of unsloth and the like.

rapatel0 1 month ago

I forked it to also add rotorquant. This is a specific optimization that uses Clifford rotors instead of a static, compile-time random permutation to store the activations. It reduces the space and parameter count needed for the storage.
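To give a concrete picture of the general idea, here is a toy sketch, not the code in the repo. It assumes the vector is cut into 3D blocks, each block is rotated by a rotor (in 3D a rotor is equivalent to a unit quaternion, so 4 numbers instead of the 9 of a full 3x3 matrix), and the rotated values are rounded to 4-bit codes with one shared scale. The block size, the quantizer, and all function names are my own illustrative choices.

```python
import math
import random


def random_rotor(rng):
    """Random unit quaternion (w, x, y, z), i.e. a rotor in 3D."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def reverse(rotor):
    """Reverse (conjugate) of a rotor; it applies the inverse rotation."""
    w, x, y, z = rotor
    return (w, -x, -y, -z)


def rotate(rotor, v):
    """Apply the sandwich product R v R~ to a 3-vector, in matrix form."""
    w, x, y, z = rotor
    a, b, c = v
    return (
        (1 - 2 * (y * y + z * z)) * a + 2 * (x * y - w * z) * b + 2 * (x * z + w * y) * c,
        2 * (x * y + w * z) * a + (1 - 2 * (x * x + z * z)) * b + 2 * (y * z - w * x) * c,
        2 * (x * z - w * y) * a + 2 * (y * z + w * x) * b + (1 - 2 * (x * x + y * y)) * c,
    )


def quantize(values, bits=4):
    """Symmetric uniform quantizer: integer codes plus one shared scale."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) for v in values], scale


def encode(vec, rotors, bits=4):
    """Rotate each 3D block with its rotor, then quantize everything."""
    rotated = []
    for i, rotor in enumerate(rotors):
        rotated.extend(rotate(rotor, vec[3 * i:3 * i + 3]))
    return quantize(rotated, bits)


def decode(codes, scale, rotors):
    """Dequantize, then undo each block's rotation with the reversed rotor."""
    out = []
    for i, rotor in enumerate(rotors):
        block = [c * scale for c in codes[3 * i:3 * i + 3]]
        out.extend(rotate(reverse(rotor), block))
    return out


rng = random.Random(0)
vec = [rng.gauss(0.0, 1.0) for _ in range(12)]  # toy vector, length divisible by 3
rotors = [random_rotor(rng) for _ in range(len(vec) // 3)]

codes, scale = encode(vec, rotors)
restored = decode(codes, scale, rotors)
max_err = max(abs(a - b) for a, b in zip(vec, restored))

print("max abs error:", max_err)
print("rotor parameters stored:", 4 * len(rotors))
```

Since the rotation preserves length, the only error after decoding is the rounding error from the quantizer, and the transform itself costs just 4 numbers per block to store.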