Comment by mercutio2
12 days ago
What toolchain are you going to use with the local model? I agree that’s a strong model, but it’s so slow for me with large contexts that I’ve stopped using it for coding.
I have my own agent harness, and the inference backend is vLLM.
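(For reference, a minimal sketch of how an agent harness might talk to a local vLLM backend through its OpenAI-compatible API; the port, model ID, and prompt below are placeholder assumptions, not details of the harness described here, which isn't public.)

    from openai import OpenAI

    # Point the standard OpenAI client at a locally running vLLM server.
    # Port 8000 is vLLM's default; model ID and prompt are illustrative.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Summarize the diff in main.py"}],
    )
    print(resp.choices[0].message.content)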
Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.
I would happily use local models if I could get them to perform, but they’re super slow if I bump their context window high, and I haven’t seen good orchestrators that keep context limited enough.
Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?
I don't; it fits on my card with the full context. I think the native MXFP4 weights take ~70GB of VRAM (out of 96GB available on an RTX Pro 6000), so I still have room to spare to run GPT-OSS-20B alongside for smaller tasks too, plus Wayland+GNOME :)
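(A rough sketch of that kind of single-GPU setup using vLLM's Python API; the memory-utilization cap and 131072-token context length are illustrative guesses, not the commenter's exact configuration.)

    from vllm import LLM, SamplingParams

    # Single-GPU setup: cap vLLM's share of the 96GB card so there is
    # headroom left for a second, smaller model and the desktop session.
    llm = LLM(
        model="openai/gpt-oss-120b",   # native MXFP4 checkpoint
        gpu_memory_utilization=0.85,   # illustrative cap, not the actual value
        max_model_len=131072,          # full context window (assumed)
    )

    out = llm.generate(
        ["Write a short haiku about VRAM."],
        SamplingParams(max_tokens=64),
    )
    print(out[0].outputs[0].text)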