Comment by scosman

14 days ago

128 experts at 17B active parameters. This is going to be fun to play with!

Does the entire model have to be loaded in VRAM? If not, 17B is a sweet spot for enthusiasts who want to run it on a 3090/4090.

  • Yes. MoE models typically route each token through a different subset of experts. So while the compute is similar to a dense model with the same number of "active" parameters, the VRAM requirements scale with the total parameter count. You could technically run inference and swap experts in and out of VRAM, but the latency would be pretty horrendous. (See the rough sizing sketch at the end of this thread.)

  • Oh, for perf reasons you'll want it all in VRAM or unified memory. This isn't a great local model for 99% of people.

    I’m more interested in playing around with quality given the fairly unique “breadth” play.

    And servers running this should be very fast and cheap.
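
To make the VRAM point concrete, here is a back-of-the-envelope sketch. The 17B active figure comes from the announcement; the total parameter count and the `vram_gb` helper are illustrative assumptions, and the estimate covers weights only (no KV cache or activations).

```python
# Rough weight-memory estimate: MoE VRAM needs scale with *total*
# parameters, even though only the "active" subset runs per token.

def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (weights only; ignores KV cache)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

ACTIVE_B = 17    # active parameters per token (from the announcement)
TOTAL_B = 400    # assumed total across all 128 experts -- illustrative only

for name, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: active-only ~{vram_gb(ACTIVE_B, bpp):.0f} GB, "
          f"full model ~{vram_gb(TOTAL_B, bpp):.0f} GB")
```

Even at int4, the full expert set lands far beyond a 24 GB 3090/4090, while a 17B dense model would fit comfortably once quantized, which is the gap the replies above are pointing at.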