Comment by jychang

1 month ago

They didn't do something stupid like Llama 4's "one active expert", but 4 of 256 is very sparse. It's not going to get close to DeepSeek or GLM level performance unless they trained on the benchmarks.
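
To put a number on "very sparse", here's a quick back-of-the-envelope sketch. The helper name `active_fraction` is made up for illustration, and it only counts routed experts, so it ignores any shared experts or dense layers a real model might also run per token.

```python
def active_fraction(active_experts: int, total_experts: int) -> float:
    """Fraction of routed experts that fire for a single token."""
    if not 0 < active_experts <= total_experts:
        raise ValueError("need 0 < active_experts <= total_experts")
    return active_experts / total_experts


# 4 of 256 routed experts, the ratio discussed above
print(f"{active_fraction(4, 256):.4%}")  # 1.5625%
```

So each token only touches about 1 in 64 of the routed experts.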

I don't think that was a good move. No other models do this.