Comment by rao-v
9 hours ago
I think your analysis is right: this would make sense mostly for the 30B-3A style models that are mainly for edge / hobbyist use, where context length is precious so nobody is batching.
Given that experts live per layer, I don't think it makes sense to have orbital mechanics experts, but … I have wondered about swapping out the bottom 10% of layers per topic, given that that is likely where the highest-order concepts live. I've always wondered why people bother with LoRA on all layers, given that the early layers are more likely to be topic-agnostic and focused on more basic pattern assembly (see the recent papers on how LLMs count on a manifold).
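To put rough numbers on the "LoRA on all layers vs. only the last 10%" question, here is a minimal back-of-the-envelope sketch. It reads "bottom 10%" as the layers closest to the output, which is an assumption. The model shape (48 layers, hidden size 2048, four square projection matrices per layer, rank 16) is hypothetical and chosen only for illustration, and the helper names are made up for this example.

```python
import math


def last_layer_indices(num_layers: int, fraction: float) -> list[int]:
    """Indices of the final `fraction` of layers (always at least one layer)."""
    count = max(1, math.ceil(num_layers * fraction))
    return list(range(num_layers - count, num_layers))


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


# Hypothetical model shape, for illustration only.
NUM_LAYERS = 48
HIDDEN = 2048
RANK = 16
MATRICES_PER_LAYER = 4  # assumed: q, k, v, o projections, all HIDDEN x HIDDEN

per_layer = MATRICES_PER_LAYER * lora_params_per_matrix(HIDDEN, HIDDEN, RANK)
late_layers = last_layer_indices(NUM_LAYERS, 0.10)

all_layers_params = NUM_LAYERS * per_layer
late_only_params = len(late_layers) * per_layer

print("layers adapted:", late_layers)
print("LoRA params, all layers:", all_layers_params)
print("LoRA params, last 10% only:", late_only_params)
```

Under these assumptions the adapter shrinks roughly in proportion to the number of layers touched, which is what would make per-topic swapping cheap. Whether quality holds up when the early layers are left frozen is an empirical question this sketch does not answer.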