Comment by 7777777phil
2 days ago
A 32B model in 19.3GB matters, and it's really cool imo. Memory and cold start are what gate production deployments.
I did a piece (1) on how Netflix and Spotify worked this out a while ago: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it.
(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...
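For illustration only, and not taken from the linked piece: a minimal sketch of what "call the LLM only when the payoff justifies it" could look like as a routing rule. The `Request` fields, the `route` function, the confidence floor, and the cost figure are all hypothetical placeholders, not anything Netflix or Spotify are known to use.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    classical_confidence: float  # hypothetical: cheap model's confidence, in [0, 1]
    expected_llm_gain: float     # hypothetical: estimated uplift from calling the LLM


def route(req: Request, confidence_floor: float = 0.6, llm_cost: float = 0.05) -> str:
    """Pick a backend for one request (illustrative thresholds only)."""
    # Confident cheap answer: serve it and skip the LLM entirely.
    if req.classical_confidence >= confidence_floor:
        return "classical"
    # Cheap model is unsure: pay for the LLM only if the expected gain beats its cost.
    if req.expected_llm_gain > llm_cost:
        return "llm"
    # Otherwise fall back to the cheap path anyway.
    return "classical"
```

With thresholds like these, most traffic stays on the cheap path and the LLM only sees the requests that are both uncertain and worth the cost.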