
Comment by bjackman

1 day ago

> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now.

Oh yeah, seems obvious now that you've said it, but this is a great point.

I'm constantly thinking "I need to get into local models but I dread spending all that time and money without having any idea if the end result would be useful".

But obviously the answer is to start playing with open models in the cloud!

I agree but I still have that itch to have my own local model—so it's not always about cost. A hobby?

(Besides, a hopped-up Mac would never go to waste in my home if it turns out the local LLM thing was not worth the cost.)

Well, they are doing that because of the nature of matrix multiplication. Specifically, LLM costs scale with the square of the length of a single input, call it N, but only linearly in the number of batched inputs, call it M:

O(M * N^2 * d)

Here d is a constant determined by the model you're running. Batching, by the way, is the reason many tools like Ollama require you to set the context length before serving requests.
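The asymmetry between M and N can be sketched with a toy cost function (the function name and the choice of d are illustrative, not from any real library):

```python
# Toy cost model for the quadratic attention term: O(M * N^2 * d),
# where M = batch size, N = sequence length, d = a model-dependent constant.
def attention_cost(batch_size: int, seq_len: int, d: int = 1) -> int:
    """Rough operation count: linear in batch size, quadratic in length."""
    return batch_size * seq_len ** 2 * d

# Doubling the batch doubles the cost...
assert attention_cost(2, 1000) == 2 * attention_cost(1, 1000)
# ...but doubling the input length quadruples it.
assert attention_cost(1, 2000) == 4 * attention_cost(1, 1000)
```

In other words, serving twice as many customers is far cheaper than serving each customer a context twice as long.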

Having many more inputs is far cheaper than having longer inputs. In fact, this property is a big part of why we went for LLMs in the first place: batching ("serving many customers") is exactly what you do during training, and it lets training proceed quickly. GPUs came into the picture because taking 10k triangles and doing almost the exact same calculation batched 1920×1080 times is exactly what happens behind the eyes of Lara Croft.

And this is simplified, because a vector input (i.e. M=1) is the worst case for the hardware, so vendors just don't do it (and certainly not in published benchmark results). Often even older chips are hardwired to work with M set to 8 (and these days 24 or 32) for every calculation. So until you hit roughly 20 simultaneous customers/requests, the extra ones are almost entirely free in practice.
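A minimal sketch of that hardwired-batch effect, assuming (as above) the hardware rounds every batch up to a fixed multiple of 8 before multiplying:

```python
import math

# Assumed hardware batch granularity (8 on the older chips described above).
HW_BATCH = 8

def padded_batch(requests: int) -> int:
    """The batch size the hardware actually computes with:
    requests are padded up to the next multiple of HW_BATCH."""
    return max(HW_BATCH, math.ceil(requests / HW_BATCH) * HW_BATCH)

# One request costs the hardware exactly as much as eight:
assert padded_batch(1) == padded_batch(8) == 8
# The ninth request bumps you to the next multiple.
assert padded_batch(9) == 16
```

So up to the padding boundary, additional concurrent requests ride along for free.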

Hence the optimization of subagents. Say you need an LLM to process 1 million words (assume 1 word = 1 token for simplicity):

O(1 million words in one go) ~ (1e6)^2 = 1e12, or 1 trillion operations

O(1000 batches of 1000 words) ~ 1e3 × (1e3)^2 = 1e9, or 1 billion operations

O(10,000 batches of 100 words) ~ 1e4 × (1e2)^2 = 1e8, or 100 million operations

O(100,000 batches of 10 words) ~ 1e5 × 10^2 = 1e7, or 10 million operations

O(one word at a time) ~ 1e6 × 1^2 = 1e6, or 1 million operations
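The arithmetic above can be reproduced in a few lines (a sketch of the same back-of-the-envelope model, ignoring the constant d):

```python
# Splitting 1 million tokens into `chunks` equal pieces,
# with per-chunk cost ~ (tokens per chunk)^2.
TOTAL = 1_000_000

def total_ops(chunks: int) -> int:
    per_chunk = TOTAL // chunks
    return chunks * per_chunk ** 2

assert total_ops(1) == 10**12          # one shot: 1 trillion
assert total_ops(1_000) == 10**9       # 1000 x 1000 words: 1 billion
assert total_ops(10_000) == 10**8      # 100 million
assert total_ops(100_000) == 10**7     # 10 million
assert total_ops(1_000_000) == 10**6   # one word at a time: 1 million
```

Every 10× increase in the number of chunks buys a 10× drop in total operations, which is exactly the table above.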

Of course, to an extent this last way of doing things is the long-known case of a recurrent neural network: very difficult to train, but if you get it working, it speeds away like Professor Snape confronted with a bar of soap (to steal a Harry Potter joke).