
Comment by 0xbadcafebee

2 days ago

For non-Mac users:

A laptop with an iGPU and lots of system RAM has the advantage of being able to use system RAM in addition to VRAM to load models (assuming your GPU driver supports it, which most do AFAIK), so load up as much system RAM as you can. The downside is that system RAM is slower than the dedicated GDDR6 on a discrete card. Examples of such iGPUs are the Radeon 890M and Intel Arc (previous generations are still decently good, if that's more affordable for you).

A laptop with a discrete GPU will not be able to load models as large directly into VRAM, but with layer offloading and a quantized MoE model you can still get quite fast performance from modern small-to-medium-sized models.

Do not get less than 32GB of RAM for any machine, and max out the iGPU machine's RAM. Also try to get a bigass NVMe drive: you will likely be downloading a lot of large models, and if you run them from a VM with Docker containers, all of that eats up quite a bit of drive space.
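To make the RAM advice concrete, here is a back-of-the-envelope sketch of how much memory a model's weights need at a given quantization level. This is only the weights; real runtimes add overhead for the KV cache, activations, and the runtime itself.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a model's weights in GB.

    params_billions * 1e9 weights * (bits / 8) bytes, divided by 1e9
    bytes per GB, simplifies to params_billions * bits / 8.
    """
    return params_billions * bits_per_weight / 8

# A 70B model at 4-bit quantization needs ~35 GB just for weights:
# too big for a 16 GB discrete GPU, but it fits in 64 GB of
# iGPU-shared system RAM.
print(model_size_gb(70, 4))  # 35.0

# An 8B model at 4-bit is ~4 GB and fits comfortably on a midrange dGPU.
print(model_size_gb(8, 4))   # 4.0
```

This is why the iGPU-plus-lots-of-RAM route opens up larger models than a discrete card with fixed VRAM, at the cost of slower memory bandwidth.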

Final thought: before you spend thousands on a machine, consider that there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now. Do the math before you purchase a machine; unless you are doing 24/7/365 inference, the cloud is vastly more cost effective.
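"Do the math" can be as simple as this break-even sketch. Every number below is a made-up placeholder, not a real quote; plug in actual hardware prices, your expected token volume, and a provider's current per-token rate before deciding.

```python
# All figures are hypothetical placeholders for illustration only.
MACHINE_COST_USD = 2500.0        # assumed one-time hardware price
CLOUD_PRICE_PER_MTOK = 0.50      # assumed $/million tokens for an open-weight model
TOKENS_PER_MONTH = 50_000_000    # assumed monthly usage (50M tokens)

monthly_cloud_bill = TOKENS_PER_MONTH / 1_000_000 * CLOUD_PRICE_PER_MTOK
months_to_break_even = MACHINE_COST_USD / monthly_cloud_bill

print(f"cloud: ${monthly_cloud_bill:.2f}/mo")                 # $25.00/mo
print(f"break-even: {months_to_break_even:.0f} months")       # 100 months
```

At these (invented) numbers the machine takes over eight years to pay for itself, ignoring electricity and the fact that cloud prices keep dropping; heavy 24/7 usage shifts the math the other way.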

> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now.

Oh yeah, seems obvious now you said it, but this is a great point.

I'm constantly thinking "I need to get into local models but I dread spending all that time and money without having any idea if the end result would be useful".

But obviously the answer is to start playing with open models in the cloud!

  • I agree but I still have that itch to have my own local model—so it's not always about cost. A hobby?

    (Besides, a hopped-up Mac would never go to waste in my home if it turns out the local LLM thing was not worth the cost.)

  • Well, they are doing that because of the nature of matrix multiplication. Specifically, LLM cost scales with the square of the length of a single input, call it N, but only linearly with the number of batched inputs, call it M:

    O(M * N^2 * d)

    d is a constant related to the network you're running. Batching, btw, is the reason many tools like Ollama require you to set the context length before serving requests.

    Having many more inputs is way cheaper than having longer inputs. In fact, this is the reason we went for LLMs in the first place: batching ("serving many customers") is exactly what you do during training, and it is what allows training to proceed quickly. GPUs came in because taking 10k triangles and doing almost the exact same calculation batched 1920*1080 times on them is exactly what happens behind the eyes of Lara Croft.

    And this is simplified, because a single vector input (i.e. M=1) is the worst case for the hardware, so vendors just don't do it (and certainly not in published benchmark results). Even older chips are often hardwired to work with M padded up to 8 (and these days 24 or 32) for every calculation. So until you hit ~20 simultaneous customers/requests, serving them is almost entirely free in practice.

    Hence the optimization of subagents. Say you need an LLM to process 1 million words (assume 1 word = 1 token for simplicity):

    O(1 million words in one go) ~ 1e12 or 1 trillion operations

    O(1000 times 1000 words) ~ 1e9 or 1 billion operations

    O(10000 times 100 words) ~ 1e8 or 100 million operations

    O(100000 times 10 words) ~ 1e7 or 10 million operations

    O(one word at a time) ~ 1e6 or 1 million operations

    Of course, taken to the extreme, this last way of doing things is the long-known case of a recurrent neural network: very difficult to train, but if you get it working, it speeds away like Professor Snape confronted with a bar of soap (to steal a Harry Potter joke).
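The chunking arithmetic above can be sanity-checked with the simplified cost model cost = M * N^2 (dropping the constant d), where M is the number of independent chunks and N the chunk length:

```python
def chunked_cost(total_tokens: int, chunk_size: int) -> int:
    """Cost of processing total_tokens in independent chunks of chunk_size,
    under the simplified attention-cost model M * N^2 (d dropped)."""
    num_chunks = total_tokens // chunk_size   # M
    return num_chunks * chunk_size ** 2       # M * N^2

TOTAL = 1_000_000
for chunk in (1_000_000, 1_000, 100, 10, 1):
    print(f"{chunk:>9} words/chunk -> {chunked_cost(TOTAL, chunk):.0e} ops")
# prints 1e+12, 1e+09, 1e+08, 1e+07, 1e+06 -- matching the list above
```

Note the cost simplifies to TOTAL * chunk_size, which is why every 10x reduction in chunk length buys a 10x reduction in total operations.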

> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud

Do you have some links?

Also I assume the privacy implications are vastly different compared to running locally?

  • Throw a rock and you'll hit one... Groq (not Grok; Elon stole the name), Mistral, SiliconFlow, Clarifai, Hyperbolic, Databricks, Together AI, Fireworks AI, CompactifAI, Nebius Base, Featherless AI, Hugging Face (they do inference too), Cohere, Baseten, DeepInfra, DeepSeek, Novita AI, OpenRouter, xAI, Perplexity Labs, AI21, OctoAI, Reka, Cerebras, Fal AI, Nscale, OVHcloud AI, Public AI, Replicate, SambaNova, Scaleway, WaveSpeedAI, Z.ai, GMI Cloud, Nebius, Tensorwave, Lamini, Predibase, FriendliAI, Shadeform, Qualcomm Cloud, Alibaba Cloud AI, Poe, Bento LLM, BytePlus ModelArk, InferenceAI, IBM watsonx.ai, AWS Bedrock, Microsoft, Google

  • I use Ollama Cloud. $20/mo and I never come close to hitting quota (YMMV obviously).

    They don't log anything, and they use US datacenters.

  • For privacy-preserving direct inference: Fireworks AI, Nebius.

    Otherwise, OpenRouter for routing to lots of different providers.