
Comment by BeetleB

6 hours ago

Ok. I think I misunderstood. So the idea is to simply set up the LLM service on the server and access it through an API, like I would with any LLM provider? That way, whatever application I want to use it for stays at home?

That's a bit more appealing. How much would it cost per month to keep it continuously online?

Well, it depends entirely on what you need. You can even do the training yourself on that rented infrastructure if you want. The more you do yourself, the more private it is, but also the more expensive.

I don't want to make an ad here, but I'll point to Hugging Face's https://endpoints.huggingface.co as an example with pricing (and, to avoid singling them out, https://replicate.com/pricing too, though I don't know them well).

The "beauty" IMHO of such solutions is that, again, you pay for what you want. Want to use the endpoint for only 5 minutes to test that the model and its API fit your needs? OK. Want it for the whole month? Sure. One user, namely you? Fine, not a lot of power needed. Want your whole organization to use that endpoint? Scale up.

I'm going to give a very rough approximation, because honestly I'm not deep into this, so someone please adjust with sources:

Apple Mac Studio M3 Ultra 96GB = $4K

NVIDIA A100 80GB ≈ 10x the inference performance of the M3 Ultra (obviously depends on the model)

So on Replicate today one can get an A100 for ~$5/hr, which adds up to roughly the Mac's price after about a month. But that's at 10x the speed, with electricity included. So, very VERY approximately: if you would use a Mac Studio for AI non-stop (day and night) for about 10 months, then it's arguably worth buying.

If you use it less, say 2 hrs/day for inference only, then I imagine it takes a few years to reach the equivalent, and by then I'd bet Replicate or Hugging Face will be renting much faster setups for much cheaper, simply because that's what they have ALL done for the last few years.
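The arithmetic above can be sketched as a quick script. The prices and the 10x speedup are the rough figures from this thread, not measured numbers, so treat the outputs as order-of-magnitude only:

```python
# Rough figures from this thread -- adjust to current pricing.
MAC_PRICE = 4_000        # Mac Studio M3 Ultra 96GB, USD
A100_RATE = 5            # rented A100 80GB, USD per hour (approx.)
SPEEDUP = 10             # A100 ~ 10x the Mac's inference throughput (very rough)
HOURS_PER_MONTH = 730    # 24/7 for an average month

# One month of rented A100 costs about the Mac's purchase price:
rental_month = A100_RATE * HOURS_PER_MONTH            # ~$3,650

# But that month of A100 does ~10 Mac-months of work, so renting the
# equivalent of one Mac-month of compute costs only a tenth of that:
cost_per_mac_month = rental_month / SPEEDUP           # ~$365

# Break-even: months of 24/7 Mac use before buying beats renting.
break_even_months = MAC_PRICE / cost_per_mac_month    # ~11 months

# Light use (2 hrs/day of Mac-speed inference): renting the same work
# takes only 2/SPEEDUP hours of A100 per day.
light_use_per_year = (2 / SPEEDUP) * A100_RATE * 365  # ~$365/year
break_even_years = MAC_PRICE / light_use_per_year     # ~11 years

print(f"break-even at 24/7 use: ~{break_even_months:.0f} months")
print(f"break-even at 2 hrs/day: ~{break_even_years:.0f} years")
```

With these numbers the light-use break-even is closer to a decade than a few years, which only strengthens the point: by then, rented hardware will almost certainly be faster and cheaper.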

  • Well, full disclosure (despite my comments above): I'm not interested in buying a Mac Studio. I was merely explaining why I thought people may prefer it.

    For my own use, I'm just looking at absolute price (and convenience).

    I haven't explored open weights models, so I have no idea which I'd want. It would be great to get a "frontier" model like Minimax-M2.5, but at $10/hr, it's not worth it - let alone $40/hr for GLM-5. I'd have to explore use cases for cheaper models. Likely for things related to reading emails, I can get by with a much cheaper model.

    If I set one of these up, how easy is it to launch it (from the command line on my home PC) and then shut it down? Right now, when I write any app (or use OpenCode), it's frictionless. My worry is that turning it on will be a hassle, or worse, that I'll forget to turn it off and suddenly get a big pointless bill.

    If there are any guides out there on how people manage all this, it would be much appreciated.
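    To put a rough number on the "forgot to turn it off" worry, here is the worst case using the ~$5/hr A100 figure from the thread above (note that some providers, Hugging Face Inference Endpoints among them, offer pausing or scale-to-zero so an idle endpoint stops billing):

    ```python
    A100_RATE = 5  # USD/hr, the rough per-hour figure from this thread

    # Cost of an endpoint left running idle:
    overnight = 12 * A100_RATE   # forgot it overnight: ~$60
    weekend = 48 * A100_RATE     # forgot it over a weekend: ~$240
    month = 730 * A100_RATE      # forgot it for a whole month: ~$3,650

    print(f"overnight ~${overnight}, weekend ~${weekend}, month ~${month}")
    ```

    So the failure mode is real but bounded: a forgotten weekend costs on the order of a few hundred dollars, and only a forgotten month approaches the price of the Mac itself. A billing alert or a scale-to-zero setting caps it further.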

    • Honestly I doubt it's worth it, hence my suggestion to make a "cold" estimation of both options.

      Well, it's not exactly a guide, and honestly it's quite outdated (I stopped keeping track, because the quality of results never matched what I hoped for given the huge trade-offs), but I listed plenty of models and software solutions for self-hosting, at home or in the cloud, at https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...

      Feel free to check it out, and if there is something I can clarify, happy to try.