Comment by OtherShrezzing

11 days ago

A useful feature would be a slow mode that gets low-cost compute at spot pricing.

I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle GPU at a much lower cost than the standard offering.

OpenAI offers that, or at least used to. You can batch all your inference and get much lower prices.

  • Still do. Great for workloads where it's okay to bundle a bunch of requests and wait some hours (up to 24h, usually done faster) for all of them to complete.
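    For anyone who hasn't used it, the flow with the official Python SDK looks roughly like this (the file name and prompts are just illustrative; batched requests are billed at about half the synchronous rate, last I checked):

        from openai import OpenAI

        client = OpenAI()

        # Upload a JSONL file where each line is one chat.completions request
        batch_file = client.files.create(
            file=open("requests.jsonl", "rb"),
            purpose="batch",
        )

        # Create the batch; results arrive within the 24h completion window
        batch = client.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )

        # Poll later and fetch the output file once the status is "completed"
        print(client.batches.retrieve(batch.id).status)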

Yep, same. I often wonder why this isn't a thing yet: running some tasks overnight at e.g. 50% of the cost. There is the batch API, but it isn't integrated into e.g. Claude Code.
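Anthropic does expose the same idea through its Message Batches API (also at roughly half price, if I remember right), it just isn't wired into Claude Code. A rough sketch with their Python SDK, with the model name and prompt as placeholders:

    import anthropic

    client = anthropic.Anthropic()

    # Queue requests; each gets a custom_id so results can be matched up later
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": "nightly-task-1",
                "params": {
                    "model": "claude-sonnet-4-20250514",  # placeholder model name
                    "max_tokens": 1024,
                    "messages": [{"role": "user", "content": "Summarize today's notes."}],
                },
            },
        ]
    )

    # Check back later; processing_status flips to "ended" when results are ready
    print(client.messages.batches.retrieve(batch.id).processing_status)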

> I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle GPU at a much lower cost than the standard offering.

If it's not time sensitive, why not just run it on CPU/RAM rather than GPU?

  • Run what exactly?

    • I'm assuming GP means 'run inference locally on CPU or RAM'. You can run really big LLMs on local infra; they just do a fraction of a token per second, so it might take all night to get a paragraph or two of text. Mix in things like thinking and tool calls, and it will take a long, long time to get anything useful out of it.

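      For a concrete sense of what that looks like, here's a minimal local-CPU sketch using llama-cpp-python; the model path and thread count are made up, and on a really big model the throughput really is a small fraction of a token per second:

          # pip install llama-cpp-python
          from llama_cpp import Llama

          # Load a quantized GGUF model entirely into system RAM (path is a placeholder)
          llm = Llama(model_path="./models/big-model-q4.gguf", n_ctx=4096, n_threads=16)

          # Kick it off before bed; on very large models this can take hours
          out = llm("Draft a summary of the attached notes.", max_tokens=512)
          print(out["choices"][0]["text"])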

  • Does that even work out to be cheaper, once you factor in how much extra power you'd need?

    • How much extra power do you think you would need to run an LLM on a CPU (one that still fits in RAM and is still useful)? I have a beefy CPU, and if I ran it 24/7 for a month it would only cost about $30 in electricity.
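      Rough math behind that figure, assuming around 150 W sustained draw and $0.28/kWh (both numbers are assumptions, not from the comment above):

          # Back-of-the-envelope electricity cost of running a CPU 24/7 for a month
          watts = 150                  # assumed sustained draw under load
          hours = 24 * 30              # one month of continuous use
          kwh = watts * hours / 1000   # 108 kWh
          price_per_kwh = 0.28         # assumed electricity price in USD
          print(f"~${kwh * price_per_kwh:.0f}/month")  # ~$30/month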