This is a breeze to do with llama.cpp, which has had Anthropic Messages API support for over a month now.
On your inference machine, start llama-server with your model. Obviously, feel free to change your port, context size, flash attention, other params, etc.
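A rough sketch of what that can look like (the model path, alias, port, and context size here are placeholders picked for illustration, not values from the original post, and the flash-attention flag syntax differs between llama.cpp builds):
llama-server \
  -m /path/to/your-model.gguf \
  --alias your-alias \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 32768 \
  --flash-attn on
Whatever you pass to --alias is what <your-alias> refers to in the claude --model invocation below.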
Then, on the system you're running Claude Code on:
export ANTHROPIC_BASE_URL=http://<ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model <your-alias> [optionally: --system "your system prompt here"]
Note that the auth token can be whatever value you want, but it does need to be set; otherwise a fresh CC install will still prompt you to log in / auth with Anthropic or Vertex/Azure/whatever.
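If you do this often, one option is to drop those exports into a small wrapper script. This is just an illustration with made-up values (the script name, IP, port, and alias are all placeholders, not anything from the thread):
#!/usr/bin/env sh
# claude-local.sh (hypothetical) -- launch Claude Code against a local llama-server
export ANTHROPIC_BASE_URL=http://192.168.1.50:8080   # <ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"                # any non-empty value works, see note above
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
exec claude --model your-alias "$@"                   # extra args fall through to claude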
Yup, I've been using llama.cpp for that on my PC, but on my Mac I found some cases where MLX models work best. I haven't tried MLX with llama.cpp, so I'm not sure how that will work out (or if it's even supported yet).
Well, to whoever downvoted my comment: It's supported now!!!! https://lmstudio.ai/blog/claudecode