Comment by anonym29
1 day ago
This is a breeze to do with llama.cpp, which has had Anthropic Messages API support for over a month now.
On your inference machine:
you@yourbox:~/Downloads/llama.cpp/bin$ ./llama-server -m <path/to/your/model.gguf> --alias <your-alias> --jinja --ctx-size 32768 --host 0.0.0.0 --port 8080 -fa on
Obviously, feel free to change the port, context size, flash attention, and other params to suit your setup.
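Before pointing Claude Code at it, you can sanity-check that the server is reachable. The /health route is standard llama-server; the /v1/messages call is just a minimal Anthropic-style Messages request, assuming that's the route the Anthropic support exposes (adjust host/port to match what you launched with):
# basic liveness check
curl http://<ip-of-your-inference-system>:8080/health
# minimal Anthropic-style request; the key and prompt are placeholders
curl http://<ip-of-your-inference-system>:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: whatever" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "<your-alias>", "max_tokens": 64, "messages": [{"role": "user", "content": "say hi"}]}'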
Then, on the system you're running Claude Code on:
export ANTHROPIC_BASE_URL=http://<ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model <your-alias> [optionally: --system "your system prompt here"]
Note that the auth token can be any value you want, but it does need to be set; otherwise a fresh CC install will still prompt you to log in / auth with Anthropic or Vertex/Azure/whatever.
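If you do this a lot, a tiny wrapper saves re-exporting everything in each new shell. Rough sketch using only the variables above; claude_local is just a made-up name:
# drop this in ~/.bashrc or ~/.zshrc, then run: claude_local
claude_local() {
    export ANTHROPIC_BASE_URL="http://<ip-of-your-inference-system>:<port>"
    export ANTHROPIC_AUTH_TOKEN="whatever"
    export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
    claude --model "<your-alias>" "$@"
}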
Yup, I've been using llama.cpp for that on my PC, but on my Mac I've found some cases where MLX models work best. I haven't tried MLX with llama.cpp, so I'm not sure how that will work out (or if it's even supported yet).