Comment by anonym29
1 day ago
This is a breeze to do with llama.cpp, which has had Anthropic Messages API support for over a month now.
On your inference machine:
you@yourbox:~/Downloads/llama.cpp/bin$ ./llama-server -m <path/to/your/model.gguf> --alias <your-alias> --jinja --ctx-size 32768 --host 0.0.0.0 --port 8080 -fa on
Obviously, feel free to change the port, context size, flash attention, and other params to suit your setup.
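Before pointing Claude Code at it, you can sanity-check that the server is reachable. The /health route is standard llama-server; the /v1/messages call is just a minimal Anthropic-style Messages request, assuming that's the route the Anthropic support exposes (adjust host/port to match what you launched with):
# basic liveness check
curl http://<ip-of-your-inference-system>:8080/health
# minimal Anthropic-style request; the key and prompt are placeholders
curl http://<ip-of-your-inference-system>:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: whatever" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "<your-alias>", "max_tokens": 64, "messages": [{"role": "user", "content": "say hi"}]}'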
Then, on the system you're running Claude Code on:
export ANTHROPIC_BASE_URL=http://<ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model <your-alias> [optionally: --system "your system prompt here"]
Note that the auth token can be any value you want, but it does need to be set; otherwise a fresh CC install will still prompt you to log in / auth with Anthropic or Vertex/Azure/whatever.
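If you do this a lot, a tiny wrapper saves re-exporting everything in each new shell. Rough sketch using only the variables above; claude_local is just a made-up name:
# drop this in ~/.bashrc or ~/.zshrc, then run: claude_local
claude_local() {
    export ANTHROPIC_BASE_URL="http://<ip-of-your-inference-system>:<port>"
    export ANTHROPIC_AUTH_TOKEN="whatever"
    export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
    claude --model "<your-alias>" "$@"
}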
Yup, I've been using llama.cpp for that on my PC, but on my Mac I've found some cases where MLX models work best. I haven't tried MLX with llama.cpp, so I'm not sure how that will work out (or if it's even supported yet).