Comment by stuaxo

6 months ago

The server in llama-cpp is documented as being only for demonstration purposes, but ollama supports it as a way to run models.

For work we are given Macs, so the GPU can't be passed through to Docker.

I wanted a client/server setup where the server hosts the LLM and runs outside of Docker, without me having to write the client/server part myself.

I run my model in ollama, then inside the code I use litellm to talk to it during local development.
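
For anyone wanting to do the same, here's a minimal sketch of that setup. The model name (`llama3`) is just an example, and the host assumes ollama's default port 11434; when the code runs inside Docker on a Mac, point it at `host.docker.internal` instead of `localhost`:

```python
# Minimal sketch: talk to a locally running ollama model via litellm.
from litellm import completion

# ollama's default endpoint; from inside Docker on a Mac this would be
# "http://host.docker.internal:11434" instead of localhost.
OLLAMA_BASE = "http://localhost:11434"

response = completion(
    model="ollama/llama3",  # any model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Say hello"}],
    api_base=OLLAMA_BASE,
)

print(response.choices[0].message.content)
```

Because litellm returns an OpenAI-style response object, the same code can later be pointed at a hosted provider by swapping the `model` string, which is the main reason this works nicely for local development.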