
Comment by ramoz

4 years ago

I spent the past year and a half deploying a distributed backend for BERT-like models, and we ultimately chose a K8s architecture with "precise" affinity mapped out, which is still hard to get right due to CPU pinning issues. On the frontend API, Golang gives us the ability to distribute and split incoming requests (10-20M/day, batch sizes averaging ~3K, which get split into 50 due to model constraints; see the sketch below). Embeddings are stored on those nodes on local SSDs, and there are only a handful of them. Models run on 2 pools, 1 dedicated and 1 preemptible (most nodes are here), which gives us cost optimization, and scheduling is simplified thanks to K8s. We have anywhere from 120-300 of these high-compute nodes.
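
As a toy illustration of that splitting step, here is a minimal Go sketch. Everything in it is hypothetical ("splits into 50" could mean 50 sub-batches or sub-batches of 50, so treat the sizes as placeholders), and a real service would fan the chunks out to model nodes rather than just counting them:

    package main

    import "fmt"

    // Record is a hypothetical stand-in for a single inference input.
    type Record struct{ ID int }

    // splitBatch chops an incoming batch into chunks of at most chunkSize,
    // the per-call limit imposed by the model.
    func splitBatch(batch []Record, chunkSize int) [][]Record {
        chunks := make([][]Record, 0, (len(batch)+chunkSize-1)/chunkSize)
        for len(batch) > chunkSize {
            chunks = append(chunks, batch[:chunkSize])
            batch = batch[chunkSize:]
        }
        if len(batch) > 0 {
            chunks = append(chunks, batch)
        }
        return chunks
    }

    func main() {
        batch := make([]Record, 3000)           // ~3K, the average batch size
        fmt.Println(len(splitBatch(batch, 60))) // 50 chunks of 60
    }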

Wondering if anyone has similar deployments and has migrated to Ray. We've evaluated it, but we can't afford a large migration at this point, and we'd also need to test quite a bit and rebuild our whole automation for infra and apps.

Really interested though, as the infrastructure isn't cheap, and every time the model updates we are basically re-architecting it. Right now we are moving everything away from Python (gunicorn/flask, and MKL) to Golang, as we can get better efficiencies with data serialization (numpy ops are the biggest time eaters right now ... model input vectors constructed from flatbuffers).
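
On the flatbuffers point, here is a minimal sketch of building a model input vector with the FlatBuffers Go runtime (github.com/google/flatbuffers/go). In real code the vector would sit inside a table generated by flatc from your schema; the bare vector root here is a simplification for brevity:

    package main

    import (
        "fmt"

        flatbuffers "github.com/google/flatbuffers/go"
    )

    // buildInputVector serializes a float32 feature vector. Shown bare for
    // brevity; schema-generated code would wrap the vector in a table.
    func buildInputVector(vals []float32) []byte {
        b := flatbuffers.NewBuilder(len(vals)*4 + 64)
        b.StartVector(4, len(vals), 4) // 4-byte elems, 4-byte alignment
        for i := len(vals) - 1; i >= 0; i-- {
            b.PrependFloat32(vals[i]) // builders write back-to-front
        }
        b.Finish(b.EndVector(len(vals)))
        return b.FinishedBytes()
    }

    func main() {
        buf := buildInputVector([]float32{0.1, 0.2, 0.3})
        fmt.Println(len(buf), "bytes") // readers access this buffer zero-copy
    }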

Have you considered Rust? It has a neat Python interconnect. My team is doing early experiments, offloading tight loops/expensive computations in an existing Python inference app to Rust.

Are you running inference on CPU or GPU?

  • CPU; GPU doesn't work out well in our case. There is the data-transfer cost as well as a memory constraint, and the model blows up in memory on every inference call.

    • 2x Gunicorn workers, with MKL mapped to half the physical cores for each... for some reason the model (TensorFlow) performs better with two half-core workers than with one worker using all the cores. A sketch of that layout follows.
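
      For reference, a minimal launch-script sketch of that layout, assuming a 16-physical-core node (the core count and the `app:server` module path are made up):

          # Cap MKL/OpenMP threads to half the physical cores per worker.
          export MKL_NUM_THREADS=8
          export OMP_NUM_THREADS=8
          # Two Gunicorn workers, each inheriting the thread caps.
          gunicorn -w 2 app:server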