
Comment by ramoz

4 years ago

I spent the past year and a half deploying a distributed backend for BERT-like models, and we ultimately chose a K8s architecture with "precise" affinity mapped out, which is still hard to get right due to CPU pinning issues. On the frontend API, Golang gives us the ability to distribute and split incoming requests (10-20M/day, batch sizes averaging ~3K, which get split into 50 due to model constraints; see the sketch below). Embeddings are stored on those nodes on local SSDs, and there are only a handful of them. Models run on 2 pools, 1 dedicated and 1 preemptible (most nodes are here), which gives us cost optimization, and scheduling is simplified thanks to K8s. We have anywhere from 120-300 of these high-compute nodes.
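
As a toy illustration of that splitting step, here is a minimal Go sketch. Everything in it is hypothetical ("splits into 50" could mean 50 sub-batches or sub-batches of 50, so treat the sizes as placeholders), and a real service would fan the chunks out to model nodes rather than just counting them:

    package main

    import "fmt"

    // Record is a hypothetical stand-in for a single inference input.
    type Record struct{ ID int }

    // splitBatch chops an incoming batch into chunks of at most chunkSize,
    // the per-call limit imposed by the model.
    func splitBatch(batch []Record, chunkSize int) [][]Record {
        chunks := make([][]Record, 0, (len(batch)+chunkSize-1)/chunkSize)
        for len(batch) > chunkSize {
            chunks = append(chunks, batch[:chunkSize])
            batch = batch[chunkSize:]
        }
        if len(batch) > 0 {
            chunks = append(chunks, batch)
        }
        return chunks
    }

    func main() {
        batch := make([]Record, 3000)           // ~3K, the average batch size
        fmt.Println(len(splitBatch(batch, 60))) // 50 chunks of 60
    }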

Wondering if anyone has similar deployments and has migrated to Ray. We've evaluated it, but we can't afford a large migration at this point, and we'd also need to test quite a bit and rebuild our whole automation for infra and apps.

Really interested though, as the infrastructure isn't cheap, and every time the model updates we are basically re-architecting it. Right now we are moving everything away from Python (gunicorn/flask, and MKL) to Golang, as we can get better efficiencies with data serialization (numpy ops are the biggest time eaters right now ... model input vectors constructed from flatbuffers).
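
On the flatbuffers point, here is a minimal sketch of building a model input vector with the FlatBuffers Go runtime (github.com/google/flatbuffers/go). In real code the vector would sit inside a table generated by flatc from your schema; the bare vector root here is a simplification for brevity:

    package main

    import (
        "fmt"

        flatbuffers "github.com/google/flatbuffers/go"
    )

    // buildInputVector serializes a float32 feature vector. Shown bare for
    // brevity; schema-generated code would wrap the vector in a table.
    func buildInputVector(vals []float32) []byte {
        b := flatbuffers.NewBuilder(len(vals)*4 + 64)
        b.StartVector(4, len(vals), 4) // 4-byte elems, 4-byte alignment
        for i := len(vals) - 1; i >= 0; i-- {
            b.PrependFloat32(vals[i]) // builders write back-to-front
        }
        b.Finish(b.EndVector(len(vals)))
        return b.FinishedBytes()
    }

    func main() {
        buf := buildInputVector([]float32{0.1, 0.2, 0.3})
        fmt.Println(len(buf), "bytes") // readers access this buffer zero-copy
    }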

Have you considered Rust? It has a neat Python interconnect. My team is doing early experiments, offloading tight loops/expensive computations in an existing Python inference app to Rust.

Are you running inference on CPU or GPU?

  • CPU; GPU doesn't work out well in our case. There is the data-transfer cost as well as a memory constraint, and the model blows up in memory on every inference call.

    • 2x Gunicorn workers, with MKL mapped to half the physical cores for each... for some reason the model (TensorFlow) performs better with two half-core workers than with one worker using all the cores. A sketch of that layout follows.
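
      For reference, a minimal launch-script sketch of that layout, assuming a 16-physical-core node (the core count and the `app:server` module path are made up):

          # Cap MKL/OpenMP threads to half the physical cores per worker.
          export MKL_NUM_THREADS=8
          export OMP_NUM_THREADS=8
          # Two Gunicorn workers, each inheriting the thread caps.
          gunicorn -w 2 app:server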