
Comment by sluongng

3 hours ago

I saw plenty of cool advancements in reducing inference cold start when I met with folks in person at FOSDEM this year. However, I still struggle to understand: why would folks care about this?

Major AI labs have all secured their own compute in the form of hardware, data centers, and power generation. That means their resource pool is fixed, and they can use all sorts of tricks (pre-loading, pre-allocating, etc.) to improve inference latency.

Fast cold start is usually a solution for a "cloud" environment, where your pool is flexible and you only pay for what you use. It matters less in bare-metal settings, since folks there do not care as much about scaling up and down.

So my question is: who is this for? AWS and GCP running Anthropic models?

At least folks like me care about it. My local hardware is more than enough to handle my app, but given that Spectrum's internet service is as fickle as a broken fiddle, I'm forced to rent a dedicated cloud GPU that sits idle most days. I would save a serious chunk of change if I could boot up a GPU snapshot in ~10s. I evaluated various options a while back and, while modal.com was the fastest, it still took around a minute-ish. Granted, my use case is unique, but I imagine this could be a decent solution for GPU-poor ComfyUI users.

Everybody who's not a major AI lab :) There are deep spot and on-demand markets for GPUs. Many of the buyers in those markets sell SaaS and use GPUs to serve LLM-based features.

Those features would be really expensive if they left the same GPUs running 24/7. Some SaaS products even see minute-by-minute fluctuations in demand. Fast cold starts let them follow their usage curves closely, scaling capacity up and down as demand moves.
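
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the $2/hour rate, the 40 busy hours, and the 300 cold starts per month are made up, not real pricing or measurements. It also only counts billed GPU time; the time a user spends waiting on a cold start is a separate cost that it ignores.

```python
# Back-of-the-envelope comparison: always-on GPU vs. scale-to-zero.
# All names and numbers here are hypothetical, for illustration only.

HOURS_PER_MONTH = 730.0  # common approximation: 8760 hours / 12 months


def monthly_cost_always_on(hourly_rate: float) -> float:
    """Cost of keeping one GPU running for the whole month."""
    return hourly_rate * HOURS_PER_MONTH


def monthly_cost_scale_to_zero(
    hourly_rate: float,
    busy_hours: float,
    cold_starts: int,
    cold_start_seconds: float,
) -> float:
    """Cost when the GPU is billed only while serving or starting up."""
    startup_hours = cold_starts * cold_start_seconds / 3600.0
    return hourly_rate * (busy_hours + startup_hours)


if __name__ == "__main__":
    rate = 2.0  # assumed $/hour
    print("always on:       ", monthly_cost_always_on(rate))
    for seconds in (60.0, 10.0):
        cost = monthly_cost_scale_to_zero(
            rate, busy_hours=40.0, cold_starts=300, cold_start_seconds=seconds
        )
        print(f"cold start {seconds:>4.0f}s:", round(cost, 2))
```

With these made-up inputs, most of the savings come from not paying for idle hours at all. The cold start duration mainly decides whether scaling to zero is tolerable for users in the first place.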

Also see: Silicon Valley runs on customized Chinese AI - https://x.com/petergyang/status/2042248752157839793