Comment by sillysaurusx
4 years ago
Yeah, we’ll have to agree to disagree that that’s a good idea.
#3 is the crux of Ray’s power. I don’t understand why serverless is necessary. By default, I want to configure it to use my TPUs, which I control. I’m the one who creates or deletes TPUs — because I have to be! There’s no “TPU creation service” that can autoscale TPUs on demand.
Maybe there will be one day, but not today. And focusing on today will help you in the long term, because manually managed TPUs are a very real need that people actually have. It's hard to predict what people will want; the safest bet is to cater to existing needs. That's how every great startup got its footing: build something people want.
In that context, how could it be otherwise? When a TPU comes up, it connects to the Ray cluster. That TPU can't be shut down out from under me: I'm using it for training, for inference, for everything I'm paying for. Interrupting those tasks isn't something I can "just be okay with."
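For concreteness, here's a sketch of the flow I mean (Python; the head address and the "TPU" resource label are placeholders I'm picking for illustration, not anything Ray ships with):

    import ray

    # Assuming a head node I started myself and a TPU VM I created by hand.
    # On the TPU VM, I'd attach it to the cluster with something like:
    #   ray start --address='<head-ip>:6379' --resources='{"TPU": 8}'
    ray.init(address="auto")        # connect this driver to the existing cluster
    print(ray.cluster_resources())  # should now include the hand-attached "TPU"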
I agree that inference tasks should be able to die occasionally. But training tasks, all the time? A training task isn't an abstraction you can reschedule anywhere: you can't pretend it can run on an arbitrary TPU. It needs to run on a specific TPU, because that TPU is the one with the model loaded.
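Here's a sketch of what I mean, assuming a cluster with a hand-attached "TPU" node like above: the model state lives inside a long-lived actor, and that actor lives on whichever TPU host Ray pins it to.

    import ray

    ray.init(address="auto")

    # Only schedulable on a node advertising the custom "TPU" resource; once
    # placed, the actor stays on that host for its entire lifetime.
    @ray.remote(resources={"TPU": 1})
    class Trainer:
        def __init__(self):
            # Stand-in for the expensive part: in reality this is where the
            # model gets loaded onto *this host's* TPU. Preempt the host and
            # that state is gone.
            self.step = 0

        def train_step(self):
            self.step += 1
            return self.step

    trainer = Trainer.remote()                   # pinned to one TPU node
    print(ray.get(trainer.train_step.remote()))  # every call hits that same host

Kill that host mid-run and you don't get a task rescheduled somewhere else; you get a dead actor and a lost model.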
What am I missing?