Comment by robertnishihara

4 years ago

Thanks for the comments! A few quick notes:

The term "serverless" is a bit overloaded. Here's what we want:

1. Ray users should be able to focus only on their application logic and not have to worry about configuring clusters, at least in the common case (this is not saying that they can't configure clusters if they want to).
2. Ray applications should be portable: they should run on clusters of different sizes, with different instance types, on different cloud providers, on k8s, on your laptop, or anywhere else. The application should be decoupled from the cluster configuration.
3. Great support for autoscaling Ray clusters. When a Ray application needs more resources of a certain type (CPUs, GPUs, memory, etc.), those resources should be added to the cluster. When they are no longer needed, the cluster should scale down.
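To make (3) concrete, here's a rough sketch of how this looks from the application side today: tasks declare the resources they need, and the autoscaler SDK can be used to hint at upcoming demand. The bundle shapes and the task itself are purely illustrative.

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # connect to whatever cluster this runs on

# Tasks declare what they need; the autoscaler adds nodes that can satisfy
# pending demand and scales them back down once they sit idle.
@ray.remote(num_gpus=1)
def train_shard(shard_id):
    return shard_id  # stand-in for real work that needs a GPU

# Optional hint: ask the autoscaler to provision capacity up front.
# (The bundle shapes here are illustrative, not a recommendation.)
request_resources(bundles=[{"GPU": 1}] * 8)

results = ray.get([train_shard.remote(i) for i in range(8)])
```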

This is quite different from FaaS, though it seems in line with the spirit of serverless. And these "serverless" properties are not something we think of as separate from Ray, but rather as part of making Ray work better.

Yeah, we’ll have to agree to disagree that that’s a good idea.

#3 is the crux of Ray’s power. I don’t understand why serverless is necessary. By default, I want to configure it to use my TPUs, which I control. I’m the one who creates or deletes TPUs — because I have to be! There’s no “TPU creation service” that can autoscale TPUs on demand.

Maybe there will be one day, but not today. And focusing on today will help you long term, because it's a very real need that people actually have. It's hard to predict what people will want, beyond catering to existing needs. That's how every great startup got its footing: build something people want.

In that context, how could it be otherwise? When a TPU comes up, it connects to the Ray cluster. That TPU cannot be shut down. I'm using it for training, for inference, for everything I'm paying for it to do. Interrupting those tasks isn't something I can "just be okay with."
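(Concretely, "comes up and connects" on my end looks roughly like the sketch below, using custom resources; the resource name, count, and address are placeholders for my actual setup.)

```python
import ray

# On the TPU host, the node joins the existing cluster and advertises a
# custom resource (shell command shown as a comment; name/count are placeholders):
#   ray start --address=<head-node-ip>:6379 --resources='{"TPU": 8}'

ray.init(address="auto")

@ray.remote(resources={"TPU": 1})
def work_on_my_tpu_host(x):
    # Only scheduled on nodes that advertise the custom "TPU" resource.
    return x

print(ray.get(work_on_my_tpu_host.remote(42)))
```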

I agree that inference tasks should be able to die occasionally. But training tasks, all the time? It’s not an abstraction — you can’t pretend that the training task can run on an arbitrary TPU. It needs to run on a specific TPU, because that TPU has the model loaded.
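(A minimal sketch of what I mean, reusing the custom "TPU" resource from above: the trainer is a stateful actor whose model lives in memory on a specific host, not a fungible task.)

```python
import ray

ray.init(address="auto")

@ray.remote(resources={"TPU": 1})
class Trainer:
    """Holds model state in memory on a specific TPU host (placeholder state)."""

    def __init__(self):
        self.step = 0
        self.weights = [0.0] * 1024  # stand-in for weights living on the device

    def train_step(self):
        self.step += 1  # stand-in for an actual training update
        return self.step

trainer = Trainer.remote()
print(ray.get(trainer.train_step.remote()))
# If the node backing this actor is scaled away, the in-memory state goes
# with it; the work can't transparently resume on some other TPU.
```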

What am I missing?

Just to add on here: it may be fixed now, but the most recent time I tried to use Ray on my on-prem cluster, it was quite difficult because everything assumed I was either in one of the clouds or on a single local laptop. I had used Ray pre-1.0, maybe 3 years ago, and it was trivial to set up this use case back then.

I don't remember the specifics right now, but my problem may have been related to having a local Kubernetes cluster that wasn't in a cloud. I'd be happy to dig through the details in my notes if it's worthwhile feedback; just reach out.
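For reference, what I expected to be able to do was roughly the manual setup sketched below (the address and port are placeholders); the friction was everything around it assuming a cloud provider.

```python
# On the head node (shell):
#   ray start --head --port=6379
# On each on-prem worker node (shell):
#   ray start --address=<head-node-ip>:6379
# (The address/port above are placeholders for whatever the machines use.)

import ray

# From a driver on any machine that can reach the head node:
ray.init(address="auto")

# Sanity check that the on-prem nodes actually registered.
print(ray.cluster_resources())
```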

  • (Signal boosting this. I, too, found it incredibly difficult to set up Ray for my purposes because of all the cloud provider assumptions.)

Is there a plan to integrate more tightly with k8s, potentially in a multi-cluster/federated setting? It's a lot easier to get buy-in for Ray adoption from infra teams where k8s is the centralized compute substrate.