Comment by sillysaurusx

4 years ago

> Our goal with Ray is to make distributed computing as easy as possible. To do that, we think the serverless direction, which allows people to just focus on their code and not on infrastructure, is very important.

I watched https://now.sh/ deteriorate from a simple, lovely CLI into a dystopian mess because of their push for serverless. They abandoned all other approaches and forced people onto it. Far from making things easier, it became a Kafkaesque pipeline of dependencies and configuration settings just to get even a small example deployed.

Things may be better now, but the experience was so jarring and off-putting that I haven't used now.sh for much of anything since. I used to use it for everything; somehow https://docs.ycombinator.lol/ is still running -- a static site deployed back before their serverless stuff.

I don't know. You might be right. But just remember: the Ray library -- the actual Python library -- is your bread and butter. It's why everyone loves you. I urge you, never make the mistake of letting it deteriorate. It should be rock solid for everyone forever, with no need to interface with any of your serverless components. The day you try to monetize by sneaking in "value adds" -- making the code "easy to integrate" with your serverless stuff -- is the day you open yourself to bugs, and to the temptation to ignore problems in other areas; after all, the serverless infra would be where you're making your money, so it makes sense to push everyone in that direction.

Ray is so excellent right now that it feels like a sports car. I hope it'll stay excellent for a decade to come. (All I want is the ability to recover from client failures in a way where, if there are tasks in flight, I can tell those tasks how to re-run once all the actors have reconnected. I'm sure there's already a way to do something like this; just haven't looked into the details quite yet.)
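(For the record, Ray does expose retry knobs -- `max_retries` on tasks and `max_restarts` on actors -- though whether they cover this exact reconnect scenario I haven't checked. The semantics being asked for can be sketched without Ray at all. This is a hypothetical stdlib-only model, not Ray's API: a dispatcher that tracks in-flight tasks and re-queues them after a disconnect.)

```python
# Hypothetical sketch -- NOT Ray's API. It models the recovery behavior
# described above: if a worker drops while tasks are in flight, those
# tasks are re-queued and re-run once the worker reconnects.

from collections import deque

class Dispatcher:
    def __init__(self):
        self.pending = deque()   # tasks waiting for a worker
        self.in_flight = {}      # task_id -> fn, dispatched but unfinished
        self.results = {}        # task_id -> result

    def submit(self, task_id, fn):
        self.pending.append((task_id, fn))

    def run_once(self, fail=False):
        """Drain the queue; on a simulated disconnect, stop mid-task."""
        while self.pending:
            task_id, fn = self.pending.popleft()
            self.in_flight[task_id] = fn
            if fail:
                # Worker died mid-task: the task is not lost, just unfinished.
                break
            self.results[task_id] = fn()
            del self.in_flight[task_id]

    def reconnect(self):
        """Re-enqueue everything that was in flight when the worker dropped."""
        for task_id, fn in self.in_flight.items():
            self.pending.appendleft((task_id, fn))
        self.in_flight.clear()

d = Dispatcher()
d.submit("a", lambda: 1 + 1)
d.submit("b", lambda: 2 + 2)
d.run_once(fail=True)   # "a" is picked up, then the worker disconnects
d.reconnect()           # "a" goes back to the front of the queue
d.run_once()            # both tasks now complete
print(d.results)        # {'a': 2, 'b': 4}
```

The point of the sketch: recovery only works if something client-side remembers what was in flight, which is exactly the hook being asked for.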

Best of luck, and thanks for the wonderful lib.

EDIT: https://www.anyscale.com/blog/the-ideal-foundation-for-a-gen... just gives me terrible feelings. Your best bet is to ignore me, because my gut is likely wrong here -- at 33, I'm starting to fall past the hill. But for example:

> Ray hides servers

Suppose a hacker wants to build an iOS app powered by Ray. They want to create a cluster of servers to process incoming tasks. The tasks are things like "Make memes with AI," artbreeder-style, and then send them back to the iOS client waiting for them. Then the user can enjoy their meme, you throw up a "if you like memes, give me a dollar and you can have all the memes you want," and a million people download your AI meme app and you become the Zuckerbezos of AI.

In that context, no one wants to hide servers. No one I know -- anywhere -- thinks it's a good idea. We don't want to rely on your magic solutions. We want to keep our servers running. Because my servers happen to be TPUs, and there's no way that TPUs are ever going to become serverless. But even before I was using TPU VMs, all I wanted to do was to just stick GPUs onto servers and send results around; the serverless stuff gave me the creeps. Perhaps this just means I was ineffective, though, and that everyone I know is also ineffective.

I know, I know... you're going to support your non-serverless offerings, and Ray will be wonderful forever, and it'll be roses and rainbows. I hope so. But just don't let the core library become priority #2. It should be priority #1 forever.

Thanks for the comments! A few quick notes:

The term serverless is a bit overloaded. Here's what we want:

(1) Ray users should be able to focus only on their application logic and not have to worry about configuring clusters, at least in the common case (this is not to say that they can't configure clusters if they want to).

(2) Ray applications should be portable: they should run on clusters of different sizes, with different instance types, on different cloud providers, on k8s, or on your laptop. The application should be decoupled from the cluster configuration.

(3) Great support for autoscaling Ray clusters. When a Ray application needs more resources of a certain type (CPUs, GPUs, memory, etc.), those resources should be added to the cluster; when they are no longer needed, the cluster should scale down.

This is quite different from FaaS, though seems in line with the spirit of serverless. And these "serverless" properties are not something we think of as separate from Ray, but rather as part of just making Ray work better.
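Point (2) -- decoupling the application from the cluster -- is roughly what a cluster-launcher config file expresses. An illustrative sketch, written in the style of Ray's cluster YAML (field names and schema vary by Ray version, and the instance types and counts here are invented for the example):

```yaml
# Illustrative only -- check the Ray docs for the exact schema of your version.
cluster_name: example-app       # hypothetical name
max_workers: 8                  # autoscaler upper bound: point (3)
provider:
  type: aws                     # swap providers without touching app code: point (2)
  region: us-west-2
available_node_types:
  gpu_worker:
    min_workers: 0              # scale to zero when idle
    max_workers: 8
    resources: {"GPU": 1}       # the application only declares resource needs
                                # (e.g. a task that requires one GPU), never
                                # specific machines: point (1)
```

The application side then stays unchanged across clusters: the same code runs anywhere a cluster satisfying these resource shapes exists.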

  • Yeah, we’ll have to agree to disagree that that’s a good idea.

    #3 is the crux of Ray’s power. I don’t understand why serverless is necessary. By default, I want to configure it to use my TPUs, which I control. I’m the one who creates or deletes TPUs — because I have to be! There’s no “TPU creation service” that can autoscale TPUs on demand.

    Maybe there will be one day, but not today. And focusing on today will help you long term, because it's a very real need that people actually have. It's hard to predict what people will want; the safest bet is to cater to existing needs. That's how every great startup got its footing: build something people want.

    In that context, how could it be otherwise? When a TPU comes up, it connects to the Ray cluster. That TPU cannot be shut down. I'm using it for training, for inference -- for everything I'd be paying for. Interrupting those tasks isn't something that I can "just be okay with."

    I agree that inference tasks should be able to die occasionally. But training tasks, all the time? It’s not an abstraction — you can’t pretend that the training task can run on an arbitrary TPU. It needs to run on a specific TPU, because that TPU has the model loaded.

    What am I missing?

  • Just to add on here: it may be fixed now, but the most recent time I tried to use Ray on my on-prem cluster, it was quite difficult, as everything assumed I was either in one of the clouds or on a single local laptop. I had used Ray pre-1.0, maybe 3 years ago, and it was trivial to set up this use case back then.

    I don't remember the specifics right now, but my problem may have been related to having a local Kubernetes cluster that wasn't in a cloud. I'd be happy to dig through the details in my notes if it's worthwhile feedback -- just reach out.

    • (Signal boosting this. I, too, found it incredibly difficult to set up Ray for my purposes because of all the cloud provider assumptions.)

  • Is there a plan to integrate more tightly with k8s, potentially in a multi-cluster/federated setting? It's a lot easier to get buy-in for Ray adoption from infra teams where k8s is the centralized compute substrate.