Comment by 0xbadcafebee

1 day ago

Hi David, thanks for trying to fix the cloud. There is a persistent problem with all cloud providers that none of them has fixed yet (and I don't expect any ever will). I imagine users will not care about this issue, so this might not be worth solving. But if you'd like to have the only cloud provider (or technology in general) that can solve this problem, it would make cloud computers less annoying.

If you want to run a website in the cloud, you start with an API, right? A CRUD API with commands like "make me a VPC with subnet 1.2.3.0/24", "make me a VM with 2GB RAM and 1 vCPU", "allow tcp port 80 and 443 to my VM", etc. Over time you create and change more things; things work, everybody's happy. At some point, one of the things changes, and now the website is broken. You could use Terraform or Ansible to try to fix this: first write configs that (hopefully) describe the right state, then re-run the IaC to re-apply those parameters. But your website is already down, and you don't really want to maintain a complex config and tool.

You can't avoid this problem because the cloud's design is bad. The CRUD method works at first to get things going. But eventually VMs stop, things get deleted, parameters of resources get changed. K8s was (partly) made to address this, with a declarative config and a server that constantly "fixes" the resources back to the declared state. But K8s is hell because it uses a million abstractions to do a simple thing: ensure my stuff stays working. I should be able to point and click to set it up, and the cloud should remember it. Then if I try to change something like the security group, it should error saying "my dude, if you remove port 443 from the security group, your website will go down". Of course the cloud can't really know what will break what, unless the user defines their application's architecture. So the cloud should let the user define that architecture, have a server component that keeps ensuring everything's there and works, and stop people from footgunning themselves.
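To make the "server that keeps fixing things" idea concrete, here's a rough sketch of a reconcile pass. It's Python purely for illustration; `Cloud`, `reconcile`, and the resource names like `sg/web` are all made up and don't correspond to any real provider's API. A real system would run this in a loop or on change events.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Cloud:
    """Hypothetical stand-in for a provider's CRUD API; holds the actual state."""
    resources: dict = field(default_factory=dict)

    def get(self, name):
        return self.resources.get(name)

    def put(self, name, spec):
        # Store a deep copy so later drift never mutates the declared state.
        self.resources[name] = copy.deepcopy(spec)


def reconcile(declared, cloud):
    """Compare declared state to actual state and repair any drift.

    Returns the names of resources that had to be created or corrected.
    """
    repaired = []
    for name, spec in declared.items():
        if cloud.get(name) != spec:
            cloud.put(name, spec)
            repaired.append(name)
    return repaired


# What the user set up once (by clicking or by API) and the cloud remembered.
declared = {
    "vm/web": {"ram_gb": 2, "vcpus": 1, "state": "running"},
    "sg/web": {"tcp_ports": [80, 443]},
}

cloud = Cloud()
first_pass = reconcile(declared, cloud)        # nothing exists yet, so create it all
cloud.resources["sg/web"]["tcp_ports"] = [80]  # someone removes 443 by hand
second_pass = reconcile(declared, cloud)       # drift detected, 443 is put back
third_pass = reconcile(declared, cloud)        # already in the declared state
```

The point is that the declared state is the only thing the user maintains; the comparison and repair are the cloud's job.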

Everything that affects the user is a distributed system with mutable state. When that state changes, it can break something. So the system should continuously manage itself to fix issues that could break it. Part of that requires tracking dependencies, with guardrails to determine if a change might break something. Another part requires versioning the changes, so the user (or system) can easily roll back the whole system state to before it broke. This abstraction is complicated, but it's a solution to a complex problem: keeping the system working.
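The guardrail and rollback parts could look something like the sketch below. Again this is only an illustration under my own assumptions: `System`, `require`, `apply`, `rollback`, and `GuardrailError` are hypothetical names, and a "dependency" here is just a check the user declares against the whole state.

```python
import copy


class GuardrailError(Exception):
    """Raised when a change would violate a declared requirement."""


class System:
    def __init__(self, state):
        self.state = copy.deepcopy(state)
        self.history = []       # earlier versions of the whole state
        self.requirements = []  # (description, check) pairs declared by the user

    def require(self, description, check):
        """Declare a dependency: check(state) must stay true after any change."""
        self.requirements.append((description, check))

    def apply(self, change):
        """Run change(state) on a copy; commit only if every check still passes."""
        candidate = copy.deepcopy(self.state)
        change(candidate)
        broken = [d for d, check in self.requirements if not check(candidate)]
        if broken:
            raise GuardrailError("change would break: " + ", ".join(broken))
        self.history.append(self.state)  # version the state before replacing it
        self.state = candidate

    def rollback(self):
        """Restore the whole system to the previous committed version."""
        self.state = self.history.pop()


system = System({"sg/web": {"tcp_ports": [80, 443]}, "vm/web": {"disk_gb": 10}})
system.require("website needs tcp 443 open",
               lambda s: 443 in s["sg/web"]["tcp_ports"])

system.apply(lambda s: s["vm/web"].update(disk_gb=20))  # harmless, so it commits

try:
    system.apply(lambda s: s["sg/web"]["tcp_ports"].remove(443))
    rejected = False
except GuardrailError as err:
    rejected = True
    message = str(err)  # "change would break: website needs tcp 443 open"

system.rollback()  # undo the disk resize; state is back to the first version
```

Changes are tried on a copy, so a rejected change never touches the live state, and every committed change leaves a version to roll back to.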

No cloud deals with this because it's too hard. But your cloud is extremely simple, so it might work. Ideally, every resource in your cloud (exe.dev) should work this way: your team membership settings, whether a proxy is public, the state of your VM, your DNS settings, the SSH keys allowed, email settings, HTTP proxy integration / repo integration settings / their attachments, VM tags & disk sizes, etc. Over time your system will add more pieces and get more complex, to the point that implementing these protections will be too hard and you won't even consider it. But your system is small right now, so you might be able to get it working. The end result should be less pain for the user because the system protects them from pain (fixing broken things, preventing breaking things), and more money for you because people like systems that don't break. But it's also possible nobody cares about this stuff until the system gets really big, so maybe your users won't care. It would be nice to have a cloud that fixes this tho.