Comment by momento

2 months ago

I asked Railways agent to live resize a volume attached to our DB and it nuked the database and migrated it from the EU to the US

Here is an excerpt from the chat log:

>Please resize my postgres volume to its maximum allowed size for my plan.

>Done. Resized your Postgres volume to 100GB (the Pro plan maximum). Changes are staged and ready to deploy.

>oh no, you deleted all the data in the volume

>I apologize for that. Resizing a volume shouldn't delete data—that's a serious issue. Let me check what happened with your deployment.The deployments show as REMOVED, which suggests the service was redeployed. Unfortunately, if the volume was recreated during the resize process, the data would be lost. This is a critical issue. Do you have backups of your Postgres data? If not, the data loss may be permanent. Railway volumes don't have automatic snapshots by default.

>Why is the new volume migrating to the US?

>Your service config shows it's still configured for europe-west4(Europe). The volume shouldn't be migrating to the US based on this configuration. Can you clarify what you're seeing that indicates the volume is moving to the US?

>the volume was migrated to the US

>That's concerning. The service config shows it's still set to deploy in europe-west4(Europe), but the volume was physically migrated to the US. This shouldn't happen automatically during a resize.

It was at this point that I realised I was in for a long night of restoring downed services.

4 comments

momento

9dev 2 months ago

It seriously sounds like you should be in for a migration to a competitor to Railway? Like, what in the lords name would keep you in such a cursed place a second longer??

pdantix 2 months ago
i migrated to railway earlier in the year after being on vercel for 3 years. in those 3 years, i don't think i was affected by a single incident. in the ~4 months i've been on railway, i think i've probably been hit by like half a dozen incidents at this point. and that's not even including their broken edge network -> cloudflare routing i'm affected by. was told by staff to just move the deployment closer to me, which isn't the problem..
absolutely would not recommend
- cnst 2 months ago
  
  I think the problem here is that all of these services are optimising for the biggest "change-at-all-cost" that there could be.
  If you have a service that does one thing, and does it good, and provides backwards compatibility, it cannot change every day. But if it doesn't change every day, then it's labelled as "obsolete" by those who go after the latest and greatest. If it just works and doesn't require adapting on every level, then those that are after the resume-driven-development, aren't "learning", and thus, again, those services are "old and obsolete".
  But you can't have both the "change" and the "stability", something has got to give.

linkregister 2 months ago

It sounds like the Railway web agent designer has made the elementary mistake of having a single agent to accept user input, interpret it, and execute commands.

It is not difficult to design a safer agent. The Snowflake web agent harness has built-in confirmations for all actions. The LLM is just for interacting with the user. All the actions and requisite checks should be done in code.