Comment by justjake

22 days ago

Railway founder here, providing some color

> Why were they making CDN changes in prod? With their 100M funding recently they could afford a separate env to test CDN changes. Did their engineering team even properly understand surrogate keys to feel confident to roll out a change in prod? I don't think they're beating the AI allegations to figure out CDN configs, a human would not be this confident to test surrogate keys in prod.

We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The large issue, and mentioned in the blogpost, is that we didn't have a mechanism to to a staged release.

> During and post-incident, the comms has been terrible. Initial blog post buried the lede (and didn't even have Incident Report in the title). They only updated this after negative feedback from their customers. I still get the impression they're trying to minimise this, it's pretty dodgy. As other comments mentioned, the post is vague.

Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.

> They didn't immediately notify customers about the security incident (people learned from their users). The apparently have emailed affected customers only, many hours after. Some people that were affected that still haven't been emailed, and they seem to be radio silent lately.

We notified customers even before we did a wide release, as is process for anything security related. You create space for as much disclosure area as possible, and then follow up with a public disclosure

> Their founder on twitter keeps using their growth as an excuse for their shoddy engineering, especially lately. Their uptime for what's supposed to be a serious production platform is abysmal, they've clearly prioritised pushing features over reliability https://status.railway.com/ and the issues I've outlined here have little to do with growth, and more to do with company culture.

Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!

> Their forum is also getting heated, customers have lost revenue, had medical data leaked etc., with no proper followup from the railway team

There are team members in that thread linked, are you certain you linked the right thread? Happy to have a look at anything you believe we're missing!

I'm sorry, but there's a lot of spin here. Basically you guys handled this terribly, and your reliability has tanked recently, hence why customers that need reliability in production are leaving or have already migrated.

> We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The large issue, and mentioned in the blogpost, is that we didn't have a mechanism to to a staged release.

Honestly for a production-grade _platform_ company, that also does compliance (SOC2/3, HIPAA etc.), not having a staged release is negligent, and how you guys are handling this is a huge red flag. I've done such changes myself in production envs, for deployments that don't have the stakes you guys have. I'm normally more sympathetic on incidents, but the lack of transparency thus far from railway leaves me doubting more than anything.

> Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.

Please read the room, there's still a lot of confusion about the blog post in this thread (> We notified customers even before we did a wide release, as is process for anything security related. You create space for as much disclosure area as possible, and then follow up with a public disclosure

Emailing only affected users isn't working out, because affected people aren't yet emailed (I know one personally). Just check the post on your own forum (https://station.railway.com/questions/data-getting-cached-or... did you actually read it?) and see the list of people affected still not emailed, and left on read. You guy should email everyone, this is a security incident not a service interruption. There's a lot of loss trust by your customers now, i.e., if you guys can't figure out who to email, what else are you doing wrong?

> Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!

  • Agreed 100%. So much downtime, constant minimising of situations. Can't be trusted. We are moving away from Railway.

  • > Honestly for a production-grade _platform_ company, that also does compliance (SOC2/3, HIPAA etc.), not having a staged release is negligent, and how you guys are handling this is a huge red flag. I've done such changes myself in production envs, for deployments that don't have the stakes you guys have. I'm normally more sympathetic on incidents, but the lack of transparency thus far from railway leaves me doubting more than anything.

    We do indeed have a staging environment as mentioned previously. The issue arose in the rollout to production as mentioned previously.

    > The blog post reads like PR compared to the initial incident status report, and the resolved timestamp does not match which is sloppy.

    I've gone ahead and added the surrogate key mention into the post mortem. We initially got in trouble for having it be too technical centric and not enough on the user impact. It's a delicate balance; apologies. As I mention, we are open to critical feedback here.

    > Emailing only affected users isn't working out, because affected people aren't yet emailed (I know one personally). Just check the post on your own forum (https://station.railway.com/questions/data-getting-cached-or... did you actually read it?) and see the list of people affected still not emailed, and left on read.

    We have people working directly in that thread. For anybody who believes they were affected but not reached out to, we're working directly with them. We do take this very seriously. If you know someone here, please have them reach out either there or directly to me at jake@railway.com

    > Again, it's not an excuse if you're a _platform_ company that customers pay a lot of money to be reliable. You can't just keep saying you're open to feedback and being transparent as vanity.

    In the directly linked tweet I've mentioned that we're focusing on scaling the current system vs adding new features. We absolutely do need to do better on reliability, and my point is "Is there a specific poor engineering practice you're seeing here, or is it just based on reliability". Either is a fine crit we just want to make sure all our basis are covered

    > Did you read the thread? Yes, only _one_ employee commented 5 hours after my HN comment. Still almost everyone left of read, unanswered questions etc.

    Indeed I've read the thread, and we have people working it (you can see as of 8 hours ago).

    • > We do indeed have a staging environment as mentioned previously. The issue arose in the rollout to production as mentioned previously.

      You may have misunderstood, I said staged release, i.e., I'm referencing the rollout

      > I've gone ahead and added the surrogate key mention into the post mortem. We initially got in trouble for having it be too technical centric and not enough on the user impact. It's a delicate balance; apologies. As I mention, we are open to critical feedback here.

      You can do both. If you have different audiences, have two separate posts and mutually link to redirect audiences. Ask your sec staff instead of relying on paying customers to give post-hoc feedback on your dodgy disclosure practices. If I have ping a platform company to correct and clarify info about their security disclosure, I'm out.