The status page [1] has the actual root cause (enabling "Surrogate Keys" silently bypassed their CDN-off logic). The blog post doesn't. That's backwards.
"0.05% of domains" is a vanity metric -- what matters is how many requests were mis-served cross-user. "Cache-Control was respected where provided" is technically true but misleading when most apps don't set it because CDN was off. The status page is more honest here too: they confirmed content without cache-control was cached.
They call it a "trust boundary violation" in the last line but the rest of the post reads like a press release. No accounting of what data was actually exposed.
[1] https://status.railway.com/incident/X0Q39H56
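To illustrate the Cache-Control point above: a minimal sketch of an application-side default that marks per-user responses as uncacheable, so the app does not depend on the CDN being off. The function name `with_safe_cache_headers` and the rule for what counts as "per-user" are assumptions for this example, not anything from Railway's post or status page.

```python
from typing import Mapping


def with_safe_cache_headers(request_headers: Mapping[str, str],
                            response_headers: Mapping[str, str]) -> dict[str, str]:
    """Return response headers with an explicit Cache-Control on per-user responses."""
    req = {k.lower(): v for k, v in request_headers.items()}
    out = dict(response_headers)
    resp_names = {k.lower() for k in out}

    # Assumption: a request is per-user if it carries credentials, or if the
    # response is about to set a cookie.
    per_user = ("authorization" in req
                or "cookie" in req
                or "set-cookie" in resp_names)

    # Never override a caching policy the application chose deliberately.
    if per_user and "cache-control" not in resp_names:
        out["Cache-Control"] = "private, no-store"
    return out


headers = with_safe_cache_headers({"Cookie": "sid=abc"},
                                  {"Content-Type": "application/json"})
```

With this in place, `headers` carries `Cache-Control: private, no-store`, which tells shared caches not to store the response, while anonymous responses and responses with an existing policy pass through untouched.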
Appreciate the feedback. We got some feedback previously that things were "too technical" and not acknowledging it from what users saw.
I've gone ahead and re-added the surrogate keys statement to the press release. Thank you for the feedback, and if there are other things that you believe can be better, please let me know!
I'm kinda shocked (yet not surprised) at how bad railway has been with this:
- Why were they making CDN changes in prod? With their 100M funding recently they could afford a separate env to test CDN changes. Did their engineering team even properly understand surrogate keys to feel confident to roll out a change in prod? I don't think they're beating the AI allegations to figure out CDN configs, a human would not be this confident to test surrogate keys in prod.
- During and post-incident, the comms has been terrible. Initial blog post buried the lede (and didn't even have Incident Report in the title). They only updated this after negative feedback from their customers. I still get the impression they're trying to minimise this, it's pretty dodgy. As other comments mentioned, the post is vague.
- They didn't immediately notify customers about the security incident (people learned from their users). They apparently have emailed affected customers only, many hours after. Some people that were affected still haven't been emailed, and they seem to be radio silent lately.
- Their founder on twitter keeps using their growth as an excuse for their shoddy engineering, especially lately. Their uptime for what's supposed to be a serious production platform is abysmal, they've clearly prioritised pushing features over reliability https://status.railway.com/ and the issues I've outlined here have little to do with growth, and more to do with company culture.
Honestly, I don't think railway is cut out for real production work (let alone compliance deployments), at least nothing beyond hobby projects.
Their forum is also getting heated, customers have lost revenue, had medical data leaked etc., with no proper followup from the railway team
https://station.railway.com/questions/data-getting-cached-or...
I was affected and got no communication at all, had to find out from user reports and take immediate action with 0 signal from railway about the issue (even though they were already aware according to the timeline).
I've been trying to defend railway since we built our initial prototype there and I wanted to avoid the cost of migrating to some "serious infra" until proven needed, but they have been making their defense a really hard job (without mentioning that their overall reliability has been really bad the past weeks)
Yeah, this was really the nail in the coffin for us. Most services are already moved from Railway, but the rest will follow during this week.
Railway founder here, providing some color
> Why were they making CDN changes in prod? With their 100M funding recently they could afford a separate env to test CDN changes. Did their engineering team even properly understand surrogate keys to feel confident to roll out a change in prod? I don't think they're beating the AI allegations to figure out CDN configs, a human would not be this confident to test surrogate keys in prod.
We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The large issue, as mentioned in the blogpost, is that we didn't have a mechanism to do a staged release.
> During and post-incident, the comms has been terrible. Initial blog post buried the lede (and didn't even have Incident Report in the title). They only updated this after negative feedback from their customers. I still get the impression they're trying to minimise this, it's pretty dodgy. As other comments mentioned, the post is vague.
Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.
> They didn't immediately notify customers about the security incident (people learned from their users). They apparently have emailed affected customers only, many hours after. Some people that were affected still haven't been emailed, and they seem to be radio silent lately.
We notified customers even before we did a wide release, as is process for anything security related. You create space for as much disclosure area as possible, and then follow up with a public disclosure.
> Their founder on twitter keeps using their growth as an excuse for their shoddy engineering, especially lately. Their uptime for what's supposed to be a serious production platform is abysmal, they've clearly prioritised pushing features over reliability https://status.railway.com/ and the issues I've outlined here have little to do with growth, and more to do with company culture.
Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!
> Their forum is also getting heated, customers have lost revenue, had medical data leaked etc., with no proper followup from the railway team
There are team members in that thread linked, are you certain you linked the right thread? Happy to have a look at anything you believe we're missing!
I'm sorry, but there's a lot of spin here. Basically you guys handled this terribly, and your reliability has tanked recently, hence why customers that need reliability in production are leaving or have already migrated.
> We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The large issue, as mentioned in the blogpost, is that we didn't have a mechanism to do a staged release.
Honestly for a production-grade _platform_ company, that also does compliance (SOC2/3, HIPAA etc.), not having a staged release is negligent, and how you guys are handling this is a huge red flag. I've done such changes myself in production envs, for deployments that don't have the stakes you guys have. I'm normally more sympathetic on incidents, but the lack of transparency thus far from railway leaves me doubting more than anything.
> Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.
Please read the room, there's still a lot of confusion about the blog post in this thread.
> We notified customers even before we did a wide release, as is process for anything security related. You create space for as much disclosure area as possible, and then follow up with a public disclosure
Emailing only affected users isn't working out, because some affected people haven't been emailed yet (I know one personally). Just check the post on your own forum (https://station.railway.com/questions/data-getting-cached-or... did you actually read it?) and see the list of affected people who still haven't been emailed and were left on read. You guys should email everyone; this is a security incident, not a service interruption. Your customers have lost a lot of trust now, i.e., if you guys can't figure out who to email, what else are you doing wrong?
> Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!
Still waiting on a reply and the logs so I can do forensics on this incident. IMO the response from Railway should have been: "all hands on deck, red alert, worst imaginable security breach for a PaaS". Not a small yellow alert popup about a CDN misconfiguration, and saying that all affected customers have been emailed, which is demonstrably not correct.
These incidents are a perfect example of how misleading "simple" systems can be.
From the outside, it looks like "just a cache misconfiguration," but in reality, the problem is more insidious because it's distributed across multiple layers:
- application logic (authentication limitations)
- CDN behavior -> infrastructure
- default settings that users rely on (no cache headers because the CDN was disabled)
The hardest part of debugging these cases isn't identifying what happened, but realizing where the model is flawed: everything appears correct locally, the logs don't report any issues, yet users see completely different data.
I've seen similar cases where developers spent hours debugging the application layer before even considering that something upstream was silently changing the behavior.
These are the kind of incidents where the debugging path is anything but linear.
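The layering described above can be sketched with a toy model. This is a deliberately naive shared cache keyed only by URL, not a model of Railway's CDN; the class `ToySharedCache`, the two origin functions, and the rule that anything without `private` or `no-store` gets stored are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[Dict[str, str], str]  # (headers, body)


class ToySharedCache:
    """A deliberately naive shared cache keyed only by URL (not a real CDN)."""

    def __init__(self) -> None:
        self.store: Dict[str, Response] = {}

    def fetch(self, url: str, user: str,
              origin: Callable[[str, str], Response]) -> str:
        if url in self.store:
            # Cache hit: the origin, and its auth check, is never consulted.
            return self.store[url][1]
        headers, body = origin(url, user)
        cache_control = headers.get("Cache-Control", "").lower()
        # Assumption: this toy cache stores anything not marked private or
        # no-store, including responses with no Cache-Control header at all.
        if "private" not in cache_control and "no-store" not in cache_control:
            self.store[url] = (headers, body)
        return body


def origin_without_headers(url: str, user: str) -> Response:
    # The app is correct locally, but sets no cache headers.
    return {}, f"dashboard for {user}"


def origin_with_headers(url: str, user: str) -> Response:
    return {"Cache-Control": "private, no-store"}, f"dashboard for {user}"


leaky = ToySharedCache()
first = leaky.fetch("/dashboard", "alice", origin_without_headers)
second = leaky.fetch("/dashboard", "bob", origin_without_headers)  # alice's page

safe = ToySharedCache()
safe.fetch("/dashboard", "alice", origin_with_headers)
third = safe.fetch("/dashboard", "bob", origin_with_headers)  # bob's own page
```

In the leaky case the origin is called only once, so its logs show nothing unusual while the second user receives the first user's body. That is the "everything appears correct locally" situation.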
This write up doesn’t make sense. Authenticated users are the ones without a Set-Cookie? Surely the ones with the cookie set are the authenticated ones?
There are dozens of contradictions, like first they say:
“this may have resulted in potentially authenticated data being served to unauthenticated users”
and then just a few sentences later say
“potentially unauthenticated data is served to authenticated users”
which is the opposite. Which one is it?
Am I missing something, or is this article poorly reviewed?
Fixed the typo in that second paragraph and aligned the section on the Set-Cookie stuff. Anything else that can be made more clear?
It appears that your company experienced an incident during which a blog entry was made available in which readers became informed about certain information about a server condition that resulted in certain users receiving a barrage of indirect clauses etc. etc. etc.
Be more direct. Be concise. This blog post sounds like a cagey customer service CYA response. It defeats the purpose of publishing a blog post showing that you’re mature, aware, accountable, and transparent.
The problem is that these visible errors make us wonder what other errors in the post are less visible. Fixing them doesn’t fix the process that led to them.
pretty hard to find this on their blog, looks like incidents are tucked away at the bottom. an issue of this size deserves a higher spot.
(also looks like two versions of the 'postmortem' are published at https://blog.railway.com/engineering)
Almost three years ago now, Railway poached one of our smartest engineers. They were smart to do so. I have a lot of respect for the Railway team and I’m impressed with their execution.
I think this is their first major security incident. Good that they are transparent about it.
If possible (@justjake) it would be helpful to understand if there was a QA/test process before the release was pushed. I presume there was, so the question is why this was not caught. Was this just an untested part of the codebase?
We indeed run tests as well as stage releases. For this issue, when rubber met road in production, we saw cases which weren't visible in staging.
We've rolled out some changes (detailed in the blogpost) which should avoid this in the future. Deepest apologies
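For readers wondering what a staged release of a CDN-level change can look like in general: a minimal sketch of deterministic percentage bucketing by domain. This is a generic technique, not the mechanism described in the blogpost; the function `in_rollout` and the feature label `"surrogate-keys"` are hypothetical names for the example.

```python
import hashlib


def in_rollout(domain: str, feature: str, percent: float) -> bool:
    """Deterministically place `domain` in the first `percent` percent of buckets."""
    digest = hashlib.sha256(f"{feature}:{domain}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so percentages down to 0.01 work.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# A domain keeps the same bucket at every stage, so raising the percentage
# only ever adds domains; it never flips an enabled domain back off.
stages = [in_rollout("shop.example", "surrogate-keys", p) for p in (0, 1, 10, 100)]
```

Because the bucket depends only on the feature label and the domain, a problem seen at a small percentage affects a bounded, reproducible set of domains before the change reaches everyone.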
I am a big railway supporter and will continue to be. I run an agency and host many projects on the platform and will continue to do so. However, I never received an email and proactive notification about the incident. I hope the comms are better in the future. Best of luck with everything Jake!
I'm curious if having unique URLs per user session would mitigate this.
I think that's already best practice in most API designs anyway?
Probably.
No, it isn't. I've not seen this in an API ever, only in webapps with ?phpsessid= back in childhood.
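A rough way to reason about the per-session URL idea is to look at the cache key. The sketch below is a toy model: `cache_key` is a hypothetical helper, and whether a given CDN actually keys on the headers named in `Vary` is CDN-specific and assumed here.

```python
from typing import Mapping, Sequence


def cache_key(url: str, request_headers: Mapping[str, str],
              vary: Sequence[str] = ()) -> tuple:
    """Build a cache key from the URL plus the request headers named in `vary`."""
    req = {k.lower(): v for k, v in request_headers.items()}
    return (url,) + tuple((name.lower(), req.get(name.lower(), "")) for name in vary)


alice = {"Cookie": "sid=alice"}
bob = {"Cookie": "sid=bob"}

# URL-only key: both users collide on the same cache entry.
same = cache_key("/api/me", alice) == cache_key("/api/me", bob)

# Per-session URL: the keys differ, but the session identifier now sits in
# the URL, where it can end up in logs, browser history, and Referer headers.
per_url = cache_key("/api/me?sid=alice", alice) != cache_key("/api/me?sid=bob", bob)

# Varying on Cookie: the keys differ without putting the session in the URL.
per_vary = (cache_key("/api/me", alice, vary=["Cookie"])
            != cache_key("/api/me", bob, vary=["Cookie"]))
```

So unique URLs would separate the entries in this model, but the same separation is available from a header-based key, and marking the response `private` avoids shared caching altogether.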
Does Stripe use Railway? The dashboard was down today and this is the only incident report I've encountered and the timeline matches Stripe's downtime.