Comment by echelon
18 hours ago
Logging, tracing, observability, and control plane (flags, etc.) should be open.
We built 100% in-house pieces for all of this at a major fintech a decade ago. Everything worked and single teams could manage these systems.
Someone in leadership said we had to get rid of all "weirdware". Open solutions weren't robust, so we went commercial.
SignalFX got acquired and immediately 10x'd our prices, so we put all hands on deck to migrate. Unscheduled, stressful, bullshit. We missed the migration date and had to pay anyway.
LaunchDarkly promised us the moon to replace the system my team built. It didn't work with Ruby or Go, and the Java client sucked. It couldn't sync online changes at runtime like our five-nines, distributed, fault-tolerant system could. We had to upstream a ton of code. And their system still sucked by the time I left the project.
These systems need to be open and owned by us. Managed is okay, but they shouldn't be proprietary offerings.
I could extend that one step further to cloud itself, but that's an argument for another day.
> I could extend that one step further to cloud itself, but that's an argument for another day
Absolutely. OSS platforms like k8s got us a long way. OpenStack was the dream (deeply flawed in execution). If we want to seriously talk about resilience, we can’t accept that almost all major clouds run proprietary systems and we just have to trust that they’ll be around forever.
NIH syndrome isn't sustainable, unless you're like Google and have more money than sense.
> These systems need to be open and owned by us. Managed is okay, but they shouldn't be proprietary offerings.
You could say this about all software in the world, but good luck with that... people who make money off of making things and selling things are going to keep doing so in non-open ways, because it's advantageous. And customers will keep buying them, because it's better than the alternative.
My last place also rolled their own feature flag service as their business logic around users/orgs/segments didn't neatly match anything off-the-shelf. It did what it was meant to and worked fine. OTOH we used Datadog for telemetry, which was expensive but made sense since we didn't have enough headcount with the skills to support something self-hosted.
At the end of the day, you just need to make good decisions based on honest analysis of your needs, capabilities, and general context.
> NIH syndrome isn't sustainable, unless you're like Google and have more money than sense.
Control plane and observability are key concerns of a fintech handling billions in daily transaction volume.
We had teams building and managing our solutions. After the migrations, we had teams managing the integrations. The headcount didn't change; we just wound up paying external vendors and sequencing multiple provider moves and company-wide migrations. The changes caused several outages and shifted OKRs.