Comment by liampulles

1 hour ago

I have been a part of a team that actually managed to significantly reduce critical tech debt in its system, to the point of background radiation. I can speculate on what I think were key contributing factors (some of which are just productivity improvements, which meant we had more bandwidth for tech debt):

* The team used a monorepo for (nearly) all its code. The upshots of this include the ability to enforce contracts between services all in one commit, the ability to make and review cross-cutting changes all in one PR, the increased flexibility in making large-scale architecture changes, and an easier time making automations and tools which work across the system.

* We used Go, which turned out to be a really excellent fit for working within a monorepo and a large-ish codebase. Also, having the Go philosophy to fall back on in a lot of code decisions, which favors a plain and clear style, worked out well (IMO). And it's great for making CLI tools, especially ones which need to concurrently chew through a big data dump.

* Our team was responsible for integrations, and we took as a first principle that synchronous commands to our API would be the rare exception. Being async-first allowed us to cater for a lot of load by spreading it out over time, rather than scaling up instances (and dealing with synchronization/timing/load explosion issues).

* We converted the bulk of our microservices into a stateless monolith. Our scalability did not suffer much, because the final Go container is still just a couple of MB, and we can still easily and cheaply scale instances up when we need to. But being able to just write and call a function in a domain, rather than building an API and calling another service (and dealing with the issues thereof), is so much easier.

* Our team was small - for most of the time I was involved, it consisted of 3 developers. It's pretty easy to talk about code stuff and make decisions if you only have to discuss it with 2 other people.

* All of us developers were open to differing ideas, and generally speaking the person who cared the most about something could go and try it. If it didn't work, there would be no love lost in replacing it later.

* We had a relatively simple architecture that was enforced generally but not stringently. What I mean by that is that issues could be identified in code review, but the issue would be a suggestion and not a blocker. Either the person changes it and it's fine, or they don't, in which case you could go and change it later if you still really cared about it.

* We benefited from having some early high-impact wins in terms of productivity improvements, and we used a lot of the spare sprint time to tackle ongoing tech debt, rather than accelerate feature work (but not totally, the business gets some wins too).

* Big tech debt endeavors were discussed and planned in advance with the whole team, and we made diligent little chips at these problems for months. Once an issue was chipped away enough to not be painful anymore, it stopped getting worked on (getting microservices into the monolith, for example, died down as an issue once we had refactored most of them).

* Tech debt items were prioritized by a ranked vote made by everyone, using a tool I built: https://github.com/liampulles/go-condorcet. This did well to ensure that everyone got the opportunity to have something they cared about get tackled. Oftentimes our votes were very similar, which meant we avoided needless arguments when we actually agreed, and recognized a common understanding. I think this contributed to continued engagement from the team on the whole enterprise.

* Our tech stack was boring and reliable, which was basically Postgres, Redis, and NATS. Though NATS did present a few issues in getting the config right (and indeed, it's the least boring piece). We also used Kubernetes, which is far from boring, but we benefited from having a few people who really understood it well.

* We built a release tool CLI, and built reasonably good general error alerting for our system (based on logs mostly, but with some Sentry and infra alerts as well), which made releasing things easy. This generally increased productivity, but also meant that more releases were small releases, and were easier to revert if there were issues.

* We had a fantastic PM, who really partnered with us on the enterprise and worked hard to make our project actually Agile, even though the rest of the business was not technical.
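
For anyone curious what a ranked vote over tech debt items can look like, here is a minimal sketch in Go. To be clear, this is not the code from go-condorcet: it uses Copeland scoring (count how many head-to-head contests each item wins), which is one simple Condorcet-style method, and the linked tool may well work differently. The names `Ballot`, `pairwiseWins` and `rankByCopeland` and the example items are made up for illustration, and it assumes every ballot ranks every item.

```go
package main

import (
	"fmt"
	"sort"
)

// Ballot lists tech debt items from most to least preferred.
// Hypothetical type for this sketch; assumes each ballot ranks every item.
type Ballot []string

// pairwiseWins counts, for each ordered pair (a, b), how many ballots rank a above b.
func pairwiseWins(ballots []Ballot) map[string]map[string]int {
	wins := make(map[string]map[string]int)
	for _, ballot := range ballots {
		for i, a := range ballot {
			if wins[a] == nil {
				wins[a] = make(map[string]int)
			}
			for _, b := range ballot[i+1:] {
				wins[a][b]++
			}
		}
	}
	return wins
}

// rankByCopeland orders items by the number of head-to-head contests they win.
// Items with equal scores are ordered alphabetically so the result is deterministic.
func rankByCopeland(ballots []Ballot) []string {
	wins := pairwiseWins(ballots)
	items := make([]string, 0, len(wins))
	for item := range wins {
		items = append(items, item)
	}
	score := make(map[string]int)
	for _, a := range items {
		for _, b := range items {
			if a != b && wins[a][b] > wins[b][a] {
				score[a]++
			}
		}
	}
	sort.Slice(items, func(i, j int) bool {
		if score[items[i]] != score[items[j]] {
			return score[items[i]] > score[items[j]]
		}
		return items[i] < items[j]
	})
	return items
}

func main() {
	// Three developers, three made-up tech debt items.
	ballots := []Ballot{
		{"flaky tests", "merge services", "upgrade deps"},
		{"merge services", "flaky tests", "upgrade deps"},
		{"flaky tests", "upgrade deps", "merge services"},
	}
	for i, item := range rankByCopeland(ballots) {
		fmt.Printf("%d. %s\n", i+1, item)
	}
}
```

With these ballots, "flaky tests" beats both other items head-to-head and comes out on top, followed by "merge services" and then "upgrade deps". The nice property of a pairwise method like this is that an item most of the team prefers over each alternative wins, even if it isn't everyone's first choice.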