Comment by rossdavidh
3 days ago
It's a great article, until the end where they say what the solution would be. I'm afraid that the solution is: build something small, and use it in production before you add more features. If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level. There is no software development process which reliably produces software that works at scale without doing it small, and medium sized, first, and fixing what goes wrong before you go big.
> If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level.
At a large box retail chain (15 states, ~300 stores) I worked on a project to replace the POS system.
The original plan had us getting everything working (Ha!), then deploying it out to the stores, and finishing with the two oddball "stores": the company cafeteria and the surplus store, which were technically stores in that they had all the same setup and processes, but were odd.
When the team I was on was brought into this project, we flipped that around and deployed to those two first, several months ahead of the schedule for the regular stores.
In particular, the surplus store had a few dozen transactions a day. If anything broke, you could do reconciliation by hand. The cafeteria, on the other hand, had single-register transaction volume that surpassed the surplus store's on most any day. Furthermore, all of its transactions were payroll deductions (swipe your badge rather than a credit card or cash). This meant that if anything went wrong there, we weren't in trouble with PCI and could simply debit and credit accounts.
Ultimately, we made our deadline to get things out to stores. We did have one nasty bug that showed up in late October (or was it early November?) with repackaging counts (if a box of 6 was $24 and a single item was $4.50, then if you bought 6 single items they were "repackaged" to cost $24 rather than $27), which interacted with a BOGO sale. That bug resulted in absurd receipts with sales and discounts (the receipt showed you spent $10,000 but were discounted $9,976), and then the GMs got alerts that the store was not able to make payroll because of a $9,976 discount. One of the devs pulled an all-nighter to fix that one and it got pushed to the stores.
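The repackaging rule by itself is easy to sketch. Here is a minimal, hypothetical version: the prices and quantities come from the comment, but the function name and structure are invented, and the separate BOGO discount pass that the bug actually interacted with is deliberately left out.

```python
# Hypothetical sketch of the "repackaging" pricing rule described above.
# Prices are the ones from the comment; everything else is illustrative.
SINGLE_PRICE = 4.50   # price of one item sold individually
BOX_PRICE = 24.00     # price of a box of 6 of the same item
BOX_QTY = 6

def line_total(qty: int) -> float:
    """Repackage every full group of 6 singles at the box price."""
    boxes, singles = divmod(qty, BOX_QTY)
    return boxes * BOX_PRICE + singles * SINGLE_PRICE

# 6 singles at $4.50 would be $27, but repackaging charges the $24 box price.
assert line_total(6) == 24.00
assert line_total(7) == 28.50  # one box plus one single
```

The nastiness came from a second promotion engine (the BOGO sale) rewriting quantities and discount lines on top of a rule like this, which is exactly the kind of interaction that only shows up in live stores.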
I shudder to think about what would have happened if we had tried to push the POS system out to customer-facing stores before the performance issues found in the cafeteria had been worked out, or if we had had to reconcile transactions to hunt down incorrect tax calculations.
You could, in principle, have implemented the new system to run in "dummy mode" alongside the existing system at regular stores, so that you could see that it produced the 'same' results, at least in terms of what the existing system is able to provide.
Which is to say, there is more than one approach to gradual deployment.
Not easily when issues of PCI get in there.
Things like the credit card reader (and magnetic ink reader for checks), the different input devices (sending the barcode scanner's output to two different systems), and keyboard input (completely different screens and keyed entry) would have made those hardware problems additional things that needed to be solved.
The old system was a DOS-based one where a given set of F-keys was used to switch between screens. Need to do hand entry of a SKU? That was F4, then type the number. Need to search for the description of an item? That was F5. The keyboard was particular to that register setup and used an old-school XT (5-pin DIN) plug. The new systems were much more modern Linux boxes that used USB plugs. The mag strip readers were flashed with new screens (and the old ones were replaced).
For this situation, it wasn't feasible for us to send keyboard, scanner, and credit card events to another register.
29 replies →
From my experience, a lot of the hardest problems in this space are either 1. edge cases or 2. integration-related, and that makes them hard to validate across systems or to draw boundaries around what's in the dummy mode. This type of parallel, live, full-system integration test is hard to pull off.
1 reply →
Sounds good in theory, but very few real-world projects can afford to run with the old system in parallel.
>> We did have one nasty bug that showed up in late October (or was it early November?)
Having worked in Ecommerce & payment processing, where this weekend is treated like the Superbowl, birth of your first child and wedding day all rolled into one, a nasty POS bug at this time of year would be incredibly stressful!
After thinking back on it, I think this was earlyish October. The code hadn't frozen yet, but it was getting increasingly difficult. We were in the "this is deployed to about 1/3 of the stores - all within an 8 hour drive of the general office" stage. The go/no-go decision for the rest of the stores in October was coming up (and people were reviewing backout procedures for those 100). One of the awkward parts was that marketing had a Black Friday sale that they really wanted to do (buy X, buy Y, get Z half price) that the old registers couldn't support. They wanted to get an "is this going?" answer so they could start printing the advertising flyers.
Incidentally, this bug resurfaced for the next five years in a different incarnation. Because the system recorded that this department (it was all on one SKU) had sold $10M that week in October, the running average sales target the next year was MEAN($24k, $25k, $26k, $25k, $10M) ... and the department heads were doing a "you want me to sell how much?!"
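The arithmetic of the resurfacing bug is easy to reproduce. The figures below are the ones from the comment; the median comparison is just an illustration of why a robust statistic wouldn't have produced the absurd target.

```python
# Illustrative only: how one absurd week poisons a running-average sales
# target, and why a robust statistic like the median would shrug it off.
from statistics import mean, median

weekly_sales = [24_000, 25_000, 26_000, 25_000, 10_000_000]  # last value is the bug

target = mean(weekly_sales)    # 2_020_000: "you want me to sell how much?!"
robust = median(weekly_sales)  # 25_000: unfazed by the outlier
```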
This bug had only affected... maybe five stores (still maybe five too many). We were in "this is the last™ build before all-store deployment next week" territory. It did mess with that a bit too, as the boxed-up registers came with an additional step of "make sure to reboot the register after doing initial confirmation."
The setup teams had a pallet of computers delivered to the stores that were supposed to be "remove the old registers, put these registers in, swap mag strip readers, take that laptop there and run this software to configure the devices on each register." However, the build the registers shipped with was the buggy build. While that build likely wouldn't hit the bug (it required a particular sale to be active, which was only at a few stores and had ended), it still was another step that they had to follow.
Aside: For all its clunkiness, Java Web Start was neat. In particular, it meant that instead of trying to push software to 5k registers (how do you push to registers that are powered off?), instead we'd push to 300 stores and from there JWS would check for an update each time it started up ( https://docs.oracle.com/javase/8/docs/technotes/guides/javaw... ). So instead of pushing to 5k registers, we'd have it pull from 'posupdate' on the local network when it rebooted.
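The pull-on-startup pattern the comment describes is simple to sketch. Nothing below is the actual JWS mechanism or the retailer's code; the version format and function names are invented for illustration.

```python
# Hypothetical sketch of the pull-based update pattern: each register, on
# boot, compares its local build version against the store's update server
# and downloads a newer build before launching, so nothing needs pushing
# to powered-off registers.

def needs_update(local_version: str, server_version: str) -> bool:
    """Compare dotted version strings numerically, so '1.10' > '1.9'."""
    parse = lambda v: [int(part) for part in v.split(".")]
    return parse(server_version) > parse(local_version)

def startup(local_version, fetch_server_version, download_build):
    """On boot: check the local server, pull a newer build if one exists."""
    if needs_update(local_version, fetch_server_version()):
        local_version = download_build()  # returns the version now installed
    return local_version

# A register on 1.9 boots, sees 1.10 on 'posupdate', and pulls it.
installed = startup("1.9", lambda: "1.10", lambda: "1.10")
```

The design win is the same one the comment credits to JWS: the push fan-out is 300 stores, not 5,000 registers.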
There is no solution because these projects are not failing because of technical reasons.
They are failing because of political scheming and a bunch of people wanting to have a finger in the pie - "trillions spent" - I guess no one would mind earning a couple million.
Then you have "important people" who want to be important and want to have an opinion on font size, and that some button should be 12px to the right, because they are "important". It doesn't matter for the project, but they have to assert their dominance.
You have 2 or 3 companies working on a project? Great! Now they will be throwing stuff over the fence to limit their own cost and blame others, while trying to get away with the least work done and cashing the most money possible.
That is how the sausage is made. Coming up with a "reasonable approach" is not the solution, because as soon as you get different suppliers and different departments, you end up with a power/money struggle.
> They are failing because of political scheming and bunch of people wanting to have a finger in the pie - "trillions spent" - I guess no one would mind earning couple millions.
Not (necessarily) wrong, but if you start small, Important People may not want to bother with something that is Unimportant and may leave things alone, so something useful and working can get going. If you start with an Important project, then Important People will start circling it right away.
Even starting small isn't a surefire way to avoid that problem. They'll just show up once the thing gets big enough.
Witness how the web was once a funny little collection of nerds sharing stuff with each other. But once it got big enough that you could start making money off it, the important people showed up and started taking over. The web still has those odd little corners, but it's largely the domain of a small number of giant powerful corporations.
I don't think there is a silver bullet for dealing with egomaniacs who want infinite power. They seem to be a part of the human condition and dealing with them is part of the ticket price for having a society.
2 replies →
I guess for me the important point is that it is not a technical issue, and we already have all the technical tools/processes to do really big software projects.
Even if people dislike scrum, find Git complicated, and don't want to open up JIRA - these tools are not the problem; these tools help build loads of working software.
We as software engineers with devops can deliver great and complex projects and build great systems. Lots of business people don't even understand how much in control we can be of the environments and code.
Yet developers/IT are there to be blamed. Like we should be ashamed, while Uncle Bob gives lectures on "how developers should be more professional".
Yet I always find business people who are like children in a corn field.
With the small difference that the business/sales guys are pushy and walk over the engineering guys, the engineers bend over and take the blame, and the business guys can always say "those IT kids playing with toys instead of doing a real job".
Political corruption is like environmental radiation: a viable fix is never 'just get rid of political corruption'*. It's an environmental constant that needs to be handled by an effective approach.
That said, parent's size- and scope-iterative approach also helps with corruption, because corruption metastasizes in the time between {specification} and {deliverable}.
Shrink that, by tying incremental payments to working systems at smaller scales, and you shrink the blast radius for failure.
That said, there are myriad other problems the approach creates (encouraging architectures that won't scale to the final system, promoting duct taped features on top of an existing system, vendor-to-vendor transitions if the system builder changes, etc).
But on the whole, the pros outweigh the cons... for projects controlled by a political process (either public or private).
That's why military procurement has essentially landed on spiral development (i.e. iterative demonstrated risk burn-down) as a meta-framework.
* Limit political corruption, to the extent possible in a cost efficient manner, sure
> There is no solution because these projects are not failing because of technical reasons.
There is no technical solution. There are systems and governance solutions, if the will is there to analyze and implement them.
That's what works for products, not software systems. Gradual growth inevitably results in loads of technical debt that is not paid off as Product adds more feature requests to deliver larger and larger sales contracts. Eventually you want to rewrite to deal with all the technical debt, but nobody has enough confidence to say what is in the codebase that's important to Product and what isn't, so everybody is afraid and frozen.
Scale is separately a Product and Engineering question. You are correct that you cannot scale a Product to delight many users without it first delighting a small group of users. But there are plenty of scaled Engineering systems that were designed from the beginning to reach massive scale. WhatsApp is probably the canonical example of something that was a rather simple Product with very highly scaled Engineering and it's how they were able to grow so much with such a small team.
> Gradual growth inevitably results in loads of technical debt.
Why is this stated as though it's some de facto software law? The argument is not whether it's possible to waterfall a massive software system. It clearly is possible, but the failure ratios have historically been sufficiently uncomfortable to give rise to entirely different (and evidently more successful) project development philosophies, especially when promoters were more sensitive to the massive sums involved (which in my opinion also helps explain why there are so many wasteful government examples). The lean startup did not appear in a vacuum. "Do things that don't scale" did not become a motto in these parts without reason. In case some are still confused about the historical purpose of this benign-sounding advice: no, it wasn't originally addressed at entrepreneurs aiming to run "lifestyle" businesses.
I think the logic is that good code is code that is maintainable and modifiable; bad code is difficult to change safely. Over time, all code is changed until it is bad code and cannot be changed any more. So over time most code is bad code that is scary to touch.
1 reply →
It's not a law but a cost.
Software is a unique field where, no matter how much money you throw at a project, there is always something we can "improve" or make better.
That's why we start with something small - a scope, if you want to call it that.
Of course, starting with something small, or dare I call it simpler, will result in more technical debt, because that thing isn't designed with scale in mind, which goes back to the first point.
It is a law. The law of entropy.
Try as you might, you cannot fight entropy eternally, as mistakes in this fight will accumulate and overpower you. It's the natural process of aging we see in every lifeform.
The way life continues on despite this law is through reproduction. If you bud off independent organisms, an ecosystem can gain "eternal" life.
The cost is that you must devote much of your energy to effective reproduction.
In software, this means embracing rewrites. The people who push against rewrites and claim they're not necessary are just as delusional as those who think they can live forever.
3 replies →
Software is a component of a product, if not the product itself. Treating software like a product, besides being the underlying truth, also means it makes sense to manage it like one.
Technical debt isn’t usually the problem people think it is. When it does become a problem, it’s best to think of it in product-like terms. Does it make the product less useful for its intended purpose? Does it make maintenance or repair inconvenient or costly? Or does it make it more difficult or even impossible to add competitive features or improvements? Taking a product evaluation approach to the question can help you figure out what the right response is. Sometimes it’s no response at all.
The discussion is not about a product where you can just remove stuff. The thread was about testing in a small setting and then moving to the oddball settings. If you are required to cover oddball settings, it makes sense to know about and plan for them.
Took me way too long to learn this. It still makes me sad to leave projects “imperfect” and not fiddle in my free time sometimes
Designing or intending a system to be used at massive scale is not the same as building and deploying it so that it only initially runs at that massive scale.
That's just a recipe for disaster, "We don't even know if we can handle 100 users, let's now force 1 million people to use the system simultaneously." Even WhatsApp couldn't handle hundreds of millions of users on the day it was first released, nor did it attempt to. You build out slowly and make sure things work, at least if you're competent and sane.
Sure, but if you did a good job, the gradual deployment can go relatively quickly and smoothly, which is how $FAANG roll out new features and products to very large audiences. The actual rollout is usually a bit of an implementation detail of what first needed to be architected to handle that larger scale.
8 replies →
No, but WhatsApp was built by 2 guys who had previously worked at Yahoo, and they picked a very strong tech for the backend: Erlang.
So while they probably didn't bother scaling the service to millions in the first version, they 1) knew what it would take, and 2) chose from the ground up a good technology to have a smoother transition to your "X million users". The step from X million to XYZ million and then billions required other things too.
At least they didn't have to write a PHP-to-C++ compiler like Facebook had to, given the initial design choices of Mark Zuckerberg, which shows exactly what it means to begin something with the right tool and ideas in mind.
But this takes skills.
9 replies →
> Gradual growth inevitably results in loads of technical debt that is not paid off as Product adds more feature requests to deliver larger and larger sales contracts.
This isn't technical debt, necessarily. Technical debt is a specific thing. You probably mean "an underlying design that doesn't perfectly map to what ended up being the requirements". But then the world moves on (what if a regulation is added that ruins your perfect structure anyway?) and you can't just wish for perfect requirements. Or not in software that interacts directly with the real world, anyway.
You have to design for scale AND deploy gradually
Yes, absolutely. Knowing that it will need to get big eventually is important, but not at all the same as deploying at scale initially.
Yes, it can be very difficult to add “scale” after the fact, once you already have a lot of data persisted in a certain way.
Gradual growth =/= many tacked on features. Many tacked on features =/= technical debt. Technical debt =/= "everybody is afraid and frozen." Those are merely often correlated, but not required.
Whatsapp is a terrible example because it's barely a product; Whatsapp is mostly a free offering of goodwill riding on the back of actual products like Facebook Ads. A great example would be a product like Salesforce, SAP, or Microsoft Dynamics. Those products are forced to grow and change and adapt and scale, to massive numbers doing tons of work, all while being actual products and being software systems. I think such products act as stark rebukes of what you've described.
There's nothing wrong with technical debt per se. As with all debt, the problem is incurring it without a plan or means to pay it off. Debt based financing is the engine of modern capitalism.
Gradual growth to large scale implies an ongoing refactoring cost--that's the price of paying off the technical debt that got you started and built initial success in small scale rollouts. As long as you keep "servicing" your debt (which can include throwing away an earlier chunk and building a more scalable replacement with the lessons learned), you're doing fine.
The magic words here to management/product owners is "we built it that way the first time because it got us running quickly and taught us what we need to know to build the scalable version. If we'd tried to go for the scalable version first, we wouldn't have known foo, bar and baz, and we'd have failed and wouldn't have learned anything."
we get paid to add to it, we don’t get paid to take away
Now there is your problem. It is only true in the context of grave incompetence, though. I have worked on tickets with 'remove' in the title.
[dead]
The dominant factor is: there is a human who understands the entire system.
That is vastly easier to achieve by making a small, successful system, which gets buy in from both users and builders to the extent that the former pay sufficient money for the latter to be invested in understanding the entire system and then growing it and keeping up with the changes.
Occasionally a moon shot program can overcome all of that inertia, but the “90% of all projects fail” is definitely overrepresented in large projects. And the Precautionary Principle says you shouldn’t because the consequences are so high.
This works for Clojure, git and even Linux. It seems there's a human who understands the entire system and decides what's allowed to be added to it. But these things are meant to be used by technical people.
The non-technical people I know might want to use Linux but stay on Windows or choose Mac OS because it's more straightforward. I use Windows+WSL at work even though I would like to use a native Linux distribution.
I know someone who created a MUD (an online text game) and told him I wanted to make one with a browser client. He said something we could translate as "Good, you can have all the newbies." Not only was he right that a MUD should be played with a MUD client like tintin++, but making a good browser client is harder than it seems, and that's time not spent making content for the game or improving the engine.
My point is that he was an uncompromising person who refused to add layers to a project because they would come at a cost, which isn't only time or dollars but also things like motivation and focus.
You’re conflating “knows the system” with benevolent dictator. It’s not the same. It’s down to whether in a planning or brainstorming session, there is anyone who can say that a plan won’t work or if there’s a better one.
Also it doesn’t have to be singular. You need at least one, in case that person leaves or becomes problematic. That dictator doesn’t always remain benevolent and they can hold a project hostage if they don’t like something that everyone else wants.
> ... even Linux. It seems there's a human who understands the entire system and decides what's allowed to be added to it.
I really wonder what will happen to Linux once Linus is no longer involved.
2 replies →
You will never get to the moon by making a faster and faster bus.
I see a lot of software with that initial small scale "baked into it" at every level of its design, from the database engine choice, schema, concurrency handling, internal architecture, and even the form design and layout.
The best-engineered software I've seen (and written) always started at the maximum scale, with at least a plan for handling future feature extensions.
As a random example, the CommVault backup software was developed in AT&T to deal with their enormous distributed scale, and it was the only decently scalable backup software I had ever used. It was a serious challenge with its competitors to run a mere report of last night's backup job status!
I also see a lot of "started small, grew too big" software make hundreds of silly little mistakes throughout, such as using drop-down controls for selecting users or groups. Works great for that mom & pop corner store customer with half a dozen accounts, fails miserably at orgs with half a million. Ripping that out and fixing it can be a decidedly non-trivial piece of work.
Similarly, cardinality in the database schema has really irritating exceptions that only turn up at the million or billion row scale and can be obscenely difficult to fix later. An example I'm familiar with is that the ISBN codes used to "uniquely" identify books are almost, but not quite unique. There are a handful of duplicates, and yes, they turn up in real libraries. This means that if you used these as a primary key somewhere... bzzt... start over from the beginning with something else!
There is no way to prepare for this if you start with indexing the book on your own bookshelf. Whatever you cook up will fail at scale and will need a rethink.
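The standard escape from the ISBN trap is a surrogate primary key, with the real-world code as a plain indexed column that is allowed to collide. A minimal sketch using an in-memory SQLite database (the duplicated ISBN below is fabricated for illustration, not one of the actual real-world collisions):

```python
# Sketch: why an "almost unique" real-world code makes a bad primary key.
# A surrogate integer key stays unique; the ISBN is merely indexed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE book (
        book_id INTEGER PRIMARY KEY,  -- surrogate key: always unique
        isbn    TEXT NOT NULL,        -- indexed, but deliberately NOT unique
        title   TEXT NOT NULL
    )
""")
con.execute("CREATE INDEX idx_book_isbn ON book(isbn)")

# Two distinct books that (like a handful of real ISBNs) share a code.
# With `isbn` as the primary key, the second insert would fail.
con.execute("INSERT INTO book (isbn, title) VALUES ('0-000-00000-0', 'Book A')")
con.execute("INSERT INTO book (isbn, title) VALUES ('0-000-00000-0', 'Book B')")

rows = con.execute(
    "SELECT title FROM book WHERE isbn = '0-000-00000-0' ORDER BY title"
).fetchall()
```

Lookups by ISBN then return a (possibly multi-row) result set, which is the honest model of the real world.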
Counterpoint: the idea that your project will be the one to scale up to the millions of users/requests/etc is hubris. Odds are, your project won't scale past a scale of 10,000 to 100,000. Designing every project to scale to the millions from the beginning often leads to overengineering, adding needless complexity when a simpler solution would have worked better.
Naturally, that advice doesn't hold if you know ahead of time that the project is going to be deployed at massive scale. In which case, go ahead and implement your database replication, load balancing, and failover from the start. But if you're designing an app for internal use at your company of 500, well, feel free to just use SQLite as your database. You won't ever run into the problems of scale in this app, and single-file databases have unique advantages when your scale is small.
Basically: know when huge scale is likely, and when it's immensely UNlikely. Design accordingly.
> Odds are, your project won't scale past a scale of 10,000 to 100,000.
That may be a self-fulfilling prophecy.
I agree in general that most apps don't need fancy scaling features, but apps that can't scale... won't... and hence "don't need scaling features".
> You won't ever run into the problems of scale in this app, and single-file databases have unique advantages when your scale is small.
I saw a customer start off with essentially a single small warehouse selling I dunno... widgets or something... and then the corporation grew and grew to a multi-national shipping and logistic corporation. They were saddled with an obscure proprietary database that worked like SQLite and had incredibly difficult to overcome technical challenges. They couldn't just migrate off, because that would have needed a massive many-year long total rewrite of their app.
For one performance issue we were seriously trying to convince them to use phase-change cooling on frequency-optimized server CPUs like a gamer overclocking their rig because that was the only way to eke out just enough performance to ensure their overnight backups didn't run into the morning busy time.
That's just not an issue with SQL Server or any similar standard client-server database engine.
2 replies →
You can by making a bigger and bigger rocket though.
While I think this is good advice in general, I don’t think your statement that “there is no process to create scalable software” holds true.
The uk gov development service reliably implements huge systems over and over again, and those systems go out to tens of millions from day 1. As a rule of thumb, the parts of the uk govt digital suite that suck are the parts the development service haven’t been assigned to yet.
The Swift banking org launches reliable features to hundreds of millions of users.
There’s honestly loads of instances of organisations reliably implementing robust and scalable software without starting with tens of users.
The UK government development service, as you call it, is not a service. It's more of a declaration of process that is adhered to across the diverse departments and organisations that make up the government. It's usually small teams that are responsible for exploring what a service is or needs and then implementing it. They are able to deliver decent services because they start small, design and user-test iteratively, and only when there is a really good understanding of what's being delivered do they scale out. The technology is the easy bit.
The UK Gov has many service and process docs [1]. It started out that way but has grown rapidly and changed, including a library for authentication, frontend templates and libraries, and custom docker images.
[1]: https://github.com/alphagov
UK GDS is great, but the point there is that they're a crack team of project managers.
People complain about junior developers who pass a hiring screen and then can't write a single line of code. The equivalent exists for both project management and management in general, except it's much harder to spot in advance. Plus there's simply a lot of bad doctrine and "vibes management" going on.
("Vibes management": you give a prompt to your employees vaguely describing a desired outcome and then keep trying to correct it into what you actually wanted)
> and those systems go out to tens of millions from day 1
I like GDS (I even interviewed with them once and saw their dev process etc) but this isn't a great example. Technically GDS services have millions of users across decades, but people e.g. aren't constantly applying for new passports every day.
A much better example I think is Facebook's rollout of Messenger, which scaled to billions of actual users on day 1 with no issues. They did it by shipping the code early in the Facebook app, and getting it to send test messages to other apps until the infra held, and then they released Messenger after that. Great test strategy.
GDS's budget is about £90 million a year or something. There are many other contracts still being spent on digital, for example PA Consulting for £60 million (over a few years), which does a lot of the gov.uk Home Office stuff, and the fresh grads they hire cost the government more than GDS's most senior staff...
SWIFT? Hold my beer. SWIFT has not launched anything substantial since its startup days in the early '70s.
Moreover, their core tech has not evolved that far from that era, and the '70s tech bros are still there through their progeny.
Here's an anecdote: The first messaging system built by SWIFT was text-based, somewhat similar to ASN.1.
The next one used XML, as it was the fad of the day. Unfortunately, neither SWIFT nor the banks could handle a 2-3 order of magnitude increase in payload size in their ancient systems. Yes, as engineers, you would think compressing the XML would solve the problem, and you would be right. Moreover, XML Infoset already existed, and it defined compression as a function of the XML Schema, so it was somewhat more deterministic, even if not more efficient than LZMA.
But the suits decided differently. At one of the SIBOS conferences they abbreviated the XML tags, and did it literally on paper, without thinking about back-and-forth translation, dupes, etc.
And this is how we landed on the ISO 20022 abbreviations that we all know and love: Ccy for Currency, Pmt for Payment, Dt for Date, etc.
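A toy comparison of the two approaches the anecdote contrasts, generic compression versus hand-abbreviated tags. The payloads are invented stand-ins, not real ISO 20022 messages, and the numbers only illustrate the direction of the effect:

```python
# Rough illustration: generic compression over verbose XML recovers far
# more size than hand-abbreviating tag names does. Toy payloads only.
import zlib

verbose = b"<Payment><Currency>EUR</Currency><Date>2024-01-02</Date></Payment>" * 100
abbreviated = b"<Pmt><Ccy>EUR</Ccy><Dt>2024-01-02</Dt></Pmt>" * 100

# Abbreviation shaves some bytes off the raw message...
raw_saving = len(verbose) - len(abbreviated)

# ...but compressing the verbose form ends up far smaller than even the
# abbreviated form left uncompressed.
compressed = zlib.compress(verbose)
```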
Harder to audit when every payload needs to be decompressed to be inspected
1 reply →
This is what https://www.amazon.com/How-Big-Things-Get-Done/dp/0593239512 advocates too: start small, modularize, and then scale. The example of Tesla's mega factory was particularly enticing.
> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
Gall’s law wins again.
> I'm afraid that the solution is: build something small, and use it in production before you add more features.
Gall's Law:
> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.[8]
* https://en.wikipedia.org/wiki/John_Gall_(author)#Gall's_law
Came here to say this. I still think that Linus Torvalds has the most profound advice to building a large, highly successful software system:
"Nobody should start to undertake a large project. You start with a small trivial project, and you should never expect it to get large. If you do, you'll just overdesign and generally think it is more important than it likely is at that stage. Or worse, you might be scared away by the sheer size of the work you envision. So start small, and think about the details. Don't think about some big picture and fancy design. If it doesn't solve some fairly immediate need, it's almost certainly over-designed. And don't expect people to jump in and help you. That's not how these things work. You need to get something half-way useful first, and then others will say "hey, that almost works for me", and they'll get involved in the project."
-- Linux Times, October 2004.
I don't think this applies in any way to companies contracted to build a massive system for a government with a clear need. Linus is talking about growing a greenfield open-source project, which may or may not ever be used by anyone.
In contrast, if your purpose is "we need to manage our country's accounting without pen and paper", that's a clear need for a massive system. Starting work on this by designing a system that can solve accounting for a small firm is not the right way to go. Instead, you have to design with the end-goal in mind, since that's what you were paid for. But, you don't launch your system to the entire country at once: you first use this system designed for a country in a small shop, to make sure it actually handles the small scale well, before gradually rolling out to more and more people.
> for a government with a clear need.
There's your problem. The needs are never clear, not on massive systems. Governments will write a spec, companies will read the spec and offer to implement it as written, knowing full well that it won't work. Then they charge exorbitant fees to modify the system after launch, so that it will actually fulfill business needs.
The Danish government is famous for sucking at buying massive IT systems.
Building those systems is a long term project, and you have to start small with a minimum number of functions, scope creep on those initial use cases often kills these kinds of projects.
No, Linus Torvalds would not stand for the projects from the article; he would slam the door and quit.
The projects the author pointed out are basically political horror stories. I can imagine how dozens of people wanted a cut of the money in those projects, or wanted to push things through because "they are important people".
There is nothing you can do technically to save such projects and it is NOT an IT failure.
This is a really dense paragraph of lifetime-accumulated wisdom in that single quote.
Works with implementations and not APIs though.
A bad API can constrain your implementation and often can't be changed once it's in use by loads of users. APIs should be right from day one if possible.
I would add the nuance that the possibility of controlled migration from one versioned API to another should be right from day one, not necessarily the first API version.
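One lightweight way to keep that migration path open is to put the version in the request path from day one, so a v2 can ship alongside v1 without breaking existing clients. A minimal sketch; the handler names, routes, and payload shapes here are invented for illustration, not from any real API:

```python
# Minimal sketch: the version is part of the route from day one, so a
# breaking change ships as /v2 while /v1 clients keep working.

def handle_v1(params):
    # v1 returned a bare amount; clients already depend on this shape.
    return {"amount": params["amount"]}

def handle_v2(params):
    # v2 fixes a design mistake: amounts now carry an explicit currency.
    return {"amount": {"value": params["amount"],
                       "currency": params.get("currency", "EUR")}}

ROUTES = {
    ("v1", "payment"): handle_v1,
    ("v2", "payment"): handle_v2,
}

def dispatch(path: str, params: dict):
    """Route a request to the handler for its (version, resource) pair."""
    version, resource = path.strip("/").split("/")
    return ROUTES[(version, resource)](params)

print(dispatch("/v1/payment", {"amount": 100}))
print(dispatch("/v2/payment", {"amount": 100, "currency": "DKK"}))
```

The point is not the routing mechanics but the contract: because the version is explicit from the first release, old clients can be migrated on a schedule instead of being broken by the first redesign.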
While I like the "start small and expand" strategy better than the "big project upfront", this trades project size for project length and often that is no better:
- It gives outside leadership types many more opportunities to add requirements later. This is nice if they are things missed in the original design, but it can also lead to massive scope creep.
- A big enough project that gets done the "start small and expand" way can easily grow into a decade-plus project. For an extreme example, see the multi-decade project by Indian Railways to gradually convert all its lines to a single broad gauge. It works fine if you have the organisational backing for a long duration, but the constant knowledge loss from people leaving, retiring, getting promoted, etc. can be a real problem for a project like that. Especially in fields where the knowledge is the product, like in software.
- Not every project can feasibly start small.
> If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level.
You could also try to buy some off-the-shelf solutions? Making payroll, even for very large organisations, isn't exactly a new problem.
As a corollary I would also suggest: subsidiarity.
> Subsidiarity is a principle of social organization that holds that social and political issues should be dealt with at the most immediate or local level that is consistent with their resolution.
(from https://en.wikipedia.org/wiki/Subsidiarity)
If you solve more problems more locally, you don't need that many people at the national level, thus making payroll there is easier.
I think you'll find that is exactly what people do. However, payroll solutions are highly customized for every individual company and even business unit. You don't buy a payroll software in a box, deploy it, and now you have payroll. Instead, you pay a payroll software company, they come in and get information about your payroll systems, and then they roll out their software on some of your systems and work with you to make sure their customizations worked etc. There's rarely any truly "off-the-shelf" software in B2B transactions, especially the type of end-user solutions that also interact with legal systems.
Also, governments are typically at least an order of magnitude larger than the largest companies operating in their countries, in terms of employees. So sure, the government of Liechtenstein has fewer employees than Google overall, but the US government certainly does not, and even Liechtenstein probably has way more government employees than Google employees in their country.
I work at a small shop, I'm a big advocate of giving customers the 0.1 version and then talking it out what they want. It's often not exactly what they asked for at the start ... but it often is better in the end.
It's hard to hit the target right the first time.
Yes. Also the same applies to companies. There should not be companies that are growing to $100 million revenue while losing money on a gamble that they will eventually get big enough to succeed. Good first, big later.
$100M maybe. But pretty much all tech needs an initial investment before you can start making profit. It takes a lot of development before you can get a product that anyone would want to pay for.
>It's a great article, until the end where they say what the solution would be. I'm afraid that the solution is: build something small, and use it in production before you add more features.
I think that is true for a lot of projects. But I'm not sure it is realistic to incrementally develop a control system for a nuclear reactor or an air traffic control system.
See also Gall's Law:
"All complex systems that work evolved from simpler systems that worked"
Not saying you're wrong, but I wonder what is the differentiating factor for software? We can build huge things like airliners, massive bridges and buildings without starting small.
Incremental makes less sense to me when you want to go to mars. Would you propose to write the software for such a mission in an incremental fashion too?
Yet for software systems it is sometimes proposed as the best way.
> We can build huge things like airliners, massive bridges and buildings without starting small.
We did start small with all of those things. We developed rigorous disciplines around engineering, architecture, material sciences. And people died along the way in the thousands[0][1]
People are still dying from those failures; the Boeing 737 MAX 8 crashes were only a few years ago.
> Incremental makes less sense to me when you want to go to mars.
This is yet another reason why a manned Mars mission will be exceedingly dangerous, NOT a strike against incremental development and deployment.
[0] https://en.wikipedia.org/wiki/List_of_building_and_structure...
[1] https://en.wikipedia.org/wiki/List_of_accidents_and_incident...
All of the things you mentioned are designed and tested incrementally. Furthermore, software has been used on Mars missions in the past, and that software was also developed incrementally. It's proposed as the best way because it's a way that has a track record.
> All of the things you mentioned are designed and tested incrementally.
In a different way than what is proposed in this thread. We don't build a small bridge and grow it. We build small bridges, develop a theory for building bridges, and use that to design the big bridge.
I don't know of any theory of computing that would help us design a "big" program at once.
That sounds like the way nature handles growth and complexity: slowly and over long time scales. Assume there will be failures, don't die and keep trying.
When you bite off too much complexity at once you end up not shipping anything or building something brittle.
You just need: Plan -> Implement -> Test -> Repeat
Whether you are creating software, games or whatever, these iterations are foundational. What these steps look like in detail of course depends on the project itself.
That's the ideal, but a lot of these big problems can't start small because the problem they have is already big. A lot of government IT programs are set up to replace existing software and processes, often combining a lot of legacy software's jobs and the manual labor involved.
If you have something like a tax office or payroll, they need to integrate decades of legislation and rules. It's doable, but you need to understand the problem (which at those scales is almost impossible to fit in one person's head) and more importantly have diligent processes and architecture to slowly build up and deploy the software.
tl;dr it's hard. I have no experience in anything that scale, I've been at the edges of large organizations (e.g. consumer facing front-ends) for most of my career.
The accounting, legal and business process requirements are vastly different at different scales, different jurisdictions, different countries, etc.
There's a crazy amount of complexity and customizability in systems like ERPs for multinational corporations (SAP, Oracle).
When you start with a small town, you'll have to throw most of everything away when moving to a different scale.
That's true for software systems in general. If major requirements are bolted on after the fact, instead of designed into the system from the beginning, you usually end up with an unmaintainable mess.
Knowing that the rules for your first small deployment are not the same as the rules for everywhere, is valuable for designing well. Trying to implement all of those sets of rules in your initial deployment, is not a good idea. There is a general principle that you shouldn't code the abstraction until you've coded for the concrete example 2 or 3 times, because otherwise you won't make the right abstraction. Looking ahead is not the same as starting with the whole enchilada for your initial deployment.
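That "code the concrete example 2 or 3 times first" principle (often called the rule of three) can be shown with a toy payroll example. The deduction rules below are invented for illustration; the point is that the abstraction only encodes what the concrete cases turned out to share:

```python
# Toy illustration of the "rule of three": after writing the tax,
# pension, and union-fee deductions concretely, the shared shape
# (a rate applied to gross pay, optionally capped) becomes obvious,
# and only then is it worth abstracting.

def deduction(rate, cap=None):
    """Build a deduction rule: rate * gross, optionally capped at `cap`."""
    def apply(gross):
        amount = gross * rate
        return min(amount, cap) if cap is not None else amount
    return apply

# The abstraction encodes only what the three concrete cases shared.
tax = deduction(0.30)
pension = deduction(0.05)
union_fee = deduction(0.01, cap=40.0)

gross = 5000.0
net = gross - tax(gross) - pension(gross) - union_fee(gross)
print(net)  # 5000 - 1500 - 250 - 40 = 3210.0
```

Had we designed `deduction` before seeing the union fee, we would likely have missed the cap parameter, or worse, added speculative parameters no rule actually needs.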
I do get concerned when the solution is to be more strict on the waterfall process.
I used to believe there were some worlds in which waterfalls are better: where requirements are well known in advance and set in stone. I've since come to realize neither of those assumptions is ever true.
What works at small scale possibly won't work at a huge scale.
But what hasn’t even been tried at a small scale definitely won’t work at a huge scale.
Which is absolutely true, and a reason to try at medium scale second. But what doesn't work at small scale, almost certainly won't work at huge scale.
Imagine if the only way to build a skyscraper was to start with a dollhouse and keep tacking extensions and pieces onto it. Imagine if the only way to build a bridge across San Francisco Bay was to start with popsicle sticks.
The very specific example you chose, payroll, shows how difficult it can be to incrementally step from small to huge. As you grow from town to national, you will run into all the disadvantages without really hitting the advantages. I feel that incremental does help you move from one level to one just a few above, but only if there are enough customers at exactly those starting levels.
When developing for towns, you will have all the small random subsets of the variations imposed by year after year of legal changes BUT only small sales. You will have to implement niche variations in arbitrary aspects for all the towns you have to support AND you will not have the customer base on which to amortize this work. Each new customer will bring a new arbitrary set of legal aspects to be met. Each new customer may be arbitrarily difficult to support.
By the time you reach national, you will have already covered most of the historical legal quirks - but that will have been done in one kludgy manner after another - and then you will hit one more set of legal quirks at the level of national organizations (some of them will have their very own laws). You will now have a very large budget to finalize things but you will be burdened by an illogical software base.
So I agree that you will need experience and subject matter experts who have worked at the various levels. BUT, now that you have this experience you know the degree of flexibility that is required (you know where and what needs to be variable and quirk-friendly and how far the quirks can go = "any size") as well as size-related issues (mailing, transaction, user support volume), and you can now plan for all this AS YOU restart a new development from scratch. Because at this new "master" level you need both systematic flexibility AND resilience at size.
Payroll is exactly the kind of topic where "adding features" will be "fun" - I mean bewildering - while you learn, but probably economically difficult to manage, until it kills you "as you climb up"?
You will be killed by a large software project that can afford to hire away a bunch of your subject matter specialists (or hire new ones) and use them in a "from scratch" project. If you are lucky, this large project will be from the same company, but only if you are lucky.
Now. AFTER you have done the one top-level project - for one country - you will probably be in a good position to sell your services to all kinds of organizations. Because you now have a system in which you can implement ridiculous quirks without breaking everything. And if you have done the job just right, you can onboard smaller customers (towns) economically enough that they can afford your solution.
That's different from where you deploy your solution first. Sure, deploy a national-design solution first to a subset of the target employees - although that does impose still more requirements: now you need to coexist with the legacy solutions, which would be another hard-to-meet handicap when developing for towns first.