Comment by shagie
3 days ago
> If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level.
At a large box retail chain (15 states, ~300 stores) I worked on a project to replace the POS system.
The original plan had us getting everything working (Ha!) and then deploying it out to stores and then ending up with the two oddball "stores". The company cafeteria and surplus store were technically stores in that they had all the same setup and processes but were odd.
When the team that I was on was brought into this project, we flipped that around and first deployed to those two several months ahead of the schedule to deploy to the regular stores.
In particular, the surplus store had a few dozen transactions a day. If anything broke, you could do reconciliation by hand. The cafeteria had single register transaction volume that surpassed a surplus store on most any other day. Furthermore, all of its transactions were payroll deductions (swipe your badge rather than credit card or cash). This meant that if anything went wrong there we weren't in trouble with PCI and could debit and credit accounts.
Ultimately, we made our deadline to get things out to stores. We did have one nasty bug that showed up in late October (or was it early November?) with repackaging counts (if a box of 6 was $24 and a single item was $4.50, then buying 6 single items was "repackaged" to cost $24 rather than $27), which interacted with a BOGO sale. That bug resulted in absurd receipts with sales and discounts: the receipt showed you spent $10,000 but were discounted $9,976, and then the GMs got alerts that the store was not able to make payroll because of a $9,976 discount. One of the devs pulled an all-nighter to fix that one and it got pushed to the stores.
I shudder to think about what would have happened if we had tried to push the POS system out to customer-facing stores before the performance issues in the cafeteria were worked out, or if we had to reconcile transactions to hunt down incorrect tax calculations.
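For anyone trying to picture the repackaging rule, here is a minimal sketch using the prices from the example above. The function name and the way quantities are split are my own assumptions for illustration; this is not the actual register code, and it does not attempt to reproduce the BOGO interaction.

```python
from decimal import Decimal

# Prices from the example: a box of 6 is $24, a single item is $4.50.
BOX_SIZE = 6
BOX_PRICE = Decimal("24.00")
SINGLE_PRICE = Decimal("4.50")


def repackaged_total(qty, single_price, box_size, box_price):
    """Hypothetical repackaging rule: every full group of `box_size`
    singles is charged at the box price, leftovers at the single price."""
    boxes, singles = divmod(qty, box_size)
    return boxes * box_price + singles * single_price


naive = 6 * SINGLE_PRICE  # six singles with no repackaging: 27.00
repacked = repackaged_total(6, SINGLE_PRICE, BOX_SIZE, BOX_PRICE)  # 24.00
```

Any promotion that also rewrites line items (such as a BOGO sale) has to agree with this rule about which quantity and price it is discounting, which is where this kind of bug tends to live.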
You could have, in principle, implemented the new system to be able to run in "dummy mode" alongside the existing system at regular stores, so that you see that it produces the 'same' results in terms of what the existing system is able to provide.
Which is to say, there is more than one approach to gradual deployment.
Not easily when issues of PCI get in there.
Things like the credit card reader (and magnetic ink reader for checks), a different input device (sending the barcode scanner to two different systems), and keyboard input (completely different screens and keyed entry) would have been hardware problems that also needed to be solved.
The old system was a DOS-based one where a given set of F-keys were used to switch between screens on a . Need to do hand entry of a SKU? That was F4 and then type the number. Need to do a search for the description of an item? That was F5. The keyboard was particular to that register setup and used an old school XT (5 pin DIN) plug. The new systems were much more modern Linux boxes that used USB plugs. The mag strip reader was flashed to new screens (and the old ones were replaced).
For this situation, we couldn't simply send keyboard, scanner, and credit card events to another register.
What's PCI?
Sorry, I'm not familiar with all the acronyms.
From my experience a lot of the hardest problems in this space are either 1. edge cases or 2. integration-related and that makes them hard to validate across systems or draw boundaries around what's in the dummy mode. This type of parallel, live, full system integration test is hard to pull off.
In 1997 I was working on an integration between AOL and Circuit City (ha, I outlived them both) to enable free AOL accounts for people buying PCs or some such. About a week before launch I changed the data returned from encoding spaces as "+" to "%20" and broke their integration (a Perl script). Very upsetting for them, and I felt bad.
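To illustrate why that change can break a consumer, here is a small Python sketch of the two space encodings. The `naive_decode` function is a hypothetical stand-in for a consumer that only understands "+"; the thread does not say what the Perl script actually did.

```python
from urllib.parse import quote, quote_plus, unquote_plus

value = "free account"

plus_form = quote_plus(value)  # form-style encoding: 'free+account'
percent_form = quote(value)    # percent encoding:    'free%20account'


def naive_decode(s):
    """Hypothetical consumer that only turns '+' back into a space."""
    return s.replace("+", " ")


# The naive consumer works for the old format and silently breaks on the new one.
old_ok = naive_decode(plus_form)         # 'free account'
new_broken = naive_decode(percent_form)  # 'free%20account'

# A decoder that handles both forms is robust to the change.
robust_old = unquote_plus(plus_form)     # 'free account'
robust_new = unquote_plus(percent_form)  # 'free account'
```

Both encodings are legitimate in their own contexts, so neither side is "wrong"; the breakage comes from changing the contract a week before launch.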
I also had some weird bug when we started registrations from German accounts and I didn't handle umlauts (or UTF-16 with nuls in the string) in passwords properly.
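A quick sketch of how UTF-16 and NUL bytes can bite, assuming (my guess, not stated above) that some layer treated the password as a NUL-terminated byte string. The sample password is made up.

```python
password = "Müller"  # made-up example containing an umlaut

utf16 = password.encode("utf-16-le")  # every ASCII character gets a 0x00 high byte
utf8 = password.encode("utf-8")       # 'ü' becomes two bytes, no NULs
latin1 = password.encode("latin-1")   # 'ü' becomes one byte (0xFC)

# Code that stops at the first NUL byte, as C string functions do,
# sees only the first character of the UTF-16 form.
truncated = utf16.split(b"\x00", 1)[0]

has_nul = b"\x00" in utf16
```

The same password also produces three different byte strings across the three encodings, so any comparison or hash done on bytes has to agree on the encoding first.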
Sounds good in theory, but very few real-world projects can afford to run the old system in parallel.
>> We did have one nasty bug that showed up in late October (or was it early November?)
Having worked in Ecommerce & payment processing, where this weekend is treated like the Superbowl, birth of your first child and wedding day all rolled into one, a nasty POS bug at this time of year would be incredibly stressful!
After thinking back on it, I think this was earlyish October. The code hadn't frozen yet, but it was getting increasingly difficult. We were in the "this was deployed to about 1/3 of the stores, all within an 8 hour drive of the general office" phase. The go/no-go decision for the rest of the stores in October was coming up (and people were reviewing backout procedures for those 100). One of the awkward parts was that marketing had a Black Friday sale that they really wanted to do (buy X, buy Y, get Z half price) that the old registers couldn't support. They wanted to get an "is this going?" answer so they could start printing the advertising flyers.
Incidentally, this bug resurfaced for the next five years in a different incarnation. Because the system had recorded that this department (it was with one SKU) had sold $10M this week in October, the running average sales target the next year was MEAN($24k, $25k, $26k, $25k, $10M) ... and the department heads were doing a "you want me to sell how much?!"
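To put a number on that, here is the arithmetic for the five figures above. The median is included only as a point of comparison on my part; nothing here suggests the real system used anything other than a mean.

```python
from statistics import mean, median

# The five weekly figures from the example, in dollars.
weekly_sales = [24_000, 25_000, 26_000, 25_000, 10_000_000]

mean_target = mean(weekly_sales)      # 2,020,000: dominated by the one bad week
median_target = median(weekly_sales)  # 25,000: unaffected by the outlier
```

One bogus week turns a roughly $25k target into about $2M, and it stays in a five-year window until it ages out.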
This bug had only affected... maybe five stores (still maybe five too many). We were in the "this is the last™ build before all store deployment next week" territory. It did mess with that a bit too as the boxed up registers came with an additional step of "make sure to reboot the register after doing initial confirmation."
The setup teams had a pallet of computers delivered to the stores that were supposed to be "remove the old registers, put these registers in, swap mag strip readers, take that laptop there and run this software to configure the devices on each register." However, the build that the registers had was the buggy build. While that build likely wouldn't hit that bug (it required a particular sale to be active which was only at a few stores and had ended) it still was another step that they had to follow.
Aside: For all its clunkiness, Java Web Start was neat. In particular, it meant that instead of trying to push software to 5k registers (how do you push to registers that are powered off?), we'd push to 300 stores, and from there JWS would check for an update each time it started up ( https://docs.oracle.com/javase/8/docs/technotes/guides/javaw... ). So instead of pushing to 5k registers, we'd have each register pull from 'posupdate' on the local network when it rebooted.