Our client wanted to replace a legacy production system with a new one with minimal impact on the end-customers. Here I explain what the biggest challenges were and how we achieved it with almost no downtime.
Jarmo Pertman — 2021-05-11 (11 minute read)
Replacing one system with another with minimal impact on the end-customers can be very easy or almost impossible, depending on the complexity of the system itself and its specific use-cases. For example, replacing a static website with another is no big deal, but replacing a service which has user accounts, orders and third-party integrations is a much harder task. We had to deal with a complex system like that.
Another complication with this particular task was that neither of these systems was developed by us and the original developers were not available for any feedback, so we needed to figure out many details before we could put any specific plans in place.
The Legacy system was responsible for taking orders from the end-customers, sending these orders for consolidation to an external SaaS service, creating invoices after consolidation was completed and so on. Let me explain that with a simplified flow diagram:
On this diagram we can see that the end-customer places an order from the Browser, which is sent to the Legacy system (step [1]).
The Legacy system sends it to the SaaS service for order consolidation (step [2]).
Data required for the debit invoice is sent to the Accounting system (step [3]).
Since the end-customer can order up to a week in advance, order consolidation might happen up to 7 days after order placement. When the order has been consolidated, all necessary information is sent back to the Legacy system by the SaaS service (step [4]).
At this point a credit invoice might need to be created because not all the items initially ordered by the end-customer could be consolidated. In this case the Legacy system sends all the required data to the Accounting system (step [5]).
At the same time the Legacy system sends an e-mail to the end-customer (step [6]) with the consolidated order information and a possible credit invoice, and sends all the required information to the Courier system (step [7]).
Of course there were more systems involved, but let's keep the flow diagram simpler for the sake of this blog post.
The New system has more or less the same flow, but its technology stack and database are completely different. However, it uses the same SaaS service for order consolidation. The main problem with that is that this service doesn't support multiple systems, meaning that it can only send consolidation data back to a single system.
The plan was to run the New system in parallel with the Legacy system for some time before the actual switch-over. This meant that the New system was running in production under a different domain name from the Legacy system and was mainly used by our client's employees (access to it was restricted). It was a real production system, processing real orders and real money.
Because the SaaS service used for order consolidation supported only one endpoint, we needed to add some simple routing code to the Legacy system which would reroute consolidated order data to the New system when the order had originally been placed from there. It was possible to tell the orders apart by their identifiers, because these were unique across the Legacy and New systems. This is how it looks on the flow diagram:
If an order was placed from the Legacy system B1 (step [1]), everything worked the same as before - the order was sent to the SaaS service (step [2]).
A similar flow was used when an order was placed from the New system B2 (step [3]) - the order was sent to the same SaaS service as was used by the Legacy system (step [4]).
After the order was consolidated, it was sent back to the Legacy system (step [5]), where our small routing code decided whether it needed to be rerouted to the New system or not (step [6]).
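The routing decision made in the last step can be sketched roughly as follows. This is a minimal illustration in Python, not the actual production code: the `N-` identifier prefix, the function names and the two handler callbacks are all hypothetical. The only thing taken from the real setup is that the order identifier alone was enough to tell which system an order came from.

```python
from typing import Any, Callable, Dict

# Assumption for illustration only: identifiers created by the New system
# start with "N-". The real distinguishing rule is not described in this post.
NEW_SYSTEM_PREFIX = "N-"


def placed_from_new_system(order_id: str) -> bool:
    """Return True if the identifier looks like one issued by the New system."""
    return order_id.startswith(NEW_SYSTEM_PREFIX)


def route_consolidated_order(
    order: Dict[str, Any],
    handle_locally: Callable[[Dict[str, Any]], None],
    forward_to_new_system: Callable[[Dict[str, Any]], None],
) -> str:
    """Dispatch a consolidated order and report where it went."""
    if placed_from_new_system(order["id"]):
        forward_to_new_system(order)  # e.g. an HTTP call to the New system
        return "new"
    handle_locally(order)  # the Legacy system processes it as before
    return "legacy"


if __name__ == "__main__":
    handled, forwarded = [], []
    print(route_consolidated_order({"id": "10542"}, handled.append, forwarded.append))
    print(route_consolidated_order({"id": "N-77"}, handled.append, forwarded.append))
```

Keeping the decision in one small function means the same check can be reused in the opposite direction later, when the New system becomes the receiver of consolidation data.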
Since the databases of the Legacy and New systems were completely different, some migrations had to be created for user accounts and past orders.
Also, all orders placed via the Legacy system before the switch-over to the New system had to be handled correctly when the SaaS service sent back the consolidation data.
Since end-customers could place their orders up to a week in advance, the Legacy system needed to keep running for at least a week after the switch-over to handle consolidated orders, send out e-mails and create invoices.
The easiest possible solution would have been to stop taking any new orders from the end-customers for a week prior to the switch-over, but this was an unrealistic (and quite stupid) solution because it would have meant that our client's business would have stopped for that period. I would like to see someone offer a solution like that to their client while keeping a straight face. Of course this solution was not an option.
Once the Legacy and New systems were running in parallel and the routing logic for consolidated orders was working, we decided that the switch-over would basically happen via a DNS configuration change - the domain name pointing to the Legacy system would be switched to the New system's IP address and all end-customers would end up using the New system just like that. The maximum planned downtime was a few minutes, with only a handful of customers affected.
Plan sounded really easy - how hard is it to change DNS entries after all? Of course there were still other things to do before doing that.
The first thing which needed to be done was to lower the TTL of the DNS configuration to the lowest possible value so that all DNS servers would pick up the configuration change within a minute.
After that we needed to start coding to migrate users and their data from the Legacy system database to the New system database. We decided to create an ongoing migration process, which we could start running way before the switch-over. It was basically a forever running migration process like this:
while (true) {
  migrateUsers()
}
This solution allowed us to migrate all users continuously so that at the time of the switch-over everything would be up to date and we would not need any additional downtime to wait for the migration process to complete.
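For a loop like the one above to be safe to run over and over, each pass has to be idempotent and should ideally only touch records that changed since the previous pass. Below is a minimal Python sketch of that idea, under stated assumptions: the in-memory dictionaries stand in for the two databases, and the `updated_at` field, the watermark approach and the function names are hypothetical, not a description of the real migration code.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical stand-in for a database table: user id -> user record.
UserTable = Dict[int, Dict[str, Any]]


def migrate_users(legacy: UserTable, new: UserTable, since: float) -> float:
    """Copy users changed after `since` into `new` and return the new watermark.

    Each copy is an upsert, so repeating a pass with unchanged data
    leaves the target untouched.
    """
    watermark = since
    for user_id, user in legacy.items():
        updated_at = float(user["updated_at"])
        if updated_at > since:
            new[user_id] = dict(user)  # insert or overwrite
            watermark = max(watermark, updated_at)
    return watermark


def run_forever(
    legacy: UserTable,
    new: UserTable,
    should_stop: Callable[[], bool],
    pause_seconds: float = 1.0,
) -> None:
    """The endless loop, with a pause between passes and a way to stop it."""
    watermark = 0.0
    while not should_stop():
        watermark = migrate_users(legacy, new, watermark)
        time.sleep(pause_seconds)
```

A real implementation would also have to deal with deleted users and with records that share the same timestamp as the watermark. The sketch leaves both out.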
One problem with migrating users from one system to another is that their login credentials should also work in the New system. However, the Legacy and New systems used different password hashing algorithms. This meant that some code had to be written which would allow end-customers to log in with their Legacy system password and would migrate it to the New system's algorithm after they had successfully logged in to the New system for the first time.
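The rehash-on-first-login idea can be sketched like this in Python. The concrete algorithms are assumptions chosen only to make the example runnable (a salted SHA-256 standing in for the Legacy scheme and PBKDF2 standing in for the New one), and the `scheme`, `salt` and `hash` fields are hypothetical. The post does not say which algorithms the two systems actually used.

```python
import hashlib
import hmac
import os
from typing import Any, Dict


def legacy_hash(password: str, salt: bytes) -> bytes:
    """Assumed Legacy scheme (hypothetical): a single salted SHA-256."""
    return hashlib.sha256(salt + password.encode("utf-8")).digest()


def new_hash(password: str, salt: bytes) -> bytes:
    """Assumed New scheme (hypothetical): PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def login(user: Dict[str, Any], password: str) -> bool:
    """Verify the password and upgrade a Legacy hash on the first success."""
    if user["scheme"] == "new":
        return hmac.compare_digest(new_hash(password, user["salt"]), user["hash"])
    if not hmac.compare_digest(legacy_hash(password, user["salt"]), user["hash"]):
        return False
    # The plain-text password is only available at login time, so this is
    # the one moment when it can be re-hashed with the new algorithm.
    user["salt"] = os.urandom(16)
    user["hash"] = new_hash(password, user["salt"])
    user["scheme"] = "new"
    return True
```

The migrated user record carries the old hash together with a marker of which scheme produced it, so users who never log in again simply keep their Legacy hash.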
Unfortunately, migrating previous orders from the Legacy system would have been too complicated because of the differences in how the two systems stored orders in their databases, but we still migrated each end-customer's most frequently ordered products into the Favorite Products functionality available in the New system.
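Deriving favorites from order history boils down to counting products per customer and keeping the top few. Here is a minimal Python sketch of that, assuming the Legacy order history can be read as simple (customer, product) pairs. The function name, the input shape and the `top_n` cut-off are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def favorite_products(
    order_lines: Iterable[Tuple[str, str]], top_n: int = 10
) -> Dict[str, List[str]]:
    """Map each customer to their `top_n` most frequently ordered products.

    `order_lines` yields (customer_id, product_id) pairs, one per ordered item.
    """
    counts: Dict[str, Counter] = {}
    for customer_id, product_id in order_lines:
        counts.setdefault(customer_id, Counter())[product_id] += 1
    return {
        customer_id: [product for product, _ in counter.most_common(top_n)]
        for customer_id, counter in counts.items()
    }
```

This sidesteps the incompatible order schemas entirely: only product identifiers need to be mapped between the two systems, not whole orders.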
Now that we had multiple forever-running migration processes keeping state in sync between the Legacy and the New system, we still needed to handle consolidated orders coming back from the SaaS after the switch-over.
After the DNS configuration change the SaaS would send consolidated orders back to the New system, as expected. Now the New system needed to reroute consolidated orders to the Legacy system when these orders had originally been placed from there. Again, some simple rerouting code needed to be written:
The order was created in the New system when placed from the Browser (step [1]).
The order was then sent to the SaaS service for consolidation (step [2]).
The consolidated order was then sent back to the New system (step [3]), which rerouted it to the Legacy system if its identifier showed that it had originally been placed there.
Remember, orders could have been placed a week in advance just before switch-over from the Legacy system to the New system. This meant that the New system needed to be ready to reroute orders to the Legacy system for at least a week.
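The "at least a week" reasoning is simple date arithmetic, shown here as a small Python sketch. The 7-day window comes from the ordering rules described earlier. The function names, the safety margin parameter and the example date are hypothetical.

```python
from datetime import datetime, timedelta

# From the post: end-customers could order up to a week in advance.
MAX_ADVANCE_ORDERING = timedelta(days=7)


def earliest_legacy_shutdown(
    switchover: datetime, safety_margin: timedelta = timedelta(0)
) -> datetime:
    """Earliest moment after which no Legacy-placed order should still be
    waiting for consolidation, plus an optional safety margin."""
    return switchover + MAX_ADVANCE_ORDERING + safety_margin


def must_keep_rerouting(
    now: datetime, switchover: datetime, safety_margin: timedelta = timedelta(0)
) -> bool:
    """True while the New system still has to forward orders to Legacy."""
    return now < earliest_legacy_shutdown(switchover, safety_margin)


if __name__ == "__main__":
    # Hypothetical switch-over moment, used only for illustration.
    switchover = datetime(2021, 3, 1, 9, 0)
    print(earliest_legacy_shutdown(switchover))
```

In practice a generous safety margin is cheap, since the rerouting code is small and the Legacy system only has to stay up, not change.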
Now that all components were in place a date was set for the actual switch-over.
One mistake which is made pretty often with these kinds of Big Bang releases is that they are done at night at the end of the week - this means that everyone involved will be very tired or that some key person will be out of reach when they are really needed. Tiredness causes problems, and unexpected situations might not be handled as well as they could have been. These kinds of releases never happen without surprises.
For us everything happened on a Monday morning during normal working hours (of course our client wanted it to happen at night, but we persuaded them not to make that same mistake). This allowed us to be fresh and ready to tackle any upcoming problems. Also, the rest of the week was still ahead for fixing any other problems that might occur after the release.
The day before, an informational banner was put on the Legacy system frontend to warn end-customers about the upcoming service interruption. Then the DNS configuration was changed and voilà - the Legacy system had been replaced with the New system.
Since the Legacy and New systems both use a so-called Single Page Application (SPA) technology stack, end-customers who were on the Legacy system frontend during the switch-over needed to refresh their browser to land on the New system frontend (this would not have been a problem at all with a non-SPA frontend). Also, every customer who was in the middle of filling their cart lost its contents, because migrating carts would have been unreasonably complex.
All in all, only a few end-customer complaints related to the actual switch-over reached customer support. There were of course some complaints about some non-critical broken functionality or about the Legacy system being better because of X or Y.
The switch-over itself went according to plan - the downtime of the service was as planned, only a few minutes, and everything seemed to work smoothly. Or did it?
As already stated above, Big Bang releases like these will always have some surprises coming up.
One of the biggest surprises came from one of the biggest ISPs in Estonia, which did not respect the DNS configuration change for up to 3 hours. Every other DNS server we tested against used the new DNS configuration, but that big ISP didn't. In the end they even said that the problem was caused not by them but by our DNS configuration. We couldn't find any problems with it, and all the other DNS servers had no trouble with it either. That kind of problem was something which we could not have foreseen even in our darkest dreams. But then again, there's always Murphy.
Another problem was quite a big memory leak which didn't show up during performance testing because that part of the functionality was turned off. It was the part which handled sending out e-mail and SMS notifications to the end-customers, and having it turned on during performance testing would have caused additional cost to our client. Again, Murphy was involved. But thanks to our monitoring setup we could see that problem on the charts and had a battle plan in place to tackle it.
The days after the switch-over brought some new minor problems reported by the end-customers, but nothing which was critical or hindered our client's business itself.
It was a big relief for us since, as I already stated above, the application was not written by us. We had already seen multiple places where the code smelled of bad practices, and we couldn't give any guarantees that there wasn't anything worse someplace else. Yes, the code has been full of problems, but thankfully nothing significant. We have been improving the code-base every day by removing unneeded code, fixing it and adding new functionality.
A few weeks after the switch-over the best thing happened which could happen within any software development project - we deleted all the code needed for migrations, routing etc. and shut down the Legacy system completely.
It has been a pleasure to be part of that project since it was quite a challenge to come up with the plan for replacing the Legacy system with the New system. It's one of those projects where the client can't assign us any tasks because even they don't know what should be done from a technical perspective. Their goal is just to replace one system with another, but the developer's goal is to think through how to get there while causing as little pain as possible to the client and their end-customers. Of course there needs to be someone on the client's side who knows the business processes really well, otherwise it's impossible to do that kind of task.
The biggest trick with these kinds of projects is not to write a lot of code, but to write as few lines of code as possible in the correct locations. The code written to support replacing the Legacy system with the New system was maybe a few hundred lines long in total. But it took some time to get to these lines.
In the end we considered it to be a successful project and as far as we know our client is happy too.