Posted: March 17, 2020 | by Joachim Haller (Red Hat Accelerator, Sudoer)

Whether you’re moving your infrastructure to another data center, to a cloud provider, or to a colocation service, these guidelines will help you complete the tasks.

Corporate infrastructure always looks good in PowerPoint. It is well structured, the lines are straight, and clear colors visualize all the well-known configuration items, domains, and functions. Catchy names and abbreviations map out the most critical applications. A fuzzy cloud on top of the picture represents the customers, and underneath the infrastructure picture, there are some location names that represent the company users. The whole thing looks like a very functional building that is modern, solid, nice, and clean. Everything works and everything has a purpose. That’s the vision, but it’s not the reality.

Reality check

Once you close the presentation, leave the conference room, and head back to the corridors where you and your colleagues work, the daily interactions differ greatly from the PowerPoint glamour. You are most likely so used to it, and know so well how to get stuff done, that process changes and new org charts do not bother you. That lasts until the day the old data center has to move to the cloud or some other location.

Shutting down or moving a legacy data center is like being forced to give up the ancestral mansion where generations have carved their path of life. Your role changes from a leisure tenant on familiar grounds to that of an archaeological researcher working in a parallel universe, applying scientific techniques to understand the intricate relationships between the endlessly entangled and continuously emerging artifacts.

How difficult can it be?

Moving to a new data center will awaken the sensation that someone should have done something about so many things, years ago. The desperate but futile wish to use the basic arithmetic one picked up in primary school is, of course, left unanswered. You have 200 applications. You shut down one of them. How many are left? Logic tells you it should be 199; however, the correct answer is 204, because shutting down one application will instantly reveal a handful of undocumented “features” that turn out to be small applications, and they all start to behave like tired children in a toy shop the very second after a parent said “No.” 

And by the way, none of these smaller but crucially important little applications were shown in the PowerPoints or documented in the configuration management database (CMDB). Most likely, the people or companies that built the applications have evaporated, and the programming language used is obsolete. Welcome to reality.

Traction is not instant

During the first few weeks of shutting down a data center, there’s excitement, and everyone can unite around the reasons why the glamorously aggregated and slightly over-optimistic value proposition was accepted by management. Meetings are held, plans are made, investigations are initiated, and emails are sent, while providers flock like vultures around an injured animal. Ambition goes hand in hand with overconfidence, and most of the plans, especially the financial ones, seriously underestimate the complexity lurking within every server rack.

The train starts to move

The new environment, be it a new data center or something in the cloud, is filled with promises, and this spurs efforts to get the train in motion. The first few applications and functions are moved. The emptied servers are shut down, and the first few incidents come to life. The Incident Management System is, of course, not prepared to handle these strange backward workflows initiated by people who are not even registered in the CM system. So, as expected, there is a bit of a buzz before the new ways of working are mapped correctly and the incidents can be managed accordingly.

The source of information

The CMDB is your compass and will help you navigate the shallow waters between old and new. However, you should be aware that duplicate or missing records can play many tricks, especially when it comes to licensing and monitoring. Whatever workaround automation has gone into the CMDB has to go OUT. We created a weekly “trust index” for the CMDB, where we tried to rate how accurate the CMDB was. This index was useful in discussions with managers and other teams across the globe, to ensure that nobody made “100% plans” and that everyone instead incorporated reasonable room for error.
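As a rough illustration of how such a trust index could be computed, here is a minimal Python sketch. It assumes the index is simply the share of spot-checked CMDB records that turned out to be fully accurate. That formula, the `AuditResult` fields, and the sample host names are all hypothetical and are not the method actually used in the project.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Outcome of manually checking one sampled CMDB record (hypothetical schema)."""
    hostname: str
    found_in_reality: bool  # the host actually exists where the CMDB says it does
    details_correct: bool   # owner, license, and monitoring entries match reality


def trust_index(results: list[AuditResult]) -> float:
    """Return the percentage of sampled records that were fully accurate."""
    if not results:
        raise ValueError("no audit results to score")
    accurate = sum(1 for r in results if r.found_in_reality and r.details_correct)
    return round(100.0 * accurate / len(results), 1)


def planning_buffer(index: float) -> float:
    """Assumed rule of thumb: plan for as much slack as the CMDB is untrustworthy."""
    return round(100.0 - index, 1)


# One week's spot check (made-up data).
week_sample = [
    AuditResult("app01", True, True),
    AuditResult("app02", True, False),     # wrong license entry
    AuditResult("db01", True, True),
    AuditResult("old-ftp", False, False),  # record exists, server does not
]

index = trust_index(week_sample)
print(f"trust index: {index}%, suggested planning buffer: {planning_buffer(index)}%")
```

Tracking this number week by week makes it easy to show whether the CMDB cleanup is actually improving the data that the plans rely on.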

A well, or at least reasonably well, maintained CMDB is a great companion. A neglected CMDB will need a lot of work to get into shape and will most likely generate many meetings where emotions run high as license gaps and various caveats are exposed. Once you get the CMDB going, you are on your way to completing the plan, because you can then determine what can be moved as is, what has to be migrated/changed, and what ends up in Sunset Park because it can be replaced or decommissioned.

Project dynamics

At the beginning of every large project, there are usually a few opinionated profiles who thrive on the change and the dynamics; however, when the project settles into a routine of moving servers, upgrading, rebuilding, and shutting down, these profiles run out of steam and leave.

The project is now part of business as usual, and the list of servers is checked and rechecked and constantly updated. The CMDB has settled, and the “out with the old, in with the new” routines work as expected. 

Applications are confirmed with the sometimes reluctant owners, who are forced to make decisions on a number of things such as costs, access rights, licensing, life cycle management, re-platforming, updates and patching, downtime, testing, etc.

Interesting (to say the least) workarounds keep popping up from nowhere. System and network administrators have the curtain of comfort removed, while consultants and providers are dragged out from under the blanket of bliss to show their true colors.

There will be a constant string of surprises, and they usually trace back to one of the following:

  • Workaround
  • Temporary fix (that was forgotten)
  • Undocumented solution/application
  • Something hardcoded
  • Known issue that has been ignored
  • A test that somehow morphed into production
  • A proof of concept that was left to linger (and grow)

What to move?

Some of the easier applications and less complicated hardware are moved to the new location, which makes the server racks look like the mouth of a six-year-old child: full of gaps. This is the best proof that things are happening.

Some services will stay, and the function will be progressively migrated over to the new location, and once completed, the original service hardware is decommissioned.

The plans will look different depending on the capacity and quality of the data links to the new location. Many functions need to stay up while being migrated, whereas others can (or must) be shut down for the move.

Don’t change too many things at the same time, or you will have great difficulty identifying the source of the issue once something goes wrong.

Consider how you would move:

  • DNS servers—should be easy but watch out for applications with hardcoded DNS entries
  • Networks—office and production sometimes exist oblivious to one another
  • Firewalls—always scary and have a massive impact, so go step-by-step
  • Directory servers—merge to use the same technology if possible
  • Antivirus—provides the opportunity to review what is protected and what is not
  • Support functions such as FTP servers, gateways, connection frameworks, etc.—review all connections and clean up (this can be anything from a really big task to a monumental one)
  • File servers—a very big chunk of data that needs to be moved incrementally
  • Slow data storage—depending on link speed and quality, this could be a candidate for a physical move if the storage technology is still valid and within support
  • Database servers—if possible extend clusters across the link to the new location
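Hardcoded addresses, such as the DNS entries mentioned above, are one of the few items on this list that you can hunt for mechanically. The following Python sketch scans configuration text for IPv4 literals. The sample config, its key names, and the addresses are made up for illustration. Treat the output as a list of candidates to review, since a four-part version string such as 1.2.3.4 would also be reported.

```python
import ipaddress
import re

# Rough candidate pattern; real validation is done by the ipaddress module below.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(text: str) -> list[tuple[int, str]]:
    """Return (line_number, address) pairs for every valid IPv4 literal in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in _IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # not a valid address, e.g. an octet above 255
            hits.append((lineno, candidate))
    return hits


# Hypothetical application config with two hardcoded resolvers.
sample_config = """\
# hypothetical application config
resolver = 10.20.0.53
db_host = db01.example.internal
fallback_dns = 10.20.0.54
app_version = 1.2.3
"""

for lineno, address in find_hardcoded_ips(sample_config):
    print(f"line {lineno}: hardcoded address {address}")
```

Running something like this across application config directories before the DNS move gives the owners a concrete list to work through instead of a general warning.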

The grand master plan

The grand master plan is where all the servers and applications are listed. This is also where you list the names of application owners and the agreed-upon dates when things are to be moved. The plan should be centrally available so everyone can actively contribute and stay up to speed with what’s happening. This is where you agree on what goes into the three different tracks:

  1. Move as is (best)
  2. Move and upgrade (scary)
  3. Sunset (okay, but you might need to keep the application data due to regulatory requirements)
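To make the structure of such a plan concrete, here is a minimal Python sketch of one possible representation. The three tracks, the owner, and the agreed-upon date come from the description above. The class and field names, the sample applications, and the date are hypothetical, and a real plan would more likely live in a shared spreadsheet or tracking tool.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Track(Enum):
    """The three tracks of the plan."""
    MOVE_AS_IS = "move as is"
    MOVE_AND_UPGRADE = "move and upgrade"
    SUNSET = "sunset"


@dataclass
class PlanEntry:
    """One row of the plan (field names are illustrative)."""
    name: str
    track: Track
    owner: Optional[str] = None       # application owner, once confirmed
    move_date: Optional[date] = None  # agreed-upon date, once confirmed
    retain_data: bool = False         # sunset, but data kept for regulatory reasons


def by_track(plan: list[PlanEntry]) -> dict[Track, list[str]]:
    """Group application names by track, keeping empty tracks visible."""
    grouped: dict[Track, list[str]] = {track: [] for track in Track}
    for entry in plan:
        grouped[entry.track].append(entry.name)
    return grouped


def unresolved(plan: list[PlanEntry]) -> list[str]:
    """Names of entries that still lack an owner or an agreed date."""
    return [e.name for e in plan if e.owner is None or e.move_date is None]


# Made-up example entries.
plan = [
    PlanEntry("intranet-wiki", Track.MOVE_AS_IS, "a.owner", date(2020, 5, 4)),
    PlanEntry("billing", Track.MOVE_AND_UPGRADE, "b.owner"),
    PlanEntry("legacy-archive", Track.SUNSET, retain_data=True),
]

print("still to be agreed:", unresolved(plan))
```

The useful part is less the data model than the `unresolved` query: every entry without an owner or a date is a conversation that still has to happen.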

19 steps to heaven

Once you get into the groove of moving, you need a checklist for decommissioning a server. “19 steps to heaven” proved itself in practice, and it goes as follows:

  1. Check the CMDB for information on the current host. Refresh and append what is obviously missing.
  2. Check the documentation to see whether the server admin password is hardcoded in any application or service that accesses the current server. Normally, these things are never documented, so expect some surprises. Applications with hardcoded passwords MUST be changed. Communicate this mandatory “best practice” to developers, admins, and solution owners. Communicate a date when the server password will be changed.
  3. Change the admin password on the current server. Remaining hardcoded workarounds will quickly surface. Keep track of them and enforce the necessary changes. Another benefit of changing the server’s password is that it prevents anyone else who happens to know the old password from making a change that could compromise the next steps in this process.
  4. Check and document all services that start automatically on boot.
  5. Change all relevant auto-start services to manual.
  6. Check the server logfile for INCOMING connections from other applications and services, and stop these events at the initiating side. Keep at it until the logfile shows no more incoming events. Collaborate and communicate.
  7. Check the server logfile for OUTGOING connections and terminate these. Collaborate and communicate.
  8. Remove the server from all monitoring, including the collection of logs and other metrics.
  9. Remove the server from the group for auto-patching.
  10. Remove the server from the group for antivirus updates.
  11. Remove the server from the group managing backups.
  12. Review the backups from the server in the storage and delete what is not relevant.
  13. Optional but recommended: Take a final backup of the complete server. Set an expiration date when the server backup can be safely deleted (e.g., 30, 60, or 90 days).
  14. Document all the software licenses that can be discontinued and communicate this to the appropriate contract/license management team.
  15. Make sure you receive “discontinue confirmation” from the team managing licenses. 
  16. Set the server to “decommissioned” in the CMDB.
  17. Shut down the server and physically remove its power supply.
  18. Look and listen for incidents related to the server being turned off.
  19. Physically remove the server and finish up the documentation.

If the data center only has one server, you can do this manually; otherwise, you should script as much of this procedure as possible. Using an automation tool like Ansible to repeat the procedure is good practice and will speed things up while ensuring nothing gets lost.
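Even before the individual steps are automated, a small script can make sure they are carried out in order for every server. The Python sketch below is one way to do that. It uses a condensed version of the checklist rather than all 19 steps, and the class name, step labels, and host name are illustrative assumptions. It does not perform any of the steps itself, and it is not a substitute for an Ansible playbook.

```python
# Condensed, illustrative version of the decommission checklist above.
DECOMMISSION_STEPS = (
    "refresh CMDB record",
    "change admin password",
    "set auto-start services to manual",
    "stop incoming connections",
    "stop outgoing connections",
    "remove from monitoring, patching, antivirus, and backup groups",
    "take final backup",
    "confirm license discontinuation",
    "mark decommissioned in CMDB",
    "shut down and disconnect power",
)


class DecommissionChecklist:
    """Tracks progress for one server and refuses out-of-order steps."""

    def __init__(self, hostname, steps=DECOMMISSION_STEPS):
        self.hostname = hostname
        self.steps = list(steps)
        self.done = 0  # number of steps completed so far

    def next_step(self):
        """Return the next pending step, or None when everything is done."""
        return self.steps[self.done] if self.done < len(self.steps) else None

    def complete(self, step):
        """Mark a step as done; raise if it is not the next one in line."""
        expected = self.next_step()
        if expected is None:
            raise RuntimeError(f"{self.hostname}: checklist already finished")
        if step != expected:
            raise RuntimeError(
                f"{self.hostname}: cannot do {step!r} before {expected!r}"
            )
        self.done += 1

    def finished(self):
        return self.done == len(self.steps)


checklist = DecommissionChecklist("srv042")  # hypothetical host name
checklist.complete("refresh CMDB record")
print("next:", checklist.next_step())
```

The point of the guard in `complete` is that nobody can record a shutdown for a server that is still in the backup group or still receiving connections.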

29 steps to move an app

Moving an application or service the old way is a daunting task that involves many teams and will take plenty of time. Here is an example workflow that was used when shutting down a data center:

  1. Service mapping
  2. Document architecture in current mode of operation
  3. Identify interdependencies
  4. Document the current application testing approach
  5. Document architecture for future mode of operation
  6. Security approach
  7. Confirm key interdependencies of Future Mode of Operation (FMO)
  8. Migration/deployment approach
  9. Test approach
  10. Agree on high-level design
  11. Technical specification—order environment
  12. Create detailed migration/deployment plan
  13. Detailed cutover plan
  14. Draft test plans
  15. Infrastructure delivered
  16. Build and test
  17. Handover servers (from infra)
  18. Validate the provided servers
  19. Perform migration/deployment
  20. Create go-live authorization checklists
  21. Go-live support plan
  22. Contingency and recovery plans
  23. Create service acceptance criteria
  24. Validate migration/deployment
  25. Perform cutover
  26. Go-live tests
  27. Go-live authorization checklist
  28. Handover to operations and service delivery
  29. Initiate decommission of old service

Do not abandon the plan; it is your only point of reference. It is fine to make changes to the plan and to move things around, but as long as you stick to it, it remains the starting point for any change-related discussion.

I would also like to give credit to my friend and former colleague Marcel Laurenz. Marcel developed and used the “29 steps to move an application” workflow.

https://www.redhat.com/sysadmin/how-shutdown-datacenter