There are many ways to manage your computers/infrastructure. One is to back everything up, use RAID disk arrays, expensive redundant systems, hot-swap VM technology, etc. The idea is that the server or system never goes down. The problem with this approach is that there are many more failure modes than disk or hardware failure. An admin could accidentally delete some files (oops, that is not where I meant to run rm -rf). Hackers could get in, and then you have no choice but to rebuild to be sure the system is clean. Operating systems rot over time (some more than others). The more complex a system is, the more failure points there are, even if you have redundancy.
The alternative is to design for complete system failure. Most applications can tolerate a couple of hours of downtime if it is rare. So the alternative approach is to keep things simple, focus on automating installation/configuration, and back up only the data. The idea is to be in a state where you can rebuild any system in a reasonably short amount of time. One of the side benefits is that it forces you to completely document your installation process. If you have a “golden” machine where the entire system is redundant and backed up, you tend to get sloppy. Tweaks and hacks accumulate over time, and nobody knows how the machine got into its current state. A rebuild of the entire system could take weeks. If instead you script the installation with Ansible, a rebuild can be as quick as 30 minutes (see the sketch below).

For most applications, you don’t need redundancy for the base system. It only adds complexity and headache. A cheap physical or cloud server can easily serve hundreds of users. Why worry about complex setups (RAID configuration, redundant VMs, DB clusters, etc.)? Keep it simple, keep it clean, and focus your time and energy on things that bring real value to your organization.
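To make that concrete, here is a rough sketch of what such a rebuild playbook might look like for a simple web application. The host group, package names, paths, and the restore script are all placeholders, not a prescription; the point is that the entire install, configure, and data-restore procedure lives in version-controlled code instead of in someone’s memory.

---
# rebuild.yml: an illustrative rebuild playbook.
# Host group, packages, paths, and the restore script are placeholders.
- name: Rebuild application server from scratch
  hosts: appservers
  become: true

  tasks:
    - name: Install the required packages
      ansible.builtin.package:
        name:
          - nginx
          - postgresql
        state: present

    - name: Deploy the web server configuration from the repo
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Restore application data from the latest backup
      ansible.builtin.command: /usr/local/bin/restore-latest-backup.sh
      # hypothetical restore script; only the data is backed up, never the OS

    - name: Schedule the nightly data-only backup
      ansible.builtin.cron:
        name: "nightly data backup"
        minute: "0"
        hour: "2"
        job: "rsync -a --delete /srv/app/data/ backup.example.com:/backups/app/"

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Running ansible-playbook rebuild.yml against a freshly installed machine is the whole recovery procedure, and the playbook doubles as documentation of how the system is put together.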
Back up data, not systems.