There are many ways to manage your computers/infrastructure. One is to back everything up, use RAID disk arrays, expensive redundant systems, hot-swap VM technology, etc. The idea is that the server or system never goes down. The problem with this approach is that there are many more failure modes than disk or hardware failure. An admin could accidentally delete some files (oops, that is not where I meant to run rm -rf). Hackers could get in, and then you have no choice but to rebuild to be sure the system is clean. Operating systems rot over time (some more than others). The more complex a system is, the more failure points there are, even if you have redundancy.
The alternative is to design for complete system failure. Most applications can tolerate a couple of hours of downtime if it is rare. So the alternative approach is to keep things simple, focus on automating installation/configuration, and back up only the data. The idea is to be in a state where you can rebuild any system in a reasonably short amount of time. One of the side benefits is that it forces you to completely document your installation process. If you have a “golden” machine where the entire system is redundant and backed up, you tend to get sloppy. Tweaks and hacks accumulate over time, and nobody knows how the machine got into its current state. A rebuild of the entire system could take weeks. If instead you script the installation with Ansible, a rebuild can be as quick as 30 minutes (see the sketch below).

For most applications, you don’t need redundancy for the base system. It only adds complexity and headache. A cheap physical or cloud server can easily serve hundreds of users. Why worry about complex setups (RAID configuration, redundant VMs, DB clusters, etc.)? Keep it simple, keep it clean, and focus your time and energy on things that bring real value to your organization.
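To make that concrete, here is a rough sketch of what such a rebuild playbook might look like for a simple web application. The host group, package names, paths, and the restore script are all placeholders, not a prescription; the point is that the entire install, configure, and data-restore procedure lives in version-controlled code instead of in someone’s memory.

---
# rebuild.yml: an illustrative rebuild playbook.
# Host group, packages, paths, and the restore script are placeholders.
- name: Rebuild application server from scratch
  hosts: appservers
  become: true

  tasks:
    - name: Install the required packages
      ansible.builtin.package:
        name:
          - nginx
          - postgresql
        state: present

    - name: Deploy the web server configuration from the repo
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

    - name: Restore application data from the latest backup
      ansible.builtin.command: /usr/local/bin/restore-latest-backup.sh
      # hypothetical restore script; only the data is backed up, never the OS

    - name: Schedule the nightly data-only backup
      ansible.builtin.cron:
        name: "nightly data backup"
        minute: "0"
        hour: "2"
        job: "rsync -a --delete /srv/app/data/ backup.example.com:/backups/app/"

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Running ansible-playbook rebuild.yml against a freshly installed machine is the whole recovery procedure, and the playbook doubles as documentation of how the system is put together.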
Back up data, not systems.