Episodic Genius

occurring occasionally and at irregular intervals

Back it up!

How many times have you heard someone say that they weren’t able to adequately prepare for something because their computer had a problem and their data was lost? I’ve heard it a number of times. For as long as I can remember using computers, there has been the threat of data loss.

A few years ago, despite everything that I could do, I went looking for the original image file from my camera for a picture of my daughter. I couldn’t find it. I could find the original for every other picture that I looked for, but not the one I wanted. I never found it. I posted the one 4x6 print that I had on my refrigerator for a few years until it, too, got lost. Fortunately, that is the worst pain I’ve suffered from losing data.

I’ve devised a number of schemes to protect my data over the years. Right now, everything that I care about sits on a mirrored ZFS filesystem and gets replicated nightly to another filesystem. I actually wanted to talk a little about one of my schemes that I’ve since abandoned but I still think that it had some unique good qualities. Much of the rest of this post was written years ago on a wiki page of mine.

The basic idea is to run regular, automated backup cycles, each of which scans the entire contents of my file system and writes a state file that the next run uses to determine which files have changed. Each run reads the file from the previous run and compares the two, then calculates how much space the changed files would occupy on one or more DVDs. The unique part of the whole scheme was that it would take the free space left on the DVDs and pack it tightly with files from the oldest previous backups. This kept the backups freshly rotated.
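The scan-and-compare step might look something like this sketch in Python. This is not the original implementation; the state-file idea is from the scheme, but using a (size, mtime) pair as the change signal is my assumption:

```python
import os

def scan(root):
    """Walk the tree and record a (size, mtime) signature for every file."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_size, int(st.st_mtime))
    return state

def changed_files(previous, current):
    """Files that are new, or whose signature differs since the last run."""
    return sorted(
        path for path, sig in current.items()
        if previous.get(path) != sig
    )
```

The previous run's state would be serialized to disk (with `pickle` or plain text), so each cycle only needs one full walk of the file system plus one dictionary comparison.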

With all of this calculated, it prepared a DVD image and then burned it to a disk. This is where I came in. I’d get an email from the system and I’d grab the DVD and stick it in a sleeve while it burned a second copy of the same image.

There were a bunch of things that I liked about this scheme. Write-once DVDs are about as cheap as media comes. The burners for DVD media are also very inexpensive; I purchased mine for $35.

One thing I wanted to avoid here was the need to occasionally migrate data from one disk to another due to aging disks or insufficient capacity. I don’t do well with tasks like this that require more than a trivial amount of effort. I tend to put the task off and squeeze every last drop out of the aging disks, which only increases the risk of losing data.

This strategy incrementally backs up new or modified files. It also back-fills the space left on the disk with the oldest backups, which continually refreshes the media that I rely on. A typical two-week period at my home produces only hundreds of megabytes of new and modified files, so there is usually a lot of space left on each DVD. The result is that content on the older DVDs gets automatically refreshed onto new media rather frequently under my current load.
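The back-fill step is essentially a greedy fill: after the new and modified files are accounted for, the oldest previously backed-up files are added until the disk is full. A sketch of the idea (the usable-capacity figure and the list-of-(path, size) shape are assumptions on my part):

```python
DVD_CAPACITY = 4_300_000_000  # roughly the usable bytes on a single-layer DVD-R

def pack_disc(new_files, old_files):
    """new_files and old_files are lists of (path, size) pairs;
    old_files is sorted oldest-backup-first.
    Returns the file list for one disc image."""
    used = sum(size for _path, size in new_files)
    selected = list(new_files)
    # Back-fill the leftover space with the oldest previous backups,
    # refreshing them onto new media before they have a chance to age.
    for path, size in old_files:
        if used + size <= DVD_CAPACITY:
            selected.append((path, size))
            used += size
    return selected
```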

My tendency to put things off required that my implementation be as automated as possible to enjoy any success. My current strategy uses cron on my Linux machine to kick off each backup. No user interaction with the system is required at all for the machine to produce a DVD image that is ready to burn.

When the disk image is ready to burn, it gets written to a blank disk sitting in the burner. When it is finished and the disk has been verified using a checksum, the system sends me an email telling me to grab a sleeve, label it with a label that it computes, and put the DVD into the sleeve. I do this, remembering to replace the DVD in the drive with a new blank one. This, along with the chore of carrying the newly burned disk to my safe, is the only manual intervention required.
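Verification after the burn can be as simple as hashing both the image and the burned disk and comparing the results. A sketch of that check (SHA-256 and the `/dev/dvd` device path are my assumptions; the post only specifies a cryptographically strong hash):

```python
import hashlib
import os

def sha256_of(path, length=None, chunk=1 << 20):
    """SHA-256 of a file or block device, optionally limited to `length`
    bytes, since a burned disk can read back with trailing padding."""
    h = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while True:
            n = chunk if remaining is None else min(chunk, remaining)
            data = f.read(n)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
                if remaining == 0:
                    break
    return h.hexdigest()

def disk_matches_image(image_path, device="/dev/dvd"):
    """Compare the burned disk against the original image, byte for byte."""
    size = os.path.getsize(image_path)
    return sha256_of(image_path) == sha256_of(device, length=size)
```

If the hashes match, the email goes out; if not, the disk goes in the trash and the image gets burned again.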

Among the problems that I see with using a hard disk as backup media is that its size is fixed. Eventually, the time comes for an upgrade. The write-once DVD strategy can grow more easily with need. There are a number of ways that capacity can grow gradually without any major effort.

Letting the oldest disks stay in rotation a little longer before being refreshed is one easy way to grow capacity. I can grow capacity by several times this way without feeling like my disks are getting too old.

If that fails me, I could easily double the frequency of my backup runs without breaking a sweat. This would double capacity while keeping the maximum age of the oldest media the same. I could also triple the frequency. The only burden on me would be the increased number of trips to my house safe and my other backup location to store the new backup DVDs.

Another way to grow capacity is to move to larger media. While higher-density media is currently quite a bit more expensive, the price is dropping and will likely continue to drop.

There are a number of inherent safety features built in to my strategy.

The data gets written to write-once media and verified using a cryptographically strong hash of the entire contents. Each disk is then stored in a safe location. The media is disconnected from everything mechanical or electrical, which makes it immune to mechanical failure and electrical events. Your home lost power? No problem. Your DVD drive failed? No problem (except that you have to fork out money for a new one). There was some kind of harmful power event on your lines? No problem. The number of catastrophic events that can affect your data is not zero, but it is limited drastically by using write-once media in this way.

Backups are immutable. This is an extremely powerful concept. Once I’ve written a disk image to media and stored it, there is nothing that can go wrong in future backup cycles that can destroy that data. I cannot accidentally delete a backup of a file, nor can I overwrite the contents of a backup with new, corrupted contents. I am 100% protected from making any error that would erase or modify an existing backup unless I physically damage the disk (both copies).

Redundancy is very easy. Currently, I burn two copies of each image. I store one in my home in a relatively safe location. I store the other several miles away from my home in a location that is convenient for me to access regularly.

Backups don’t get old. The contents of the oldest disk image are used to back-fill extra space on newer images. This way, the oldest backup images are continuously obsoleted before they have a chance to age. I usually keep old disks around well past the point of obsolescence, which lets me keep many full snapshots of my data to further increase the redundancy and usefulness of the backups. For example, if I want to revert to an older version of a file, I can start from an older snapshot of the data. This is limited by my backup frequency of once every two weeks, but it is still very useful.

The only way to hack into this type of backup is to obtain possession of the actual physical media and insert it into a DVD drive. This risk is more acceptable to me than adding a second (and possibly a third) network-accessible system that contains all of my personal data.

All of the software needed to restore is either available on almost any Linux system or included on each backup DVD. Once the disk is mounted, a full restore can be performed with tools like awk, rsync (or plain cp), and a shell script. The restore script on each DVD will accomplish this task without requiring any special knowledge.
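The original restore was a shell script using rsync or cp, but the core idea is simple enough to sketch in a few lines of Python: walk the disks newest-backup-first and take each file from the first disk that contains it, so newer versions win. This is my own illustration of the approach, not the script from the DVDs:

```python
import os
import shutil

def restore(mounted_disks, dest):
    """mounted_disks: mount points ordered newest backup first.
    Copies each file from the newest disk that contains it into dest."""
    seen = set()
    for disk in mounted_disks:
        for dirpath, _dirs, files in os.walk(disk):
            for name in files:
                src = os.path.join(dirpath, name)
                rel = os.path.relpath(src, disk)
                if rel in seen:
                    continue  # a newer disk already supplied this file
                seen.add(rel)
                target = os.path.join(dest, rel)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(src, target)  # preserves timestamps like cp -p
```

Since each disk also carries the oldest back-filled files, a full restore only needs as many disks as it takes to cover every file at least once.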

It has its drawbacks. One generally desirable feature of backups is storing them in a location that is geographically far away from the source. That way, if something catastrophic happens to one, the other is unlikely to suffer in the same way.

I have thought about finding some sleeves like the ones Netflix uses to mail DVDs and mailing them to a relative. However, this would cost more and would require sending sensitive data through the mail. I could encrypt the data, but that would require maintaining an encryption key that would have to be accessible whenever I need to read from the backups.

I know that I also listed this same thing in my list of pros. However, it can be inconvenient to restore from backups that span 10-20 disks that aren’t readily accessible on the network, even if the whole process has been automated.