
Why Mirroring Is Not Backing Up

Until shortly before this page was written, there was a large blog site called journalspace. It has now ceased to exist in any meaningful sense. The cause of the disaster is not known, but the reason it proved fatal, rather than merely costing the admins a late night, was the lack of any meaningful backup solution. The following quote is from http://journalspace.com/this_is_the_way_the_world_ends/not_with_a_bang_but_a_whimper.html

Tuesday:

Journalspace is no more.

DriveSavers called today to inform me that the data was unrecoverable.

Here is what happened: the server which held the journalspace data had two large drives in a RAID configuration. As data is written (such as saving an item to the database), it's automatically copied to both drives, as a backup mechanism.

The value of such a setup is that if one drive fails, the server keeps running, using the remaining drive. Since the remaining drive has a copy of the data on the other drive, the data is intact. The administrator simply replaces the drive that's gone bad, and the server is back to operating with two redundant drives.

But that's not what happened here. There was no hardware failure. Both drives are operating fine; DriveSavers had no problem in making images of the drives. The data was simply gone. Overwritten.

The data server had only one purpose: maintaining the journalspace database. There were no other web sites or processes running on the server, and it would be impossible for a software bug in journalspace to overwrite the drives, sector by sector.

The list of potential causes for this disaster is a short one. It includes a catastrophic failure by the operating system (OS X Server, in case you're interested), or a deliberate effort. A disgruntled member of the Lagomorphics team sabotaged some key servers several months ago after he was caught stealing from the company; as awful as the thought is, we can't rule out the possibility of additional sabotage.

But, clearly, we failed to take the steps to prevent this from happening. And for that we are very sorry.

So, after nearly six years, journalspace is no more.

If you haven't yet, visit Dorrie's Fun Forum; it's operated by a long-time journalspace member. If you're continuing your blog elsewhere, you can post the URL there so people can keep up with you.

We're considering releasing the journalspace source code to the open source community. We may also sell the journalspace domain and trademarks. Follow us on twitter at twitter.com/jsupgrades for news.

This is a perfect example of the difference between mirroring and backups. Both make copies of your data elsewhere. The major difference is a concept called “fate sharing”: mirrored copies share the fate of the original, so whatever is done to one is done to all of them. In mirroring of any kind, be it RAID, database clustering, or cloud storage, all the components stay in near-real-time sync unless one of them fails outright. If one of the hard drives in a RAID array suddenly goes offline, the remaining ones carry on, hopefully alerting the sysadmin to replace the failed one.

But what about something that is perfectly normal as far as the storage layer is concerned, but catastrophic at the application or OS layer? In a mirroring scenario, that disastrous command is faithfully replicated to every copy, ensuring that the destruction is complete.
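The difference can be sketched with a toy model. This is only an illustration, not how any real RAID or replication layer is implemented: `MirroredStore` and `take_backup` are hypothetical names, and each "drive" is just an in-memory dictionary.

```python
import copy


class MirroredStore:
    """Toy two-way mirror: every operation is applied to all live replicas."""

    def __init__(self):
        # Two in-memory "drives" (an assumption made purely for illustration).
        self.replicas = [{}, {}]

    def write(self, key, value):
        for replica in self.replicas:
            if replica is not None:
                replica[key] = value

    def fail_drive(self, index):
        # Hardware failure: one replica is lost, the other keeps serving.
        self.replicas[index] = None

    def delete_all(self):
        # A valid command as far as the storage layer is concerned,
        # so it is replicated to every live copy.
        for replica in self.replicas:
            if replica is not None:
                replica.clear()

    def read(self, key):
        for replica in self.replicas:
            if replica is not None and key in replica:
                return replica[key]
        return None


def take_backup(store):
    """Point-in-time copy that later commands on the store cannot touch."""
    live = next(r for r in store.replicas if r is not None)
    return copy.deepcopy(live)


if __name__ == "__main__":
    store = MirroredStore()
    store.write("post-1", "hello world")
    backup = take_backup(store)

    store.fail_drive(0)
    print(store.read("post-1"))   # hello world (the mirror survives a dead drive)

    store.delete_all()
    print(store.read("post-1"))   # None (the bad command reached every copy)
    print(backup.get("post-1"))   # hello world (the backup did not share that fate)
```

The mirror handles the drive failure without trouble, but the destructive command reaches every live replica. Only the copy taken earlier and held outside the mirror still has the data.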

In a real backup solution, though, the backups are stored on a medium that is completely independent from the original data. The canonical example is a complete set of tapes, stored in a firesafe located 50 miles from the servers being backed up. Completely isolated from the protected systems, they are safe from everything from accidental deletion up to a good-sized natural disaster. They are even quite resistant to the insider threat of a rogue system administrator, who may be able to screw up future backups but will find the ones already in the safe quite difficult to get to.

Good mirroring setups will certainly help you in scenarios where previously your only solution was to haul out the tapes. With a good RAID array, a drive failure that used to mean downtime for a server now just means pulling a spare off the shelf and taking a small performance hit while the array rebuilds. Putting your resources in a cloud can even mean surviving the loss of an entire site, if you have everything distributed properly.

But sooner or later, you're going to get hit by some little gremlin, be it a bad command or a bug, that will happily ride that mirroring pipeline to where it can eat every copy of your data. When that happens, your only hope will be that pile of dusty old tapes in the corner.

Have you verified your backups lately?
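One way to answer that question is to record checksums when the backup is taken and compare them against a test restore. The following is a minimal sketch under that assumption, using only the Python standard library. The function names (`file_digest`, `build_manifest`, `verify_restore`) are made up for this example and do not belong to any particular backup tool.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum at backup time."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return a list of (relative_path, problem) pairs; empty means the restore matches."""
    restored_root = Path(restored_root)
    problems = []
    for relative_path, expected in manifest.items():
        candidate = restored_root / relative_path
        if not candidate.is_file():
            problems.append((relative_path, "missing"))
        elif file_digest(candidate) != expected:
            problems.append((relative_path, "checksum mismatch"))
    return problems
```

The manifest would be built from the live data when the backup is taken and kept alongside the backup. Later, the backup is restored to a scratch directory and `verify_restore` is run against it. A non-empty result means the backup cannot be trusted as it stands.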

horror_stories/mirroring_is_not_backing_up.txt · Last modified: 2010/03/15 16:20 by fs