Windows Application and Server Backup 2.0

Introduction

The first backup—technically—was around 1951, when the first generation of digital computing appeared in the form of UNIVAC I. The "backups," such as they were, were the punch cards used to feed instructions to the massive machine. Once computers began to use more flexible forms of storage, reel‐to‐reel magnetic tape began to replace punch cards. In 1965, IBM introduced the first computer hard drives, although through the 1970s, these devices remained impractically expensive to use for backups. Floppy disks came into use in 1969—an 8‐inch monster storing just 80 kilobytes of data. Recordable compact disks became available in the early 1990s, and flash drives became common in the early part of the 21st century. Shockingly, magnetic tape—the second‐oldest form of backup storage—is still in use today. Figure 1.1 shows a timeline of data backup storage (excerpted from www.backuphistory.com), and you can see that tape is still alive and well—and has been for almost 50 years.

Figure 1.1: Backup storage timeline.

There's an interesting parallel to be drawn here: Despite numerous technical advances in storage, we continue to rely on one of the oldest mediums to store backups. The same applies to our backup techniques and procedures: Despite advances in how we perform backups, we tend to still use the same decades‐old techniques, albeit wrapped up in pretty new tools.

Throughout computing history, backups have been practical, simple procedures: Copy a bunch of data from one place to another. Complexities arise with "always‐on" data like the databases used by Exchange Server and SQL Server, and various techniques have been developed to access that form of in‐use data; however, backups have ultimately always been about a fairly simple, straightforward copy. Even magnetic tape—much more advanced than in the 1960s, of course—is still a primary form of storage for many organizations' backups.

I call it "Backup 1.0"—essentially the same way we've all been making backups since the beginning of time, with the only major changes being the storage medium we use. Although many bright engineers have come up with clever variations on the Backup 1.0 theme, it's still basically the same. And I say it's no longer enough. We need to re‐think why we do backups, and invent Backup 2.0—a new way to back up our data that meets today's business needs. Surprisingly, many of the techniques and technologies that support Backup 2.0 already exist—we just need to identify them, bring them together, and start using them.

The Philosophy of Backup

Let's start with the question, "Why do we back up?"

I suppose the first answer that comes to mind is simple enough: So that we don't lose any data. But that's not actually an accurate answer. Personally, I don't back things up just so they will never be lost; I back them up so that I can continue using them. That's a subtle difference, but an important one. If you're only concerned about never losing data, Backup 1.0 is probably sufficient: Copy your data to a long‐term storage medium—probably magnetic tape—and stick it in a vault somewhere. You could call it archiving and be more accurate, really. But most organizations aren't as concerned about archiving as they are about making sure that data remains available, which means you not only need a backup but also a means of restoring the data to usability. So the real answer, for most organizations, is more complicated: So that our data remains available to us all the time.

That's where Backup 1.0—the backup techniques and technologies we've all used forever and are still native to operating systems (OSs) like Microsoft Windows—can really fail. Making a copy of data is one thing; putting that copy back into production use is often too slow, too complicated, and too monolithic. And that's where Backup 1.0 fails us. As we start considering Backup 2.0, and what we need it to do, we need to bear in mind our real purpose for backing up. Ultimately, we don't care about the backing up part very much at all—we care about the restore part a lot more.

Why Backup 1.0 Is No Longer Enough

Our decades‐old backup techniques are not sufficient anymore. They may be great for creating backups—although in many cases, they aren't even good for that—but they do not excel at bringing the right data back into production as quickly as possible. Despite advances in specialized agents, compressed network transmissions, and so forth, we're still just making a copy of the data, and that doesn't always lend itself well to restoring the data. Why?

Backup Windows

One problem is the need for backup windows, periods of time in which our data isn't being used very heavily, so that we can grab a consistent copy of it. Consistency is critical for backups: All the data in any given copy needs to be internally consistent. We can't back up half a database now and the other half later because the two halves won't match. As our data stores grow larger and larger, however, getting a full backup becomes more and more difficult.

Microsoft's TerraServer, which stores and provides access to satellite photographs for the entire United States, has a data store in excess of 1 terabyte, and even with fairly advanced backup hardware, it still takes almost 8 hours to back it all up. That's a total data throughput of 137GB per hour—but if that data were in constant use, it would become less practical to make a complete copy on a regular basis.
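To put those numbers in perspective, here's a back-of-the-envelope calculation (a quick Python sketch of my own; the sizes and throughput figures are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope math: how long does a full backup (or a full
# restore) take at a given sustained throughput? Figures are illustrative.

def hours_to_copy(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Hours needed to stream data_gb at a sustained throughput."""
    return data_gb / throughput_gb_per_hour

# Roughly the TerraServer figures cited above: ~1.1 TB at ~137 GB/hour.
print(hours_to_copy(1100, 137))   # about 8 hours for the full backup

# The same math bites on restore day: a 1 TB database pulled back from
# tape at, say, 60 GB/hour keeps you waiting most of a full work shift.
print(hours_to_copy(1000, 60))    # about 16.7 hours
```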

As a workaround, we commonly use differential or incremental backups. These allow us to grab a lot less data all at once, making it easier to make our backup copy. The problem is that these ignore the real reason we made the backup in the first place—to enable us to restore that data. Consider a common backup approach that uses SQL Server's native backup capabilities:

  • Sunday, full backup
  • Every weekday evening, a differential backup
  • Every hour during the day from 8am to 5pm, a log backup

Those weeknight differentials grab everything that has changed since the last full backup (as opposed to an incremental, which grabs everything that has changed since the last full or the last incremental). If something goes wrong on Friday at 4:05pm, there's a lot of data to restore (the sketch after this list walks through how that restore chain gets assembled):

  • Last Sunday's full backup—which will be fairly large
  • Thursday's differential—which will also have grown quite large
  • The log backups from Friday at 8am, 9am, 10am, 11am, noon, 1pm, 2pm, 3pm, and 4pm—that's nine files in all, although each will be fairly small
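To make that restore chain concrete, here's a small sketch of the selection logic in plain Python (nothing SQL Server-specific; the catalog and timestamps are hypothetical): find the most recent full backup, the most recent differential taken after it, and every log backup taken since that differential.

```python
from datetime import datetime

# Hypothetical backup catalog for the scheme described above.
# Each entry is (backup type, time it was taken); the dates are made up.
catalog = [
    ("full", datetime(2010, 1, 3, 23, 0)),   # Sunday night full
    ("diff", datetime(2010, 1, 7, 23, 0)),   # Thursday night differential
] + [("log", datetime(2010, 1, 8, h, 0)) for h in range(8, 17)]  # Friday 8am-4pm logs

def restore_chain(catalog, failure_time):
    """Return the backups to restore, in order, for a point-in-time recovery."""
    full = max(t for kind, t in catalog if kind == "full" and t <= failure_time)
    diffs = [t for kind, t in catalog if kind == "diff" and full < t <= failure_time]
    diff = max(diffs) if diffs else None
    base = diff or full
    logs = sorted(t for kind, t in catalog if kind == "log" and base < t <= failure_time)
    chain = [("full", full)] + ([("diff", diff)] if diff else [])
    return chain + [("log", t) for t in logs]

# A failure on Friday at 4:05pm means restoring one full backup, one
# differential, and nine log backups, in exactly that order.
for kind, taken in restore_chain(catalog, datetime(2010, 1, 8, 16, 5)):
    print(kind, taken)
```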

Figure 1.2 illustrates these different backup types.

Figure 1.2: SQL Server backup types.

That's a ton of work—and a ton of waiting while tapes and hard drives spin, restoring all that data. Sure, we won't have lost much—just 5 minutes' worth of work—but on a large database (say, a terabyte or so), you could easily be waiting for 16 hours or more.

And that's if the backups all work. Tape drives and even hard drives are not immune to corruption, errors, and failures, and one of the most common stories in the world is the administrator who realized that the backup tapes were no good—and realized it while trying to restore from one of them. We all know that we should test our backups, but honestly, do you do it? Out of more than 300 consulting clients I've worked with in the past 10 or so years, one of them had a regularly‐scheduled plan to do test restores. One. Less than one percent. Why?

Well, the one customer who did regularly-scheduled test restores had a dedicated administrator who did almost nothing else. A full test restore of their environment, using Backup 1.0-style techniques and technologies, would take the entire IT team about a week to perform. That one administrator could test-restore various systems, one at a time, over the course of a month—and then start over. I think that pretty much answers the question about why so few people test their backups. It's simple: Too much data for full backups results in workarounds such as differentials and incrementals that contribute to lengthy restore times, which is why we never bother to test—and why, when the rubber hits the road and we need those backups, nobody is happy about how long it takes to grab the needed data.

Non‐Continuous

Backup 1.0 has another major weak point: It's always a point in time. A snapshot. Noncontinuous, in other words. Consider one approach to backing up Active Directory (AD), which I see a lot of my customers using:

  • Full backup of every domain controller's System State on the weekends. This is a quick, fairly small backup—even an exceptionally large domain can be backed up in a few minutes.
  • On one or two domain controllers, a twice‐daily backup of the System State. Again, this operation is quick.

The problem is that you always stand to lose a half-day's worth of work because you're only taking "snapshots" twice a day. What if you had just imported a couple of hundred new users into the domain, added them to groups, and assigned them file permissions on various file servers? Losing that work not only means you have to start over, it also means you've got orphan Security Identifiers (SIDs) floating around on all those file permission Access Control Lists (ACLs), which you'll then have to clean up as well.

Businesses tend to design backup strategies around the question "How much data are we willing to lose, and how much work are we willing to repeat?" There's often far less consideration of "How quickly do we want to restore the data?" and, much to my irritation, almost nobody thinks to answer, "We don't want to lose any data or repeat any work!" Backup 1.0 has conditioned us to accept data loss and repeated work as inevitable, and so we design backup schemes that trade off between the inevitable loss of data and the amount of backup resources we want to devote. Frankly, the attitude that data loss and repeated work simply have to be accepted is nonsense. I can't believe that, a decade into the 21st century, we're all still so complacent about it.

Just the Data—Not the Application

Another problem with Backup 1.0 is that we tend to back up just the data—databases, files, System State, or whatever—and rarely back up the applications that use that data. I took a brief survey on my blog at ConcentratedTech.com, and about 95% of the respondents said they don't back up applications because they keep the original installation media so that they can always reinstall the application if necessary.

Really? Let's think about that: Microsoft Exchange Server takes about 45 minutes to an hour to install properly. Then you have to apply the latest service pack—another half‐hour or so—and any patches released since the service pack—call that another 20 to 30 minutes. Then you can start restoring the data, which may take another few hours. If you're rebuilding an entire server, of course, you'll have to start with reinstalling Windows itself, and its service pack and patches, which will add another 2 or 3 hours to the process. The total? Maybe a full work day. And for some reason, people find that acceptable—because Backup 1.0 is all about archiving, really, not restoring.

One valid counter-argument is that most restorations are for just the data, or even a part of the data (like a single database table or a single email message), and aren't a full-on disaster recovery rebuild. Well, okay—but does that mean it's still acceptable for a full-on disaster recovery rebuild to take a full day? Typically not, and that's why some organizations image their servers, using software that takes a snapshot of the entire hard drive and often compresses it to a smaller size. Imaging has been used for years as a deployment technique, and it works well for backing up an entire server, but often not for restoring it. Snapshot, or point-in-time, images take time to produce, and the server may even have to be shut down to make an image—meaning you'll need frequent maintenance windows. A traditional snapshot image won't contain the latest data, so even after restoring the image, you still have to rely on traditional backups to get your most recent data back. It just amazes me that we accept these limitations.

Disaster Recovery Is Too Inflexible

Closely related to the previous discussion is the fact that people do backups for two different reasons. Reason one, which I think is the more commonly cited, is to restore small pieces of data. You've doubtless done this: Restored a single file that someone deleted by accident, or an email message, or a database table. Nearly everyone has dealt with this, and it isn't difficult to "sell" this reason to management when acquiring backup technologies.

The second reason is for what I called "full-on disaster recovery" in the previous section. This is when an entire server—or, goodness help you, an entire data center—is lost and has to be restored on-site or at a different location. Thankfully, this level of disaster is quite rare, which also makes it a tough sell to management if the organization hasn't encountered this type of disaster in the past.

The ultimate problem is that Backup 1.0 technologies lend themselves to one scenario or the other—but not usually to both. In other words, if you have a product that does great bare-metal recovery, it may not handle single-item recovery as well. Some products compromise and do an okay job at both—backing up the entire server to enable bare-metal recovery (and often providing bootable CDs or other techniques so that you can initiate a bare-metal recovery), and then keeping a separate index of every backed-up piece of data to make single-item recovery easier. Frankly, I've not used many solutions that do a great job at both tasks—and the fact that they're all essentially snapshot-based still makes them pretty limited. I never want to have to agree that losing a certain amount of work is acceptable.

Backup 1.0: The Verdict

If you're just archiving data, Backup 1.0 is pretty awesome. It starts to fail, though, when you need to restore that data, and want to do so in an efficient manner that enables both single‐item and full‐on disaster recovery restoration. The snapshot‐oriented nature of Backup 1.0 means you're always at risk of losing some work, and that snapshot‐oriented nature also imposes rigid requirements for maintenance and backup windows—windows that might not always be in the business' best interests.

So let's rethink backup. I want to go back to basics and really define what backups should do. Consider this definition a wish list for Backup 2.0.

Backup Basics

I have no doubt that you're a pretty experienced administrator or IT manager, and you might not think that "backup basics" is a particularly enticing section. Bear with me. I'll try to keep each of the following sections succinct, but I really want to step back from the existing technologies and techniques to focus on what people and businesses—not software vendors—want their backup programs to do. We've been doing backups more or less the same way for so long that I think it's beneficial to just forget everything we've learned and done and start over without any assumptions or preconceptions.

Why Back Up?

We've covered this topic pretty well, but let me state it clearly so that there's no confusion:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

That statement implies a few basic business capabilities:

  • When a problem occurs, we want to experience as little data loss as possible
  • We need to be able to recover data as quickly as possible
  • We place as much importance on recovering a single piece of data as on recovering from a complete disaster

The statement also means a few things traditionally associated with Backup 1.0 probably aren't acceptable:

  • Snapshots that grab only a certain point‐in‐time image are less desirable
  • Any system that is weighted toward disaster recovery or toward single‐item recovery is less desirable—we need both capabilities
  • Any system that requires lengthy, multi‐step restore processes is less desirable
  • Backups that do not lend themselves to some form of physically protected storage are less desirable
  • Backups that require hours and hours to complete will require hours and hours to restore—both of which are less desirable

So given why we back up, we can take a fresh look at what we back up.

What Do You Back Up?

Even the relatively primitive backup software included with Windows Server 2003 (and prior versions of Windows) understood that you don't always need to back up everything every time you run a backup. Figure 1.3 shows how that utility allowed you to select the items you wanted to back up or restore—a user interface (UI) duplicated in some form by most commonly‐used backup software.

Figure 1.3: Selecting what to back up or restore in Windows Backup.

So what do you back up? On any given server, you have many choices:

  • Back up the entire server—every file on every disk
  • Back up just data—shared files, application databases, and so forth
  • Back up applications and their data—including the applications' executable files and settings
  • Back up the OS and its settings but not any application files or any data

The permutations are practically limitless. If you're simply after archiving—creating backups, that is—then anything that grabs the data is fine. In fact, just grabbing the data is probably all you need to do if all you want to do is create a point‐in‐time snapshot for archival purposes. Sure, you could back up the entire server—and if your goal is to be able to handle a full‐on disaster or recover individual items, then backing up the entire server would provide both capabilities. But backing up the entire server would probably take a lot longer. It might not even be entirely possible because there will always be some open files, running executables, and other items that Backup 1.0‐style backup techniques can't get to. So maybe backing up the entire server isn't really practical. After all, we can always reinstall the OS and any applications from their original media, right?

Wait a second—let's go back to why we're backing up:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

Okay, this statement clearly indicates that we need to grab the data, but that last phrase—"…have access to our data with as little downtime as possible"—adds something important. In order to access our data, we need the OS and associated applications to be up and running! A data backup is useless without an application and OS to restore it to. I've already explained why rebuilding a server using the installation media is so slow—you have to perform lengthy installs, then install service packs and patches, and then restore your data.

This tells me, then, that our backup must be of the entire server. After all, I'm not just here to archive my data—I also need to be able to restore it quickly, and get access to it quickly, and that means I need to restore the OS and any applications quickly. So regardless of practicality—for now—I'm going to say that backing up the entire server is the only way to go. I just need to figure out how to do it quickly.

When Do You Back Up?

How often will I be making backups? Under Backup 1.0, this was a real question. Often, you might take a full backup during an evening or weekend maintenance window, then grab smaller backups more frequently. When I started out in IT, I was an AS/400 operator. Every evening, we made tape backups of our most important data files from the AS/400; on weekends, we ran a full backup of the entire system. Backing up the data files took a few hours, and we had to do it in the evening after pretty much all the work for the day was finished because the data was unavailable while the backups were running (in fact, we pretty much kicked everyone off the system while backups were being done). The weekend full backups could take a full day, and everyone had to be offline then.

But maintenance windows are a Backup 1.0 concept, so let's disregard them. In fact, let's review our whole reason for being here one more time:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

Right there is my answer: "…from losing any data or losing any work…." That tells me the Backup 1.0 method of point‐in‐time snapshots is useless because no matter how often I'm making incremental or differential backups, I'm still going to be at risk for losing some work or data, and that's not acceptable.

So if you ask, "When will you make backups?" I have to answer "Always." Literally—continuous backups. In fact, the industry already has a term for it: continuous data protection. Although specific techniques vary, you can think of this—very roughly—as being similar to RAID 1 drive mirroring. In a RAID 1 array, as Figure 1.4 shows, the drive controller writes blocks of data to two (or more) hard drives at once. Disk 2 is a complete, block-level backup of Disk 1. Restoring is fast—simply swap the two if Disk 1 goes belly-up.

Figure 1.4: RAID 1 disk mirroring.

Of course, RAID 1 is good for a certain type of scenario but it isn't practical for all situations, and it doesn't meet all our backup requirements. For one, server disks are still pretty expensive. In a server using a large number of disks, mirroring every one of them isn't practical. Some organizations will set up a RAID 5 array and then set up a second array to mirror the first—but that can be incredibly expensive. A further problem is that the backup disks coexist with the primary disks, so the backup disks are still at risk for damage due to fire, flood, and other physical threats. Further, a mirror is only a backup for the current condition of the primary disk: You can't "roll back" to a previous point in time. So mirroring alone is a great tool for certain situations, but it isn't a complete backup solution. What it is, though, is a good idea for how to make continuous backups. We just need to leverage the technique a bit differently.
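To see why mirroring alone isn't a backup, here's a toy sketch (my own illustration, not how any real controller works) of RAID 1-style block writes. Notice what's missing: history.

```python
# Toy illustration of RAID 1-style mirroring: every block write lands on
# two "disks" at once. Overwriting a block destroys the old contents on
# BOTH copies, so a mirror by itself can't roll you back in time.

disk1 = {}   # block number -> data
disk2 = {}   # the mirror

def mirrored_write(block_no: int, data: bytes) -> None:
    """Write the same block to both disks in lockstep, RAID 1 style."""
    disk1[block_no] = data
    disk2[block_no] = data

mirrored_write(42, b"version 1 of some file data")
mirrored_write(42, b"version 2 overwrites it")   # version 1 is now gone from both disks

print(disk1[42] == disk2[42])   # True: a perfect copy of the CURRENT state, and only that
```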

What Type of Backup Will You Use?

Full backup? Differential? Incremental? I have to say none of these, because those Backup 1.0 terms are associated with point-in-time snapshots, and we're not going to use those. Instead, we're going to use a Backup 2.0 technique, borrowing from the RAID 1 concept of block-level mirroring. As part of continuous data protection, we'll use what I'll call block-level backups. Figure 1.5 shows how it might work.

Figure 1.5: Agent-based block backups.

Here, a software agent of some kind runs on the server. As blocks are changed on disk, this agent transmits those blocks to a backup server, which stores them. That server might store blocks for many different servers. The agent would likely tap into a very low level of the OS to accomplish this; a rough sketch of the idea appears after the list below. The benefits:

  • We can have nearly real‐time backups of all changes, as they happen
  • We get the entire server, not just the data
  • We can store more than just the current blocks; in fact, we can store changes as far back as we want, meaning we can restore any given file—which consists of multiple disk blocks—to any point in time we want
  • We can restore the entire server by simply writing all the latest backed‐up blocks to a server—either the original server or a replacement
  • In the event of corrupted blocks, we might still be backing those up—but we've also got older, non-corrupt versions of those same blocks, so we can potentially roll corrupted files back to their most recent non-corrupted state
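Here's a very rough sketch of the agent idea in Python. It's purely conceptual (a real agent hooks the storage stack with a low-level filter driver and streams blocks across the network to a dedicated backup server), but the essential bookkeeping, keeping every version of every changed block, looks something like this:

```python
import time

# Conceptual sketch only. The "backup server" here is just a dictionary
# that keeps every version of every changed block; that retained history
# is what makes point-in-time restores possible.

backup_store = {}   # block number -> list of (timestamp, data) versions

def on_block_write(block_no: int, data: bytes) -> None:
    """Called (conceptually) whenever a block changes on the protected server."""
    backup_store.setdefault(block_no, []).append((time.perf_counter(), data))

def block_as_of(block_no: int, point_in_time: float) -> bytes:
    """Reconstruct a block's contents as of any moment we have history for."""
    versions = [data for t, data in backup_store[block_no] if t <= point_in_time]
    return versions[-1]   # the newest version at or before that moment

# Simulate a couple of writes to the same block...
on_block_write(7, b"mailbox database page, healthy")
checkpoint = time.perf_counter()
on_block_write(7, b"mailbox database page, corrupted at 4:05pm")

# ...then "roll back" the block to how it looked before the corruption.
print(block_as_of(7, checkpoint))   # b"mailbox database page, healthy"
```

A real solution adds compression, de-duplication, and durable storage, but the core idea is the same: keep the block history, and any point in time becomes restorable.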

This is powerful magic, but it's a reality: Today's market includes solutions that follow this technique. It delivers Backup 2.0: continuous backups that are designed for restores, not for archiving.

Where Do You Store Backups?

Do you need offsite storage of your backups? Probably yes. In 1996, Paris‐based Credit Lyonnais had a fire in their headquarters. Administrators ran into the burning building to rescue backup tapes because nothing was stored off‐site. Let me write that again: Ran into a burning building. Folks, fire is a constant possibility, as is the possibility of damage from floods (bad plumbing anyone?) and other disasters. If the data is worth backing up, it's worth keeping copies somewhere else. At the very least, have some sort of on‐site storage that's disaster‐proof—a waterproof fire safe, for example. Does that mean you have to use magnetic tape? No, but you probably will, simply because it's relatively inexpensive, fairly durable, and easy to work with. You'll likely end up using tape in conjunction with something else, in fact, with tapes being the sort of last‐resort place to restore from. The point is this: Don't assume that a major disaster will not strike you. Past performance is no guarantee of future results; just because you've never been hit by a disaster before doesn't mean you won't be hit by one eventually. That's why people buy insurance policies, and backups are basically a form of insurance.

In terms of Backup 2.0, we might combine our block‐based backups with some tape‐based storage. Our backup server, for example, might periodically dump all its backed‐up blocks to a tape array, allowing us to carry a snapshot offsite for archival purposes and to protect against a total disaster to our data center.

Should You Test Backups?

This is a trick question. The answer is, "Of course." As I've already explained, though, few folks actually do. Why? Well, there are really a few reasons, many of which are related to our Backup 1.0 mindset.

First, as I mentioned earlier, is the time commitment. Spending hours doing a test restore isn't in most folks' budgets these days. Of course, a block‐based restore can actually be done more quickly: You're streaming the restore from disks over a high‐speed network, not reading them ever‐so‐slowly from a tape drive.

Second, there's the availability of hardware. Now, if you're testing single‐item recovery, most backup solutions will allow you to restore files to any location you want, so you just need a small spot on an existing file server. But you should also be testing full‐on disaster recovery, where you restore a server that was completely lost (say, to fire) to a different piece of hardware. The problem is that many Backup 1.0‐style solutions require you to restore to identical hardware, meaning you have to have a lot of spare servers around. Not gonna happen. A good solution will let you restore to dissimilar hardware, which is actually more practical from a disaster recovery perspective; an ideal solution will let you restore to a virtualized server, which is absolutely perfect. So yes: Test your backups. Regularly. Perform both single‐item restores and the type of bare‐metal restore you'd associate with a total disaster, ideally utilizing modern virtualization technologies to eliminate or reduce the need for extra "test restore" hardware.

Virtualization: More than Just Consolidation

As I'll discuss in later chapters, virtualization is an integral part of the Backup 2.0 mentality. In Backup 1.0, a full‐site disaster might mean retreating to a special‐purpose, leased offsite recovery facility and starting a lengthy restore process using your most recent off‐site backup tapes.

In Backup 2.0, that "facility" might live on the Internet and consist of one or more virtualization hosts, each running a dozen or so virtual servers. You stream your latest backups over the wire to the virtual servers, performing a "bare virtual metal" recovery rather than recovering to actual, physical machines. This approach makes it more practical to recover a set of servers, makes that recovery faster and cheaper (no leased facilities), and makes it more practical to conduct occasional test runs of a complete disaster scenario.

Are You Keeping an Eye on Your Backups?

Do you monitor your backups? Other than just checking an error log, I mean? You should. In fact, checking error logs—aside from being incredibly boring—is the kind of old‐school Backup 1.0 mentality I'm trying to change in this book. A modern, Backup 2.0‐style solution should alert you to problems, and ideally might even integrate with an existing monitoring solution, such as Microsoft's System Center Essentials or System Center Operations Manager.

What might those alerts actually inform you of? Primarily, problems with the backups themselves, such as corruption. Nothing's worse than suddenly finding you need your backups and then realizing that they're no good due to corruption; you should be informed of corruption as soon as it occurs. A good Backup 2.0-style solution might inform you via email, might drop something in a Windows event log (which System Center Operations Manager could then pick up and raise to your attention), or might use some similar style of notification.
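As an illustration of the kind of check and notification I'm describing, here's a sketch that uses nothing but the Python standard library. The file path, mail host, and addresses are placeholders, and a real Backup 2.0 product would do this for you internally rather than leaving it to a script:

```python
import hashlib
import smtplib
from email.message import EmailMessage

def backup_checksum(path: str) -> str:
    """Recompute the SHA-256 checksum of a stored backup file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def send_alert(subject: str, body: str) -> None:
    """Email an alert; the mail host and addresses below are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "backups@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

# Hypothetical nightly verification: compare today's checksum against the
# one recorded when the backup was written, and complain immediately.
backup_file = r"D:\Backups\exchange-store.bak"       # placeholder path
recorded = "<checksum recorded at backup time>"      # placeholder value
if backup_checksum(backup_file) != recorded:
    send_alert("Backup verification FAILED",
               f"Checksum mismatch on {backup_file}; investigate before you need this backup.")
```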

Figure 1.6 shows how System Center Operations Manager can be used to configure an alert for a given type of event log entry—such as an event log entry created by your backup solution, alerting you to corruption. Note that System Center Essentials works similarly for this type of alert.

Figure 1.6: Creating an event log entry alert in SCOM.

A really "with it" backup solution might even come complete with a System Center Management Pack. A Management Pack is basically a preconfigured set of rules for monitoring and alerting on a specific application; the pack tells System Center Operations Manager what to look for (such as specific event log IDs) and what to do (such as sending email alerts). But if your backup solution doesn't come with a Management Pack, at least make sure that it has its own built‐in features for providing these types of notifications.

Approaches to Backing Up

Having advocated for block‐level backups, I want to take a step back and briefly review the entire gamut of backup possibilities. Although some of these don't meet my requirements for a good backup and recovery system, they nonetheless offer some business value that you should be aware of.

File‐Based Backups

File-based backups are the oldest type of backup, and probably still the most common. They involve simply taking a snapshot—a point-in-time copy—of one or more files. These types of backups may have difficulty working with open files, and they don't capture every change made to a file—they just grab a copy of it at a specific time.

But there's value here. For example, Windows' built-in Volume Shadow Copy feature is essentially an on-server file-based backup, grabbing copies of files as users change them and storing them in a local cache on the server. Users can access these cached file versions on their own, using Windows Explorer's Previous Versions feature, as Figure 1.7 shows.

Figure 1.7: Accessing previous versions.

The key phrase here is on their own. Unlike many data center-class backup solutions, Volume Shadow Copy / Previous Versions is designed for user self-service. Properly used (meaning your users will need a bit of training), this feature can help prevent calls to the Help desk—and overhead for you—when users need to roll back a file to a somewhat older version. In fact, I've seen this feature—again, with a bit of end-user training—reduce single-file recovery Help desk calls by 90% in several of my clients. Those organizations tell me that the average cost of completing a file recovery request is about $75, and the ones that keep really good records average about four calls a week. Eliminating 90% of those calls saves roughly $14,000 a year—essentially for free, since the feature is built into Windows (well, you do have to spend a bit extra for the disk space needed to store the cache).

In a Backup 2.0 world, there should be room for complementary solutions. Previous Versions meets a specific need: user self‐service for individual file rollback. Whatever data center backup solution you select shouldn't interfere with complementary solutions, and in fact, should embrace them. Where file‐based backups tend to fall short—as I've discussed—is in the whole‐server backup scenario, which you'd be performing in the data center.

Image Backups

Image backup is another term for what I've been calling block-based backups. There are really two ways to achieve an image-style backup: by using solutions such as the tried-and-true image snapshot software and via what I'll call "streaming" images.

Traditional imaging software, often used for deploying operating system images to new computers, commonly makes point‐in‐time snapshots of a disk, typically compressing the disk blocks so that the resulting image file is much smaller. This software isn't usually positioned as a backup and recovery solution but rather as a deployment solution: You make a "template" computer the manual way, image it, and then deploy that image rather than installing other computers manually. It's most commonly used for desktop deployment, and it can serve as a kind of last‐ditch recovery tool for desktops, essentially re‐deploying the image to get a machine back to "square one." But as a point‐in‐time snapshot, it's not useful for capturing data, changes to applications, and so forth.

Another weak spot is that snapshot imaging software usually requires the source computer to be unavailable. Typically, the image is captured using a pre‐boot environment (like the one shown in Figure 1.8)—a sort of stripped‐down OS that runs in lieu of Windows. This ensures that none of the files on disk are open and changing so that a single, consistent snapshot can be gained. You can see where this would be a bit burdensome as a continual backup technique.

Figure 1.8: Imaging software's pre-boot environment.

"Streaming" images are the kind of block‐based backup I illustrated in Figure 1.5, earlier in this chapter. This technique is the basis for almost real‐time, continuous data protection of multiple servers. You could use the technique with desktop computers, too, although I suspect doing so wouldn't be practical—it would involve a lot of backup data flying around your network, not to mention a lot of storage. No, I think this technique is really geared best for the data center, where you're backing up servers and where you can easily set up high‐speed or even dedicated network connections between those servers.

Application‐Specific Ideas

Some applications—primarily mission‐critical, always‐on applications such as Microsoft SQL Server and Exchange Server—present their own challenges for backup and recovery. These applications' executables are always running, and they always have several data files open, making it difficult for file‐level backup software to get a consistent snapshot.

To help address this, the applications' designers take varying approaches. SQL Server, for example, has its own internal backup and recovery capability, which is tied to the product's own unique architecture. Traditionally, the best way to get a SQL Server backup is to ask SQL Server to do it. You might, for example, use SQL Server's own tools to produce a backup file, then grab that file with a traditional file-based backup solution (there's a sketch of that approach after the figure below). Or, you might create an agent that taps into SQL Server and gets the data that way—the approach used by most enterprise-level, Backup 1.0-style backup solutions. Figure 1.9 shows a typical configuration dialog box from a backup solution, indicating that a SQL Server-specific agent is loaded and able to stream data from SQL Server to the backup software. Exchange Server functionality might work similarly—in fact, you can see that tab in the figure as well.

Figure 1.9: Application-specific backup agents.
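For the first approach, asking SQL Server to produce its own backup file and letting a conventional backup job sweep up the result, a minimal sketch might look like the following. It assumes the third-party pyodbc package and a SQL Server ODBC driver are installed; the server, database, and path names are placeholders, not anything from this chapter.

```python
import pyodbc

# Ask SQL Server to write its own native backup file; a conventional
# file-based backup job can then pick up the resulting .bak file.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=SQL01;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,   # BACKUP DATABASE cannot run inside a user transaction
)
cursor = conn.cursor()
cursor.execute(
    "BACKUP DATABASE [Sales] "
    "TO DISK = N'D:\\Backups\\Sales_full.bak' "
    "WITH INIT, CHECKSUM, STATS = 10;"
)
while cursor.nextset():   # drain the progress messages so the backup completes
    pass
conn.close()
```

Whether you drive it yourself this way or let a backup agent do it, it's still SQL Server producing the backup, which is exactly the Backup 1.0 pattern this section describes.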

Exchange Server's developers took a slightly different approach, choosing to integrate with Windows' Volume Shadow Copy service. Essentially, they provide a copy of the Exchange data files through Volume Shadow Copy; a backup solution simply needs to access the Volume Shadow Copy Application Programming Interfaces (APIs) and request the "latest copy" of the database. Again, it's Exchange Server that's doing most of the work, but a dedicated agent of some kind is usually needed to get to the right APIs. As Figure 1.10 shows, even Windows Server 2008's built‐in backup can be extended to "see" the Exchange Server databases.

Figure 1.10: Backing up Exchange Server in Windows Server Backup.

The downside is that these application‐specific approaches are still Backup 1.0 in nature, meaning they're grabbing a snapshot. You're still at risk for losing data and work that occurs between snapshots; particularly with these mission‐critical applications, I think that's just unacceptable.

Block-level backups can certainly solve the problem because they're grabbing changes at the disk level and don't particularly need to "understand" what those disk blocks are for. A disk block that's part of a file looks the same as one that's part of a SQL Server database, so the backup solution just grabs 'em all. But from a recovery viewpoint, your backup solution does need some additional smarts. Here's why: A simple file—say, a Word document—consists of several blocks of disk space. It's easy to keep track of which blocks make up any given file, and no disk block will ever share data from two files. If you need to restore Salaries.xls, you figure out which blocks that file lives on, and restore just those. Easy.

With complex data stores—such as SQL Server and Exchange Server—things aren't so easy. A single mail message might occupy multiple blocks of disk space, but those same blocks might also contain data related to other messages. The database also has internal pointers and indexes that need to be restored in order for a given message to be accessible. So a block-based backup doesn't need much in the way of extra smarts to make a backup, but it will need some cleverness in order to restore single items from that backup. Solution vendors tend to approach this by using plug-ins: It's easy to think of these as being similar to the Backup 1.0-style agents, but they're not. These plug-ins don't necessarily assist with the backup process (although they may record special information to assist with recoveries), but they do contain the smarts necessary to peer "inside" complex data stores to recover single items.
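Here's a simplified sketch of the bookkeeping that makes single-file restores easy, along with a note on why complex stores need that extra plug-in intelligence. The block contents and index are, of course, invented for illustration.

```python
# Why single-FILE restores from a block store are easy: the backup keeps an
# index of which blocks belong to which file, so restoring Salaries.xls is
# just "fetch these blocks, in this order."

block_store = {                      # block number -> captured contents
    101: b"<first block of the file>",
    102: b"<second block of the file>",
    103: b"<third block of the file>",
}

file_index = {                       # file -> ordered list of its blocks
    r"F:\Finance\Salaries.xls": [101, 102, 103],
}

def restore_file(path: str) -> bytes:
    """Reassemble a single file from its backed-up blocks."""
    return b"".join(block_store[block] for block in file_index[path])

print(restore_file(r"F:\Finance\Salaries.xls"))

# A complex store (an Exchange mailbox database, say) breaks this simple
# model: one block can hold pieces of many messages, plus indexes and
# pointers. The backup still just captures blocks, but restoring a SINGLE
# message takes a plug-in that understands the database format well enough
# to find and reassemble the right pieces from inside those blocks.
```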

Our Backup 2.0 Wish List

The following list highlights desirable capabilities in a Backup 2.0‐style backup solution:

  • Continuous data protection that's always on, always working, and as close to real time as is practical
  • The ability to move snapshots of my backups to tape (or other portable media) for off‐site storage
  • The ability to restore anything from a single file to an entire server—quickly, and to different hardware (or virtual hardware) if needed
  • Block‐level imaging that provides the ability to roll back to any point in time and keeps backups physically separate from the source
  • Automatic notifications of problems—such as corruption—with the backups
  • The ability to coexist with, rather than interfere with, complementary solutions such as Windows' own Volume Shadow Copy feature
  • The ability to easily test restores, from a single file to a complete disaster, ideally using dissimilar hardware or virtualization
  • The ability to restore single items from complex stores like SQL Server or Exchange Server

I'll add a few things to this list as we progress through the upcoming chapters, but this is a good start. It represents everything that the Backup 2.0 philosophy is all about, and it meets all the implied requirements in our business-level statement:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

What's Ahead

I've got a full plate coming up for you, starting with the next chapter in this book, where I get to share some of the horror stories I've run across with my consulting clients, in the news, and so forth. It's kind of interesting to see the problems others have had, but it can be instructional, too: I'll examine each case and draw conclusions about what went wrong and what you would need to do to avoid that situation.

Chapter 3 is where I'll dive into the actual technology of whole‐server backups. This is an area where I think "Backup 2.0" really has some immediate benefit, but I'll start by examining more traditional whole‐server backup techniques and identifying things that don't always work so well. If you're responsible for backing up domain controllers, infrastructure servers, Web servers, a public key infrastructure, or similar servers, then this is the chapter for you.

In Chapter 4, I'll tackle the tough topic of Exchange Server backups—tough enough that the backup software that shipped with Windows Server 2008 couldn't even do it. Again, I'll lay out some of the more traditional ways that Exchange backups have been made, and then rethink the process and come up with a wish list of backup capabilities that includes several Exchange-specific concerns. Chapter 5 will follow the same approach for SQL Server, and Chapter 6 will examine SharePoint in the same way.

Chapter 7 will be a bit of a departure, as I'll focus on virtualization server backups. This is still a relatively new field, and I'll look at ways in which traditional backup techniques are being used and examine how well they're actually getting the job done. I'll cover some of the unique aspects of virtualization backups and examine innovative techniques that Backup 2.0 can bring to the table.

In Chapter 8, I'll pull back a bit for a broad look at other backup concerns and capabilities: Bare‐metal recovery, data retention concerns, compliance and security issues, mobile infrastructure problems, and so on. I'll also look at instances where old‐school backups might still provide some value, and I'll offer advice for integrating Backup 2.0 with more traditional techniques.

All of the backups we're making are going to require some serious storage, so I'll use Chapter 9 to focus on storage architecture. I'll look at how backup data is structured, compare the advantages and disadvantages of things like storage area networks (SANs), tape drives, and local storage, and examine pressing issues of storage: compression, encryption, security, de-duplication, and so on. I'll also look at unique ways that Backup 2.0 allows you to interact with your backed-up data more easily and efficiently.

Chapter 10 will focus on disaster recovery—and I mean real disasters. I'll look at things like bare-metal recovery, and I'll cover some of the more interesting capabilities that today's technologies offer, such as using virtualization—rather than dedicated off-site facilities—as part of a disaster recovery plan.

Chapter 11 is for all the business-minded readers out there; this chapter is where I'll discuss the costs involved in re-architecting backup and recovery to use Backup 2.0 techniques. Naturally, I'll also help you determine whether doing so is actually worth it to your organization, and even tackle some of the non-technical—what I like to call "political"—issues that you may have to solve in order to make your backup situation more modern and efficient.

Finally, Chapter 12 will be the chance for me to share stories from my own experiences with Backup 2.0—a sort of "tales from the trenches" chapter, with case studies that describe successes (and challenges) I've seen with Backup 2.0.

We have a long journey ahead of us, but you've made a good start. I look forward to seeing you again in the next chapter.

Horror Stories—We Thought We Had a Backup!

Horror stories. Tales from the trenches. Case studies. Call them what you will, I love reading them. They're a look into our colleagues' real‐world lives and troubles, and an opportunity for us to learn something from mistakes—without having to make the actual mistakes ourselves. In this chapter, I'm going to share stories about backups to highlight problems that you yourself may have encountered. For each, I'll look at some of the root causes for those problems, and suggest ways that a modernized "Backup 2.0" approach might help solve the problem. Some of these stories are culled from online blog postings (and I've provided the original URL when that is the case), while others are from my own personal correspondence with hundreds of administrators over the years. One or two are even from my own experiences in data centers. Names, of course, have been changed to protect the innocent—and those guilty of relying on the decades‐old Backup 1.0 mentality.

The important takeaway is that each of these stories offers a valuable lesson. Can you see yourself and your own experiences in these short tales? See if you can take away some valuable advice for avoiding these scenarios in the future.

Corruption: Not Just for Politics

I'll start with this story, as it's one I imagine every administrator has been able to tell at least once in their careers.

This must be the oldest backup story in the world, and you have probably heard it a million times: We dutifully take our backups every night, even though it takes forever on the big machines. When we finally need to use one, the backup is corrupted. Either it's the tape, which is actually pretty rare, or something went wrong and the backed up data itself is no good. And so the boss wants to know why we even have jobs since we are all useless. Aaaah!

As I said, we've all been there. Two problems, both related to the old‐school Backup 1.0 mentality:

  • Backups shouldn't take forever; they should be taken constantly. That is, a point-in-time backup is just a snapshot. Even if the backup and restore work perfectly, you're still losing data. With continuous backups, you're not losing data—it's really that simple.
  • Why don't backup programs tell you when they see corrupted data? This should be the simplest thing in the world—our industry has had checksums and other means of validating data pretty much forever. It drives me crazy that we have to discover problem backups by poring through log files.

This is why we have to rethink backups: They just don't do what we need them to do. Even when we do everything perfectly, there's a huge risk that the backups just won't work. We need to rethink what we're doing, how we're doing it, and how we're monitoring it to make sure it works.

Of course, I won't mention that testing those backups might have revealed our author's problem before he actually needed those backups. Another problem with the Backup 1.0 mentality is that nobody (well, hardly anybody—some of the upcoming stories do give me hope) seems to test their backups.

Crackberry Withdrawal

If ever one device has managed to somehow make email even more mission critical than it already was, Research in Motion's Blackberry family must be it. When something goes wrong with the Blackberry infrastructure, it's a race to see what you spend more time on: fixing the problem or answering the phone calls from users who are sure they're the only ones who've noticed the problem:

I work in an environment that—like everyone, I guess—uses Blackberries and cannot live without them for one moment. The infrastructure is actually very complex: In addition to our Exchange Server, there's also the Blackberry Enterprise Server (BES) and a supporting SQL Server. The last time we had a failure, it took us 5 hours to get back online, which I thought was pretty good—but there was an inquisition afterwards to find out why we were offline for so long. Even with all the right tape backups, do you have any idea how long it takes to restore a SQL Server and the BES? Apparently too long for my boss.

It's the Backup 1.0 mentality: Rely on those backup tapes, even though just streaming the data off of tape can take hours—assuming it's all in good shape, not corrupted, and that the data on the tape is actually a good backup in the first place. This is exactly where Backup 2.0 methodologies can make a difference: With a disk block-based backup, streamed into the backup store in almost real time, you can restore an entire server to almost any point in the recent past by pushing a button. Does 5 minutes sound better than 5 hours? By restoring into a standby virtual machine, 5 minutes is entirely realistic.

Exchange Failure

Nobody likes it when the Exchange Server goes down. Honestly, I think some companies could go without a file server for longer than they could live without email—especially companies whose employees have Blackberries. So here's a story to move your emotions:

When my company's Microsoft Exchange Server failed at the end of the quarter, it could not have happened at a worse time. It began with the VP of Sales yelling "Email is down, and customers can't send us their orders!" Then my Blackberry started going off, calls, emails, IMs—it was relentless. When I logged on to the Exchange Server, I found that some of my most critical mail stores were no longer mounted. When I tried to remount them, I received the ambiguous yet ominous JET -1601 JET_errRecordNotFound error message. I immediately connected to the replication server that runs at one of the company's remote sites, only to find that I couldn't mount those mail stores either.

When I called Microsoft, technicians prescribed the standard procedure of running Eseutil. They warned me, however, that the error message probably indicated a corruption problem deep within the database and that running Eseutil might result in cleaning the stores of all user data. I took the leap, on the chance that it would be quicker than getting the restore process underway. Running Eseutil took hours, then failed with the even more ambiguous JET -1003 JET_errInvalidParameter. At that point, I knew I HAD to go to the backup.

My company runs full backups every Saturday night and incremental backups the rest of the week. I started by recovering the most recent full backup, then applying the incrementals until I had the backup from the night before the failure. As you can imagine, the calls, emails, etc. kept coming all the while I was copying the mail stores from my disk to disk backup—although they did taper off a bit after 11:00pm, when our West Coast office closed.

Once our data was back on the primary server, it was time to roll the logs and mount the database. However, when the logs were about 80 percent applied, they failed with the JET -501 JET_errLogFileCorrupt message. At that point, Microsoft support could only suggest running Eseutil through my entire log chain, noting the corrupted log, deleting anything except log files from the log directory, and deleting the corrupted log and all the logs created thereafter. Then I could finally restart the log roll operation from scratch. This procedure took more than 6 hours. In the end, my company lost 2 days of email messages, and recovery took more than 30 hours. The cause turned out to be a problem with the RAID controller driver that had taken months to manifest itself after a previous server upgrade.

Executive management figured it cost the company about $50K, so they definitely wanted to know what had happened and how it could have been prevented—and how it would be prevented from happening again. Let's just say "wanted to know" means that if I didn't have a good answer, my name was going on the top of the next layoff list. I was seriously committed to finding a better recovery solution.

So here's what I learned on my worst day as a network admin: You can have multiple copies of your data—on replicated servers, on disk, and on tape—but if you can't mount the copies, they aren't any good.

The Backup 1.0 mentality cost this company $50,000. Why? Because the Backup 1.0 mentality focuses on making backups, not restoring data. With Backup 1.0, we tend to focus on backup windows, tape drives, and so on—we don't tend to focus on what will happen in the event of a disaster. Even with their backups and 30 hours of effort, the company still lost 2 days of emails—this is a backup plan?

Exchange Server is certainly a complex and difficult product when it comes to backups. The back‐and‐forth between the Exchange Server product team, the Windows product team, and Microsoft's own backup products (in the System Center family) means that Exchange and Windows alone don't offer an effective backup plan. Third‐party vendors, however, tend to focus on Exchange‐specific agents that just make copies of the data as handed over by Exchange. As this horror story points out, Exchange might not always be handing you good data, meaning your backups are useless.

So how would Backup 2.0 change things? We'll cover Exchange Server backups in detail in Chapter 4 of this book, but for now, let's just compare this situation to my wish list from the previous chapter:

  • This fellow's situation could have been improved immeasurably by having continuous backups rather than point‐in‐time backups of log files and databases.
  • Block-level imaging, which grabs the changes as they hit the server's hard drives, would have provided an uninterrupted stream of backup data all the way to the point where Exchange's data files were corrupted. The server could quickly have been "rolled back" to that point in time.
  • Exchange Server's native backup technologies could still have been used, if desired.
  • Our poor admin could have tested his restore capabilities more frequently, been more familiar with the process, and likely spent far less than 30 hours getting his server back online.

Remember: Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible. The Backup 1.0 mentality originally employed in this horror story certainly didn't meet any of those criteria.

Migrating the Cluster—Or Not

Sometimes, "disaster" doesn't always mean a failed server. Sometimes a solid lack of planning can provide all the disaster you need!

We received the new servers for the new cluster. The job: Swap out an old Windows cluster for a brand new one, configure it, and move all the files and data over to the new cluster. This has to be done within a late-night maintenance window, which means my wife will not be happy again, but I'll buy flowers. Start time is around 11:00pm, and it must be done by 7:00am—8 hours.

Both clusters would eventually be running the same version of Windows, once I got everything installed. The cluster servers came with no OS installed at all, though, so installing Windows would be my first step. Fortunately, the server hardware was basically the same in both clusters.

After driving 300km—a horror driving on Polish roads—I set up the server hardware. The old cluster is humming along next to me, and I'm ready to get Windows installed. "Where are the drivers?"

And the problem arose: Someone had lost the server manufacturer's drivers disc. I know what you're thinking—just download them, right? Well, suffice to say that this is a very secure organization—perhaps governmental—and nobody in the building at that hour could actually get to the Internet. So I had to pull out my mobile phone and, despite the high cost of data transfers, download the drivers for the server over the 3G cellular network. Then my phone's battery died—and me without my charger.

I asked them about server backups, figuring we could perhaps just use those to restore to the new machine, but all they back up are the files. They said they had never been able to do a bare‐metal restore using their backups, so they just stopped backing up the operating system (OS).

Internet access was available again at 8:00am, and someone had to take me into the secure area where Internet access was available. I was tired, the entire night was wasted, and I had to do it all again the next night when I finally had the drivers in hand.

Ouch! We've all had a late night like that at some point, and it's never fun. The first thing I have to ask myself is why they couldn't simply take a backup of the old cluster machines and restore it onto the new hardware. Because in the Backup 1.0 world, restoring to dissimilar hardware is often impossible, or at least "not recommended." As a result, many organizations just back up their files—but a backup is useless without someplace to restore it. A more modern Backup 2.0 mentality would treat this cluster migration as no different from a bare-metal restore after a complete cluster failure. In the Backup 2.0 world, we'd be taking block-level backups of the entire server, so we could simply apply those backups to the new hardware (which, we're told, was substantially similar to the original) and call it a night. Less time on the job, a lower cell phone bill, and a less frustrated wife.

Virtually Exchange

It's becoming more and more common to run server software inside virtual machines, often on a virtualization host running VMware or Microsoft's Hyper‐V. But you'd be surprised how this scenario might impact your disaster recovery situation:

After just 2 weeks on my new job, our Exchange Server decided to operate in "comatose" mode and stop sending and receiving email. I found that it was running on a virtual server and was being backed up daily by a third‐party software package. I figured I was fine.

The backup administrator, it turns out, really had no idea about how Exchange worked or what, if anything, was needed to complete the backups properly. The backups had never been tested, so I was starting to think fondly of my old job.

I was able to get the server partially operational but wound up having to call the software vendor—only to get a refund for my support call and a terse statement that they didn't (at the time) support Exchange running in a virtual environment. Oops.

I wound up having to port my half‐functioning system to a physical server and was eventually able to recover everything except for 2 days' worth of email, which does not endear you to your new bosses, let me assure you.

"Not supported in a virtual environment" is kind of a cheap trick for a support organization to pull, but it happens. That's why it's critical to work with software and backup solutions that explicitly support virtual environments. In a Backup 2.0 world, virtualization is expected; anyone who thinks that virtualization is still novel or unusual is a dinosaur.

Of course, it's also critical to test those backups, and frankly the backup solution—once installed and configured—should just work. If there's something "special" needed to make a valid Exchange backup, the solution should know about it and take those steps automatically—that's why it's called a solution, not an additional problem. The right vendor will have all the knowledge they need on their own staff, and should make a backup tool that does the job you paid for. It should make testing those backups easy—especially in a virtualized environment where you don't even have to come up with another physical machine to do a test restore on!

For Want of a Check Box

Windows' native backup capabilities are, and always have been, pretty primitive. They're also firmly entrenched in the Backup 1.0 mentality of "just make a copy of the necessary data." Sometimes, that mentality can really bite you where it hurts:

It was late 2006, and I was working for a branch of the US government. One afternoon, we began receiving large numbers of calls at the Help desk that people were having trouble logging into a particular Active Directory (AD) domain. Now, this particular domain was used for certain special projects, and only had two domain controllers, but they did support about 1000 users—all of whom were pretty upset that they couldn't log on and access the resources they needed.

Both domain controllers, it turned out, had somehow been corrupted. I never figured out what happened, but I was even having problems logging on to the console directly on either computer. So along with a couple other admins, I decided to just do a full restore of the entire domain. After all, that's what backups are for, right?

To my horror, I discovered that while the server was indeed being backed up every single night, nobody had ever checked the "System State" check box to ensure that AD was backed up too. So the backups were essentially useless.

It took four administrators 96 hours, working around the clock in shifts, to rebuild that domain by hand. We slept an hour or two at a time at our desks, then got back up and started working again. We started on a Tuesday and didn't leave until early Saturday morning—all wearing the same clothes we'd worn in on Tuesday. Somebody at some point had the presence of mind to go pick up some toiletries for us, but it was the longest Tuesday I can remember—all because one stupid check box wasn't checked.

This drives right back to the Backup 1.0 mentality: Only back up what we absolutely need.

That attitude comes from a few facts of life in a Backup 1.0 world:

  • "Backups take a long time." We seek to minimize the time it takes to pull a backup, so we cut out "unnecessary" files, folders, and systems. In this case, the "unnecessary files" were unfortunately the very thing that was actually needed.
  • "Backup windows are limited." The very concept that we can only back up our data during certain dead or slow times is crazy, when you think of it. Why wouldn't we be backing data up while it's changing so that we can capture every change? What about users who might have just been created on the day of the failure—they won't be in the previous night's backup, so we lose all that work?
  • "The backups are on a schedule." You hope. Too often you find out—when it's too late—that the backup wasn't working like you thought. The problem is that the schedule is usually late at night when nobody's around to notice a problem—and nobody thinks to check every morning.
  • "We don't need to test our backups." This translates as, "it's too much of a pain in the neck to test our backups, and we don't have the time, so we're not going to do it." Testing, of course, is how you figure out that your backups aren't working, and find out before you actually need to rely on them.

A solid Backup 2.0 mentality changes things around:

  • "Backups are continuous." They don't "take a long time" because they're always running, grabbing every little change that comes down the line.
  • "Backup windows are unlimited." They're 24×7×365, in fact, because the "backup window" is always open, using a solution that grabs every change as it happens. This point, combined with the previous bullet point, means you just go ahead and back up everything. When you're not picking and choosing data to back up, you wind up always having whatever you need to do a restore.
  • "The backups aren't scheduled." They're always running, and you can always see that happening in real time.
  • "It's easy to test backups." Whether restoring to a virtual server or a physical one, testing backups should be easy and fast so that those tests can become routine. If you're not going to test your backups, why bother making them in the first place?

Weight Loss Plan

I was lead tech for a big server upgrade for a company that had to keep records for 7 years. We were imaging the old server data up on the network, then re‐imaging back down [to the new server].

Well, one day I thought I would multitask four machines at the same time and go to lunch. When I came back from lunch, I did a wipedisk of the old hard drives and noticed that one of the techs kept leaning over the workbench; his belly was so big that it would hit the space bar, which would cancel the data transfer.

To say the least, I found that he had done exactly that on the day I went to lunch. No one had any backups, so 7 years of data was gone.

I know it's not funny, but…it kind of is. The lesson to learn here, though, is that point‐in‐time images are really no better than an old‐style tape backup. Anything that is just making a copy of the data at a particular point in time is Backup 1.0 mentality, where we're more concerned with getting a copy of the data—and, as we've seen in this story, a lot of strange things can go wrong to interrupt that copy and make it useless.

So how would Backup 2.0 solve this problem? By having a continuous backup of the old server in the first place. There'd be no need to take an image during a server upgrade, and in fact, the upgrade could be done faster and more smoothly by just relying on that block‐by‐block, Backup 2.0‐style backup. An upgrade is really no different than a bare‐metal disaster recovery—just perhaps less urgent. So a good backup solution should be able to assist with a server upgrade or migration—all the more reason to have a good backup solution in place.

So, 640KB Isn't Really Enough for Everyone?

The amount of storage used by modern organizations is truly staggering. In fact, getting enough time—and network bandwidth—to back up all that data can often be impossible. Impossible, that is, when you're living in a Backup 1.0 world that's focused on point‐in‐time copies of data:

I am the only Storage Administrator for a medium‐sized enterprise. We have more than 280 terabytes of storage used on various vendors' equipment. When I first started, we were running all of our backups through a silo with LTO1 drives in it.

We couldn't ever back up all we needed to in the backup window we were given. We finally upgraded to LTO4 drives after several failed attempts to recover critical data and servers, along with complaints about network slowdowns, which, of course, came right to me.

A bit of digging revealed that the company was actually pulling weekly full backups of 30 to 45 terabytes, each of which required about 14 hours to complete. 14 hours! And obviously someone has to pick and choose the data that gets backed up, because they're only backing up about 16% of their data.

Once again, these are the most insidious points of the Backup 1.0 mentality: You have to grab backups at some fixed point in time, you only have a certain amount of time to work with, and you have to grab what you can during that window. Backup 2.0, by continually grabbing changes as they occur, can back up much larger data stores, keep the backups up to date at all times, consume less network bandwidth in doing so, and capture all your changes—not just the data you can grab in a 14‐hour window.

This story brings out some other weaknesses of Backup 1.0. Notice that the author started with LTO1 drives but eventually had to upgrade to LTO4 drives for the improved speed and reliability. Backup 1.0—by forcing us to live within backup windows—often forces us to spend a lot more on our backup infrastructure than is necessary. You could easily drop $15,000 on an LTO4 tape library equipped with a fast 4Gbps Fibre Channel interface—and every penny of that expense is simply to allow you to cram more data into your backup window. But if you expanded that window to 24×7×365 through continuous backup, you'd be able to capture everything—on significantly less expensive hardware.

SQL Server Syndrome

I've worked with SQL Server for many years (since v6.5, I think), and organizing backups has always been a challenge. My story is a lot like this one:

I have backups in three versions of SQL Server, in development, production, and testing environments, on multiple servers. Lots of backups, in other words—probably 45 to 50 instances total, with as many as 45 databases per instance.

We grab full backups every night, and transaction log backups every 15 minutes. So our maximum data loss, if everything goes right, is 15 minutes. We test our backups by restoring the full backups to our test environment. That helps keep the test environment closely matched to the production environment, which is valuable for the software developers, and helps us make sure those backups are working. We roll last week's backups into a different folder and grab those to tape.

We get failure notifications through email and System Center Operations Manager, and we rely on manual inspection of backup jobs and folders to make sure everything works.

Right now we're trying to play with different backup schedules and look at using differential backups, all to lessen the network traffic and the drive space occupied by the many full backups. However, we need to keep our recovery times short, so we're testing to see how much overhead the differentials add to recovery times.

Have you ever gone on vacation for a couple of weeks and forgotten to put a hold on your mail? You come back to an enormous pile of mail and spend hours going through it all. But dealing with that same mail spread out over the course of a week is much easier, right?

Backups are the same way. The Backup 1.0 mentality tells us to wait until the evening or the weekend to grab all of our data. That means we have to hammer the network hard to pull all that data over to the backup server quickly (after all, the time window is only so big). Wouldn't it be easier if we could just stream the changes over constantly, as they happen? Reading one postcard a day isn't a big deal; coming back to a stack of them at the end of the vacation is what's painful. "Streaming the changes" is what Backup 2.0 is all about.

Remember, too, the goal of backups: Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

Losing 15 minutes' worth of work, just because that's when you made your last transaction log backup, is nuts. Why should any data be at risk? Point‐in‐time backups always place data at risk; continuous backups don't. I want to revisit my Backup 2.0 wish list, from the previous chapter, one more time—and see how it applies to this case study:

  • Continuous backups eliminate backup windows—no more at‐risk data, no more pounding the network during a backup window.
  • Backed‐up data can be moved to tape on a daily basis for safekeeping without all the cumbersome rolling of files. Who became a DBA so that they could spend time managing files on disk?
  • Being able to restore either a single database or an entire server would provide the test environment with a lot more flexibility, in addition to being an excellent test of restore capabilities.
  • Block‐level imaging would allow any database to be rolled back to a point in time—without having to figure out what combination of differentials, full backups, and log backups is needed. What's more, block‐level imaging permits restoration to any point in time, whereas our author's transaction logs only provide "rollback" in 15‐minute increments (see the sketch after this list).
  • Still need those old‐school backup files for other purposes? Fine—a true Backup 2.0 solution won't interfere with them.
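To put a number on that granularity point, here's a minimal sketch (hypothetical code, not any vendor's tooling) comparing the worst‐case data loss of a periodic transaction log schedule with continuous, change‐level capture:

```python
# Hypothetical sketch: worst-case data loss (the "recovery point") under
# periodic log backups versus continuous change capture.
import random

def lost_work(backup_interval_min: float, failure_time_min: float) -> float:
    """Minutes of committed work lost if a failure hits at failure_time_min,
    assuming the last usable backup completed at the previous interval boundary."""
    if backup_interval_min <= 0:
        return 0.0  # continuous capture: changes are already on the backup server
    return failure_time_min % backup_interval_min

# A failure at some random moment during an 8-hour workday:
t = random.uniform(0, 8 * 60)
print(f"15-minute log backups: {lost_work(15, t):.1f} minutes of work at risk")
print(f"Continuous capture:    {lost_work(0, t):.1f} minutes of work at risk")
```

The arithmetic is trivial, but it is the whole argument in miniature: the amount of work at risk is exactly as long as the gap between backups.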

It just seems as if Backup 2.0, as a set of practices and capabilities, is much better suited to SQL Server than the old‐style Backup 1.0 mentality our author is relying on. We'll look at SQL Server in more detail in Chapter 5.

Patch Problem

This story really illustrates why I dislike point‐in‐time backups so intensely. Sure, you can achieve a lot of business goals using Backup 1.0 techniques and tools, but you have to be so very careful in order to get exactly what you want. Who needs that extra mental overhead?

I work in one of my company's larger data centers, and support about 100 servers. Most of these are file servers, but there are a couple of domain controllers, a few SQL Server machines, and three Exchange Server boxes.

We are very good about patching our computers. We typically will not apply a set of patches until we have a full backup of the computer's OS and application files, and we usually make a backup right after applying the patch, too. We tend to apply patches during maintenance windows when the servers aren't otherwise available. On some of our larger servers (the Exchange machines come to mind), it gets difficult to grab one full backup and apply patches during our 6‐hour maintenance windows (you try taking email away from people for longer), so sometimes we take a backup one night and apply patches the next, then take a second backup the following night.

The system works—but not always well.

I can recall a couple instances where Exchange patches have caused problems with some of our third‐party software, and we needed to roll back to the pre‐patch backup. Unfortunately, a whole day of work had passed since that backup was made, so we lost all that work. People become incredibly unhappy when email goes missing.

In at least one instance, we didn't realize a general Windows hotfix was causing problems for about a week. At that point, the pre‐hotfix backup was pretty aged. This was on a domain controller, so we decided to apply the old backup anyway, knowing that the domain would bring itself up to date through replication. Unfortunately, the backup also—we found out—had some deleted objects in the domain, which were near the end of their tombstone life. The practical effect was that about a dozen formerly‐deleted objects suddenly reappeared in the domain. Our security auditors freaked out, people were yelled at, and it actually took us a while to work out what had happened, since that's not a scenario you see every day.

We've since decided to rely less on backups for undoing patches. We've started spending more time testing patches, which is of course a good idea, but it's very boring and takes a lot of time we don't really have to spare. It also means our patches only get rolled out about every other month, rather than every other week, and I worry about what happens when one of those patches fixes some major security hole—and we have to leave the hole open for 2 months just because of our processes.

Again, the Backup 1.0 mentality has deeper‐reaching effects than just disaster recovery problems. In this instance, the company has actually decided to run out‐of‐date software for longer simply because of the way their backup processes work. Unbelievable. If ever there was a case of technology driving the business, rather than the other way around as it should be, this is it.

There are easy‐to‐recognize problems here, which should be familiar to you at this point:

  • Backup 1.0's point‐in‐time snapshots don't provide much granularity when it comes time to roll back something.
  • Backup 1.0's reliance on backup or maintenance windows took away some of the author's flexibility with regard to his Exchange infrastructure.
  • Except in a few special situations, Backup 1.0 tends to be an all‐or‐nothing proposition: either you roll back the entire server or you live with what you've got. There aren't many good ways to restore a single application; Backup 2.0, by contrast, can more easily pull out just the bits related to a specific application and restore it—with a single click.

Virtual Hot Spares

Virtualization is being used in more and more creative ways—and it's sad when Backup 1.0 can't keep up. Here's an excellent story about using virtualization to provide hot spares for critical computers—I can see this technique eliminating the old‐school rental facilities where you'd have a bunch of physical servers ready to act as standbys in the event of a disaster. But see if you can spot where Backup 1.0 methodologies wreck the elegance of the solution:

My organization used to rely on an outsourced "hot site" for disaster recovery. The theory is that, in the event our data center was hit by a meteor or something, we would relocate to this offsite facility. They have lots of servers handy, and we'd just restore our latest backups to those servers. Sure, we'd lose some data—but we wouldn't lose it all. We could then operate out of that site until our own data center was brought back online.

The cost for these facilities can be staggering, so we've recently constructed our own recovery center in one of our larger remote offices. We can't afford to buy all the servers we'd need, so we are relying on virtualization. We've identified our two dozen or so most important servers, and we have enough servers in the spare facility to virtualize all the critical servers. Performance won't be tops, but in the event of a disaster that serious, we're okay with the tradeoff.

To keep these hot spares current, we regularly take our critical servers offline for maintenance and do a physical‐to‐virtual (P2V) conversion, converting each physical server into a virtual machine in the spare site. We actually do the P2V conversion locally during the maintenance window, then copy the new virtual machine images later, because the files are of course huge and the WAN can't support giant copies like that very quickly.

In theory, that means our critical servers are always ready to go and are no more than a week or so out of date. Our plan would be to restore our more recent backups to each virtual machine to bring it even more up to date.

Great plan—almost. Having to pull servers offline to do a P2V migration is nuts. Why take servers offline at all? The bones of this method are a good idea, but the whole Backup 1.0 "snapshot" mentality—which most P2V migrations play into—is messing things up.

Consider this instead: Use a Backup 2.0‐style solution, which makes continuous, block‐level backups of your source servers without taking them offline. Restore those backups to an empty virtual machine—one without even an OS installed. This is essentially the same as a "bare‐metal" restore, just that the metal in question is virtual. Your backups will always be up to date to the latest changes, and you can do a weekly restore to your "hot spare" virtual servers so that they're ready to go at a moment's notice. Or, if your backups are being stored safely—so that a complete disaster in your data center won't also take out your backups—you could just do yet another bare‐metal restore when disaster strikes, and your hot spares will be virtually indistinguishable from the real servers.
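If you're wondering what "restoring to virtual metal" might look like under the hood, here's a simplified, hypothetical sketch: the newest backed‐up blocks are written at their original offsets into an empty raw disk image that a hypervisor could attach to the standby virtual machine. The block store, block size, and file name here are illustrative assumptions, not any particular product's format.

```python
# Hypothetical sketch: writing the newest backed-up blocks into an empty raw
# disk image that a hypervisor could attach to a standby virtual machine.
# block_store is assumed to map block number -> most recent bytes for that block.

BLOCK_SIZE = 4096  # assumed allocation-unit size

def restore_to_virtual_disk(block_store: dict[int, bytes],
                            image_path: str,
                            total_blocks: int) -> None:
    """Write every known block at its original offset; untouched blocks stay zeroed."""
    with open(image_path, "wb") as image:
        image.truncate(total_blocks * BLOCK_SIZE)   # empty "virtual metal"
        for block_number, data in sorted(block_store.items()):
            image.seek(block_number * BLOCK_SIZE)
            image.write(data)

# Example: two changed blocks dropped into a tiny demonstration image.
restore_to_virtual_disk({0: b"boot".ljust(BLOCK_SIZE, b"\0"),
                         42: b"data".ljust(BLOCK_SIZE, b"\0")},
                        "hot_spare.img", total_blocks=100)
```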

By the way, where to keep your backups so that they're safe is obviously a key part of the strategy—and it's something we'll examine in more detail in Chapter 9.

Is It a Server or a Photocopier?

This scenario outlines a problem that a lot of us have, but that we often don't think about. It's not uncommon for companies to exercise storage resource management on their file servers—prohibiting certain file types, for example, or restricting users to a certain maximum amount of disk usage. One oft‐quoted reason for this kind of storage resource management is that storage is expensive to manage and maintain—especially backups. But it's strange that even companies with the strictest disk quotas rarely exercise any kind of storage resource management on the backups themselves:

I recently started working for a new company and was pleased to see that they had a very well‐implemented Distributed File System (DFS) infrastructure. I don't like mapping network drives, and the users in this company had learned to access UNC paths like \\Company\Sales\Files\November rather than relying on the good old S: drive.

The company was also using DFS replicas to help users in remote offices—which often have slower WAN links—get to critical files. If you're not familiar with it, DFS uses Windows' File Replication Service to copy a given DFS leaf to one or more other file servers. So accessing \\Company\General\Policies might actually get you to any one of a dozen servers that all host that same content. In this organization's case, each local office file server has a copy of this and other commonly‐accessed file shares, like \\Company\General\Sales\Forms. The files in these shares aren't updated really frequently (maybe a couple files a day change), so there isn't really much replication traffic, and having local copies lets everyone get to the files without relying on the WAN.

Like any good company, we back up all our servers. Smaller remote offices actually have direct‐attached tape drives for this purpose. We've been using this model for years, but only recently have we started realizing that the backups weren't completing because the tape drives didn't have enough capacity.

I started looking into it, and realized that the company has been doing a lot of DFS replicas—about 50 to 60GB worth right now, and they're adding more all the time. This is the same data being backed up over and over again in different locations. All that duplicated data is what's causing the problem. I started trying to figure out exactly how much duplicated data we have so that I could make a business case to management. In doing so, I found that many of our file servers have multiple copies of the same files, all on the same server. A lot of the time it comes from users' home directories: the two servers that host the \\Company\Users DFS node have tens of thousands of duplicated files because users drag commonly‐used files into their own home folders.

We were backing up a total of 13.2TB of data, and close to 30% of it was duplicated data. Nearly a third! We're currently trying to figure out how to exclude the duplicates, but it's actually very tricky—we can't just not back up users' home folders!

Data duplication—the bane of storage resource managers everywhere. In fact, data deduplication is an incredibly hot new technology; so much so that industry giants like EMC, Dell, and HP battle it out over acquisitions they believe will put them at the forefront of this important new technology. So important, in fact, that I'm going to add a bullet point to my Backup 2.0 wish list:

  • Backups should not contain duplicate data—back up a given piece of data once, and only once.

This is a tall order, and it might only be possible to implement it to a degree. For example, a solution might de‐duplicate data across an entire server but allow duplicates of that same data to be backed up from other servers. Other solutions might take a broader view and de‐duplicate data across the entire organization—that would certainly be powerful, but it is a technologically difficult task.
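To make the idea concrete, here's a rough, purely illustrative sketch of block‐level de‐duplication by content hash; all the names are hypothetical, and whether the hash index spans a single server or the whole organization is simply a question of how widely you share it:

```python
# Hypothetical sketch: block-level de-duplication by content hash.
# Identical blocks, whether from DFS replicas or copies in home folders,
# are stored once and merely referenced everywhere else.
import hashlib

block_store: dict[str, bytes] = {}      # hash -> actual block data (stored once)
file_maps: dict[str, list[str]] = {}    # file path -> ordered list of block hashes

def backup_block(path: str, data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    block_store.setdefault(digest, data)         # only the first copy consumes space
    file_maps.setdefault(path, []).append(digest)

# Two "servers" backing up the same form template:
form = b"expense-report-template" * 100
backup_block(r"\\Office1\Forms\expense.doc", form)
backup_block(r"\\Office2\Forms\expense.doc", form)
print(len(block_store))   # 1: the duplicate cost no additional backup storage
```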

Horror Stories and the Lessons They Teach Us

These twelve stories obviously share common themes. Before I wrap up this chapter, I think it's worth spending just a few moments reviewing those themes, focusing on the takeaways (what we should learn from them), and reiterating what we should be doing instead to provide better backup, recovery, and other capabilities for our organizations:

  • Point‐in‐time backups are horrible. You're always going to lose some data if you have to rely on point‐in‐time backups, and who wants to lose data? Continuous backups, or continuous data protection, if you prefer, offer much more flexibility, less risk of loss, and generally less manual effort and overhead. Point‐in‐time backups also lack granularity—even those quarter‐hour SQL Server transaction log backups leave too much data at risk and don't let you roll back the server to a precise point in the past, if needed.
  • Backup windows are horrible. Taking servers offline is never going to add much value to the organization. Sometimes, for true maintenance, you have no choice; you do have a choice when it comes to backups. Backup windows force you to choose what data to back up—you can only grab what will fit within the window—place unnatural strains on the network, and drive a number of other constraining decisions that simply add no value to the business. Continuous backups don't require a window, so you are free to make better decisions without all the time constraints.
  • Duplicated data is horrible. Nobody likes that we have to do backups, so why back up anything more than once? Duplicated data bloats your backups, requires more time, and more importantly requires more space for backup storage—space that might be expensive (especially for offsite storage), slow (like tapes), and so on. Look for solutions that can automatically de‐duplicate data when making backups.
  • Anything less than the entire server is horrible. Sure, your worst nightmare might be losing a single email from the CEO, but that's far from the only nightmare you need to prepare for. Don't make decisions about what's important—back it all up. With the right solution, that one all‐inclusive backup will enable everything from single‐file restores to bare‐metal server recovery—and everything in between, including restoring an entire application to an earlier point in time.
  • Backups get corrupted—which is horrible. Backup solutions should tell you when something goes wrong—not wait for you to find out for yourself.
  • Not testing your backups is horrible. True, old‐school backup solutions make it incredibly difficult to actually test backups, but that's no excuse. Or rather, it's an excuse for finding a solution that makes testing backups easier, like one that supports restoring to a virtual machine or dissimilar hardware, so that you can practically test your restores.

There's plenty to learn from these stories, and plenty to look for in a Backup 2.0 solution. The key to the whole thing seems to be this idea of continuous data protection, grabbing individual blocks off disk as soon as they change and storing those blocks in a way that makes it possible to restore an entire server or just a single file. Yes, solutions like SharePoint, SQL Server, and Exchange Server add some complexity to the picture, but by continually streaming changed disk blocks into the backup system, you can grab any type of backup you need—continuously, without the point‐in‐time troubles of Backup 1.0.

Coming Up Next

It's time to dig into this Backup 2.0 philosophy a bit deeper. In the next chapter, I'll start looking at backups beginning with the most common type of backup—or what should be the most common type of backup: whole‐server backups. Whether you use them to protect data or to prepare for a complete, whole‐server disaster, these backups should be the staple of your disaster recovery plan.

I'll dive into the technical details and hurdles that whole‐server backups present, and cover some of the native solutions that Windows provides. I'll outline specific problems and challenges that both native and third‐party solutions have to deal with, and look at some of the Backup 1.0 methodologies you're doubtless familiar with. That will all help set the context for a discussion on rethinking server backups: I'll make a wish list of capabilities and features, outline better techniques that more closely align to business requirements, and point out ways that Backup 2.0 can make backup management easier, too. I'll finish by examining specific scenarios—like domain controllers, public key infrastructure (PKI), and Web servers—where these techniques offer an advantage.

Whole‐Server Backups

Recovered from the horror stories of the previous chapter? Ready to start ensuring solid backups in your environment, the Backup 2.0 way? That's what this chapter is all about, and what I call "whole‐server backups" is definitely the right place to begin. This is where I'll address the most common kinds of servers: file servers, print servers, directory servers, and even Web servers—the workhorses of the enterprise. I'll show you what some of the native solutions look like, discuss some of the related Backup 1.0‐style techniques and scenarios, and detail why they just don't cut it for today's businesses. Then I'll assemble a sort of Backup 2.0 wish list: All the things you want in your environment for backup and recovery. I'll outline which of those capabilities are available today, and wrap up by applying them to some real‐world server roles to show how these new techniques and technologies play out in practice.

The Technical Details

What's so technical about whole‐server backup? Windows stores critical data in a number of places, and some of those places are files and databases that are continually open and under modification by the operating system (OS): the Windows registry, Active Directory's (AD's) database, and certain critical OS files are just a few examples. These files are difficult to back up and restore simply because a backup application can't easily "lock" them to get a "clean," consistent image. In other words, because the files are constantly open and constantly changing, traditional backup solutions can't easily "see" the complete file to back it up.

Whole‐server backup also includes user files, such as those stored on a file server, along with numerous configuration databases associated with Windows itself. Although all of these might not be open at all times, they can still be tricky to back up because they may be open during the brief window when the backup software is running.

So the goal with whole‐server backup is to get a good, usable backup of the entire server. And by "usable," I mean that the backup can serve our main business goals related to backup and recovery, which I stated in Chapter 1:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

As little downtime as possible suggests that we need to do more than just back up the entire server in a way that facilitates restoring the entire server; we may also need to recover a single file, or a single configuration database, or a single AD object. Being able to recover just the data we want from a backup will help reduce downtime, as recovering a single file is obviously—or should be—much faster than recovering an entire server.

We also have to recognize that downtime doesn't just apply to the server that we're recovering; it also applies to the people who are waiting for the recovery to be complete. We can get a user back to work faster by recovering the one file that the user needs rather than having to recover the entire server. That said, we'll certainly want the ability to recover the entire server, in the event that a complete disaster occurs and we lose the entire server. When that happens, the ideal is to lose as little work as possible, meaning whatever we're using for backups should be continuous.

Traditional Techniques

The technical details here present one significant challenge: open file access. That is, the ability of a backup solution to get a complete, consistent backup of a file that's currently open and in use. Over the years, a number of techniques have been developed to address this issue.

One technique is called a locked file backup (LFB), and it's a fairly generic technique that's in use on many OSs. Basically, when an application requests a backup of a file, the LFB logic—which may be embedded in the OS or provided by an agent of some kind—checks to see whether the OS has the file open for any other applications. If it does, the LFB logic waits for a pause in write activity to the file—a pause lasting for some predetermined amount of time. When a pause occurs, the LFB makes a copy of the file as it exists at that moment, and offers that copy to the backup application. In some instances, LFB will operate on an entire disk volume rather than on a per‐file basis. Working across an entire volume helps ensure that the files are consistent with one another, but also makes it more difficult to find a pause during which all the open files can be copied and cached.
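As a rough illustration only (real LFB agents hook the OS at a much lower level), the logic boils down to something like the following sketch, which watches for a quiet period in write activity before copying the file. The quiet‐period length and the give‐up timeout are my assumptions:

```python
# Hypothetical sketch of locked-file-backup (LFB) logic: wait for a pause in
# write activity of a given length, then snap a copy of the file for the
# backup application. The file's modification time stands in for whatever
# mechanism a real agent uses to observe writes.
import os
import shutil
import time

def lfb_copy(path: str, dest: str, quiet_seconds: float = 5.0,
             give_up_after: float = 300.0) -> bool:
    deadline = time.monotonic() + give_up_after
    while time.monotonic() < deadline:
        quiet_for = time.time() - os.path.getmtime(path)   # seconds since last write
        if quiet_for >= quiet_seconds:
            shutil.copy2(path, dest)   # copy taken during the pause
            return True
        time.sleep(1)
    return False   # a constantly busy file can defeat this approach entirely
```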

Windows introduced a new technique with its Volume Shadow Copy (VSC) service. I'll discuss VSC's operation in more detail later in this chapter; for now, suffice to say that the VSC keeps a local, on‐volume backup of changed files. It can then offer those files to the backup application when a file's current version is open at the time of the backup. That means the backup is typically getting the previous version of a file rather than the current version; you might say that because the file is open, the "current" version hasn't yet been created.

Both of these techniques are, as you can see, pretty complicated, and they don't work every time. Files that are constantly open will defeat techniques such as VSC, for example. LFB can be defeated by files that never have a long enough pause in write activity or, in the case of whole‐volume LFB logic, by large, busy volumes that never have a long enough volume‐wide pause in write activity.

Native Solutions

I want to take a look at the native backup solutions that are built into the Windows OS. There are really two: the bundled backup application and the aforementioned VSC.

Windows Backup / Windows Server Backup

Prior to Windows Server 2008, Windows Server came with a fairly primitive backup and restore application that was essentially unchanged since its introduction in Windows NT 3.1 in the early 1990s. Figure 3.1 shows Ntbackup.exe, which offered basic backup and recovery of files on disk and, as shown, of the critical System State.

Figure 3.1: Windows Backup, or Ntbackup.

The System State consists of Windows' boot files, the Windows registry, COM class registration (primarily the names and locations of DLL files), and other configuration data related to the OS itself. The application could back up to locally‐attached tape drives or to a file, and it could recover the entire server or individual files. A shortcoming in the whole‐server recovery method is that you first had to install Windows, then use the application to recover the server. In other words, the application didn't support any kind of bare‐metal recovery, where the backup could be applied to a brand‐new server that had no OS installed. The need to manually install Windows prior to beginning the recovery added a few hours, at best, to the recovery process, and made the tool unsuitable for all but the smallest and most risk‐tolerant environments.

The application also lacked internal scheduling of backups. Instead, it relied on the Windows Scheduled Tasks functionality to schedule the command‐line version of the tool. This provided a basic ability to schedule backups, but from a business perspective, it was suitable only for the very smallest environments.

Although it valued the many third‐party applications that sprang up to fill the backup and recovery gap left by the native application, Microsoft eventually recognized the need to provide somewhat more modern and sophisticated backup capabilities in the OS's native toolset. So, in Windows Server 2008—after more than a decade of Ntbackup—Microsoft introduced Windows Server Backup. As Figure 3.2 shows, this feature must be explicitly added to the server by an administrator.

Figure 3.2: Adding Windows Server Backup to a server.

Once added, the application (see Figure 3.3) offers a number of improvements over its predecessor, including the ability to natively schedule backups and the ability to connect to remote servers and manage their backups. The application can also create a recovery disc, which makes bare‐metal recovery easier and faster.

Figure 3.3: Windows Server Backup in Windows Server 2008.

Aside from improvements in its user interface (UI) and the addition of native scheduling, Windows Server Backup isn't significantly different from its predecessor. It still works on a file‐by‐file basis (although it backs up entire volumes), still has the ability to back up the Windows System State, and so on. Some of the UI improvements are great—such as the ability to restore from a backup file that is located on the network—but in many ways, Windows Server Backup provides features that third parties were providing a decade earlier.

Both of these native applications share a major shortcoming: they are snapshot‐based, meaning they grab backups only during a designated backup window. Any changes made after one backup completed, and before the next one finished, would be lost in the event of a disaster or other problem.

Volume Shadow Copy / Previous Versions

Introduced in Windows Server 2003, VSC is a native, OS‐level feature that creates shadow copies of files on any volume for which the feature is enabled. Administrators specify a maximum amount of storage to be used for the shadow cache, and the service automatically makes copies of shared files that change, as well as files that are open when a backup is requested.

As Figure 3.4 shows, VSC is configured on a per‐volume basis on the server. Once enabled, it will automatically begin creating shadow copies of files that are contained within shared folders—one reason the configuration UI displays the number of active shares. VSC will maintain several shadow copies for a given file, providing multiple old versions of a file. It will continue making shadow copies until its administrator‐allotted space is full, and will then discard older copies to make way for new ones.

Figure 3.4: Configuring VSC.

Windows Server Backup and later versions of Ntbackup include support for VSC, meaning they can natively make use of VSC stores to back up applications that expose their data through VSC. Such applications include Microsoft SQL Server.

VSC has two distinct functions:

  • It will make shadow copies of open files automatically when those files are accessed for backup. This requires a backup application that is VSC‐aware, meaning the backup application has to ask for a shadow copy. Some applications, such as Microsoft SQL Server, are designed to expose data through VSC, providing a common interface that backup applications can use to access application‐specific data.
  • It will make shadow copies of shared files, and make those shadow copies accessible to end users through the Previous Versions feature in Windows client OSs, such as Windows Vista. This provides end users with a self‐service method for recovering individual files and folders that have previous versions available as shadow copies. Figure 3.5 shows an example.

Figure 3.5: The Previous Versions feature in Windows Vista.

VSC has a few shortcomings: First, it isn't exactly continuous. The feature doesn't make backups of a file each time the file is saved, although it will periodically make a new shadow copy of changed files. VSC doesn't provide fine‐grained control, either. In other words, an administrator can't decide which files are more important and should be retained longer; VSC discards shadow copies in a first‐in, first‐out cycle; administrators can only determine the total amount of space available for all shadow copies. VSC isn't useful in full‐server recovery, only in single‐file backup and recovery. VSC doesn't create shadow copies of files that aren't shared, and it doesn't protect the entire OS—it ignores System State, for example. Generally speaking, most administrators regard VSC as a good supplement to a proper backup application; VSC can help reduce Help desk calls for single‐file recovery by making that functionality more self‐service, but VSC itself is not a backup solution.

Traditional Non‐Native Solutions

Since the introduction of Windows server OSs, their fairly weak native backup and restore capabilities have spawned an enormous ecosystem of third‐party solutions. For years, however, these solutions essentially replicated the basic features of the native backup application. To be sure, they added a great deal of administrative convenience and flexibility, but they did basically the same job. Figure 3.6 shows one such application.

Figure 3.6: An example traditional third­party backup application.

With Windows Backup, administrators could select the files they wanted to back up; with third‐party solutions—as shown—they did basically the same thing. The third‐party solutions could, of course, compress the backed‐up data for transmission across the network to a central backup server with an attached tape library, which was a major improvement. Many provided powerful scheduling capabilities, designed to maximize the amount of data that could be copied in a limited backup window. Most provided backup media management, and most provided application‐specific agents to include applications such as Exchange Server and SQL Server in their backups. Most provided single‐item recovery as well as whole‐server recovery, and most provided the means to perform bare‐metal recovery in the event of a complete disaster. Ultimately, though, their reliance on the same basic principles and techniques as Ntbackup always led to the same basic problems and challenges. It's what makes this entire category of backups feel like Backup 1.0.

Problems and Challenges

Both the native Windows backup capabilities and the similarly‐designed traditional third‐party solutions share the same challenges and problems:

  • They're schedule‐based, meaning they're snapshot‐based. Without continuous data protection, you're always at risk for losing data. In most cases, backup windows are in the evening, so you're always at risk for losing an entire day's worth of work.
  • Media management is a significant challenge, and many administrators using these solutions spend a few hours each week just shuffling backup tapes.
  • Restoration is a time‐consuming process, as file indexes and data must be loaded from tape. Even restoring a single file can take a long time: the right tape must be located and loaded and read, the file must be selected from a list, the tape must be wound to the correct location, and finally the file can be read. What is it we want from our backups, again?

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

These Backup 1.0 techniques miss out on preventing the loss of any data or any work; there's always data at risk because of these solutions' snapshot basis. They don't ensure we always have access to our data, and the downtime they impose certainly isn't minimal. Even solutions that eschew tape in favor of more expensive disk‐based backup still have to perform an incredible number of file operations to locate the right data and bring it back into production.

There's a deeper problem that these old‐school techniques also tend to miss: data deduplication. Today's enterprises have a lot of duplicated data—duplicated files in users' home folders, for example. Traditional backup solutions tend to blindly copy everything they see, meaning you're wasting not only space storing that duplicate data but also time and capacity backing it up. Data de‐duplication is starting to become a watchword in storage management, and a Backup 2.0‐style solution would certainly include deduplication capabilities.

In the Old Days

I want to take a brief section to summarize some of the key Backup 1.0 principles and elements so that we can pull out their specific problems and think of ways to solve them or improve upon them.

Backup Techniques

The solutions I've discussed so far rely on a single primary technique with a few supporting ones. Mainly, these solutions are snapshot‐based, meaning they seek to back up a file as it exists at a given moment. They typically back up groups of files during a single backup window, which is often in the evening or during other periods of low or no utilization. Supporting techniques primarily revolve around backing up open files, and may include open file management utilities, LFB agents or features, or special features such as VSC.

Recognizing that some servers may contain too much data to be backed up in a single backup window, some organizations may choose to use a partitioning scheme. For example, half the server's data might be backed up in full one evening, while the other half receives an incremental or differential backup that evening. These management techniques help make the most of limited backup windows but also complicate restore procedures.
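A hypothetical version of such a partitioning scheme might look like the following sketch; the names and the rotation are invented for illustration, but notice how many pieces a single restore ends up depending on:

```python
# Hypothetical weekly rotation for a server whose data is split into two
# halves because a full backup of everything won't fit in one backup window.
schedule = {
    "Mon": [("half_A", "full"),        ("half_B", "incremental")],
    "Tue": [("half_A", "incremental"), ("half_B", "incremental")],
    "Wed": [("half_A", "incremental"), ("half_B", "full")],
    "Thu": [("half_A", "incremental"), ("half_B", "incremental")],
    "Fri": [("half_A", "incremental"), ("half_B", "incremental")],
}

# Restoring half_B as of Friday needs Wednesday's full plus Thursday's and
# Friday's incrementals: exactly the multi-piece restore that gets complicated.
for day, jobs in schedule.items():
    print(day, jobs)
```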

The main problem is that backup is snapshot‐based, meaning there is always some data at risk of loss. A secondary problem is that of duplicated data, which wastes backup storage space.

Restore Scenarios

Restoration typically involves restoring a single file or other resource, often at the request of a user who accidentally changed or deleted a file. Self‐service solutions such as Windows' Previous Versions feature (in conjunction with VSC) work well when the desired data is available for restoration; when such features aren't available or the needed data isn't available, administrators must turn to their backup solution. This typically involves identifying the correct version of the resource to be restored, locating the storage media containing that backup, and then reading the data from that media—which, in the case of tape, can require some time, as the tape doesn't support random access and must be wound to the correct read location.

The main problem is in identifying the needed data and locating the media on which it is stored. With complicated backup schemes, this may involve identifying a full backup, one or more incremental backups, and/or a differential backup. Another problem is in the time it takes to perform the restore, particularly with tape‐based backups. A final problem relates to the snapshot‐based nature of the backup, meaning there will be times when the desired data simply isn't available on a backup.

Disaster Recovery

Whole‐server disaster recovery usually comes in one of two forms:

  • The base OS must be installed first, possibly along with any recent service packs, using standard installation procedures. Then, one or more software applications must be installed to support the recovery. Finally, the recovery can begin in earnest, reading data from the backup media and restoring it to the server.
  • A bare‐metal recovery usually involves a special boot disc that includes a stripped‐down OS, such as WinPE or the Windows Recovery Environment, with enough smarts to access backed‐up data, format the server's disks, and copy the backup data to those disks. This form usually involves fewer steps and is much faster than the previous form.

Both techniques can be time‐consuming and suffer from the inherent snapshot‐based nature of the backup, meaning there will always be some data missing even in a perfectly executed recovery.

Backup Management

Backup management can be complicated using these Backup 1.0‐style techniques. Backup schedules and types (full, incremental, or differential) can be complex, and require careful management not only of schedules but also of storage resources, network capacity, and so forth. Physically managing tapes—rotating them off‐site, verifying their accuracy, and so forth—can be time‐consuming and error‐prone.

These days, backup management is complicated by many companies' data retention requirements. Do the backups include regulated information? In that case, they may need to be retained for a specific period of time or discarded after a specific period of time. They may need special security precautions, as well, to protect the data contained in the backup.

Rethinking Server Backups: A Wish List

Let's start thinking Backup 2.0: What are our old whole‐server backup techniques missing that we'd like to add? The sky's the limit; at this point, we don't need to worry about what's possible or available—just what, in a perfect world, we'd be able to improve.

New and Better Techniques

I think the most important improvement that Backup 2.0 can offer is a change from point‐in‐time snapshot backups to continuous data protection. Here's how it might work: Everything starts when an application—of any kind—makes a change to disk. Applications do so by passing data to the Windows OS's file system application programming interfaces (APIs), basically asking Windows to "save this data to this file." When that happens, Windows' file system takes the data and begins breaking it into blocks.

Blocks are the basic unit of management for data on a disk. When you format a disk, you select the allocation unit—or block—size. That determines how much data is contained within a given block of disk space. Figure 3.7 shows that Windows usually picks its own default value, which is what most people go with.

Figure 3.7: Formatting a disk and selecting block size.

A single block can hold contents from only a single file. If your block size is 2 kilobytes, for example, and you save a 512‐byte file, then three‐quarters of the block will be empty and wasted. Larger files are split across multiple blocks, and NTFS keeps track of which blocks go where: "file XYZ.txt consists of block 324, then block 325, then block 789," and so forth.

Microsoft recognizes that third‐party utilities might need to be notified of file operations, and might in fact need to modify or cancel those operations. That's how most third‐party disk quota systems and file‐filtering solutions work: they get the file system to tell them when files are changed on disk, and they update their quota database or block files from being saved, or whatever. The technique Microsoft provides for accomplishing this is called a file system filter or shim. Essentially, the shim registers itself with the OS, is notified of file operations, and is given the chance to block or allow each operation.
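As a quick illustration of the allocation‐unit math, assuming the 2‐kilobyte block size from the example above:

```python
# Hypothetical sketch: how a file's bytes map onto fixed-size blocks, and how
# much of the last block is wasted ("slack") at a given allocation-unit size.
import math

def block_layout(file_size_bytes: int, block_size: int = 2048):
    blocks_needed = max(1, math.ceil(file_size_bytes / block_size))
    slack = blocks_needed * block_size - file_size_bytes
    return blocks_needed, slack

# The 512-byte file from the example, stored in 2-kilobyte blocks:
blocks, slack = block_layout(512, 2048)
print(blocks, slack)   # 1 block, with 1536 bytes (three-quarters of it) wasted
```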

In the case of a Backup 2.0‐style solution, the shim might just pay attention to which disk blocks were being modified. As blocks are modified on disk, the shim could copy the blocks' data, compress it, and transmit it across the network to a backup server. Figure 3.8 illustrates the process I'm proposing:

Figure 3.8: Capturing changed disk blocks.

In Step 1, something on the server modifies a block of disk space. In Step 2, this change is passed to the backup software's file system shim, and the modified block is copied and transmitted to a backup server.

This technique would allow for continuous data protection. Of course, in addition to just copying blocks, the software would also make a note of which file each block went with. On the backup server, software would keep track of files and their associated blocks. This is a powerful technique: Rather than messing around with cumbersome and complicated open file techniques, as Backup 1.0 solutions would do, this system is grabbing the low‐level data changes as they are physically inscribed on the hard disk by the OS. The data is being grabbed below the level of an entire file, so even partial changes to a file—such as a Microsoft Access database—can be grabbed immediately, almost in real time, and the backup solution doesn't care whether the file happens to be open at the time. When someone needs to restore a file, the backup server's software simply looks up the most recently‐saved blocks for the file, and uses them to reconstruct the file in its most recent condition. Best of all, the central backup server could also save past copies of a file's blocks—meaning it could reconstruct the file as it existed at any particular point in time.
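Here's a highly simplified, hypothetical sketch of the bookkeeping I'm describing on the backup server side: every changed block arrives with a timestamp, and any file can be reconstructed from the newest version of each of its blocks at or before a chosen moment. The data structures and names are mine, not any particular product's:

```python
# Hypothetical sketch of the backup server's bookkeeping for continuous,
# block-level protection with point-in-time reconstruction.
from collections import defaultdict

# (file path, block index) -> list of (timestamp, data), appended as changes arrive
history: dict[tuple[str, int], list[tuple[float, bytes]]] = defaultdict(list)

def record_change(path: str, block_index: int, data: bytes, timestamp: float) -> None:
    history[(path, block_index)].append((timestamp, data))

def reconstruct(path: str, as_of: float) -> bytes:
    """Rebuild the file as it existed at time as_of."""
    blocks = {}
    for (p, index), versions in history.items():
        if p != path:
            continue
        eligible = [data for ts, data in versions if ts <= as_of]
        if eligible:
            blocks[index] = eligible[-1]   # newest version not after as_of
    return b"".join(blocks[i] for i in sorted(blocks))

record_change("ledger.mdb", 0, b"v1-block0", timestamp=100)
record_change("ledger.mdb", 1, b"v1-block1", timestamp=100)
record_change("ledger.mdb", 0, b"v2-block0", timestamp=200)
print(reconstruct("ledger.mdb", as_of=150))   # b"v1-block0v1-block1"
print(reconstruct("ledger.mdb", as_of=250))   # b"v2-block0v1-block1"
```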

If the saved blocks were stored on a high‐speed storage system, such as a RAID array, files could be reconstructed almost instantly and saved to any location on the network.

Considering our prime directive for backups:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

This Backup 2.0 technique would do the trick. We really wouldn't lose any data or work, because low‐level changes would be captured continuously. We wouldn't need to worry about backup windows because the entire day would be our backup window. We'd always be able to reconstruct data as needed, with minimal downtime.

Note: In fact, solutions exist that do just this: Although they may use different under‐the‐hood techniques than the file system shim I've described, the Backup 2.0 idea of copying blocks nearly instantly is very much a real thing, not just a wish.

This technique also has the advantage of being able to capture everything that goes to disk, including file permissions, alternate NTFS streams, registry changes, AD changes, and more, all without necessarily needing any special knowledge of file types or structures.

In fact, this technique is logically (although not physically) similar to RAID 1 disk mirroring, which also operates at a block level—albeit normally via a RAID controller card and not via software, although software RAID 1 is also possible. Rather than mirroring blocks in real time to a separate disk, this backup technique "mirrors" the blocks across the network to a backup server. And, rather than only keeping the current block data, as in a disk mirror, the backup solution can keep past copies, too, enabling recovery to any earlier point in time.

Better Restore Scenarios

The trick to all of this, of course, is having great management software on that central backup server. It needs to be able to track which changed blocks go with which files, for example, and for convenience may want to keep track of special files like the Windows registry or AD databases.

The backup server software could easily expose a self‐service UI, if desired, to provide functionality similar to—but more controllable than—Windows' Previous Versions/VSC feature. The software could also more easily recover data in applications such as AD, because it could reconstruct a portion of the database file, or even the entire database file, down to a specific point in time. (In all probability, you'd want the software to be able to attach "points in time" to specific AD operations, such as the deletion of a user or the creation of a group, just so you could more easily identify the point in time you want to roll back to.)

Better Disaster Recovery

With a copy of every block of disk space, restoring an entire server would be straightforward: You'd need some sort of recovery disk, such as a bootable DVD, to get the server up and running and to talk to the backup server software. You'd then simply stream the most recent version of every disk block back to the server, write those blocks to disk, and then restart the server normally. With good compression to speed up the transfer of data across the network, bare‐metal recovery could be done quickly—and you'd lose very little, if any, data.

In fact, the opportunity exists to recover the server to a virtual server, if need be—an excellent disaster recovery scenario as well as a powerful physical‐to‐virtual migration technique. You might well be able to recover to dissimilar hardware, too, since in many cases Windows can re‐adjust itself when it finds itself suddenly running on different hardware.

As you begin looking at block‐based recovery solutions, ask for details on how they deal with dissimilar bare‐metal recovery scenarios. Would Windows require re‐activation when it finds itself running on different hardware? How dissimilar can the hardware be? The more flexibility the solution offers, the greater the number of scenarios in which it will be able to save the day.

Here's a killer scenario: Imagine grabbing real‐time disk blocks from a production server, and then immediately applying ("restoring") those blocks to a virtual machine. Instant virtual standby! If the main production server fails, the virtual server can step in and take over with little downtime and little or no data loss. Depending on how you architect the virtual infrastructure, a single virtual host might support several virtual standbys. Those standbys might operate with less performance than the non‐virtual production machines (again, depending on how you set things up), but you'd have a "hot spare" any time you needed one. Figure 3.9 illustrates this idea.

Figure 3.9: Block‐based virtual hot spares.

Easier Management

Managing block‐level backups can be much easier because they'll typically be stored first on disk. You can then make tape‐based backups from the disk‐based backups—meaning your production servers don't participate in the tape‐based backup, and your "backup window" can be as long as you like. Tape backups would truly represent time spans, and could be complete and internally consistent—unlike today's mix of full, incremental, and differential backups, which must be treated as a set in order to retain their usefulness.

Block‐based backups would also be an effective way to implement data de‐duplication, as sets of blocks could easily be indexed and compared to check for duplication. That would help cut down on both disk and tape storage, as well as all the management overhead that comes with any kind of storage resource.

Note: Some data de‐duplication vendors claim that you can reduce the size of your backups by up to 70%. Even if that's an optimistic number, you could conservatively expect to save a significant amount of space!
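To illustrate why block‐level storage lends itself so naturally to de‐duplication, here is a small, hypothetical Python sketch: each unique block is stored once, keyed by a hash of its contents, and each file keeps only the list of hashes needed to rebuild it. The block size and sample data are made up for the example.

```python
import hashlib

def dedupe(blocks):
    """Store each unique block only once; the 'recipe' of per-block hashes
    is enough to reassemble the original stream later."""
    store = {}      # hash -> block data (kept once, no matter how often it appears)
    recipe = []     # ordered list of hashes describing the original data
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

# Ten identical 4KB blocks plus one unique block...
blocks = [b"A" * 4096] * 10 + [b"B" * 4096]
store, recipe = dedupe(blocks)
print(f"{len(blocks)} blocks in, {len(store)} unique blocks stored")
# ...occupy the space of 2 blocks instead of 11.
```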

The backup software could also, in theory, allow you to mount backed‐up data as a real disk volume. It would simply need to provide some kind of disk driver for Windows that could talk to the backup software and reassemble, in real time, the most recent backed‐up blocks into a single disk volume. You could then mount and explore backed‐up data as easily as live data, allowing you to compare files, drag and drop files to different locations, and so forth. If the backup solution was storing past versions of block data, you could mount a disk that resembled any given point in time, making it easy to compare files or data from different points in time.

Great for…

Again, most of the capabilities I've wished for are available today from a variety of third‐party vendors. Backup 2.0 for entire servers isn't something you have to wish for; it's something you can choose to do now, helping you achieve the capabilities that backups have always promised to provide:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

The following sections highlight a few specific scenarios where these Backup 2.0 techniques could save the day.

Domain Controllers

Today, companies spend thousands on AD backup solutions that create point‐in‐time backups of AD and allow for single‐object recovery as well as more complete whole‐directory disaster recovery. A Backup 2.0‐style solution, however, would be able to natively back up AD, because in the end, AD's complex database is just blocks on disk. Certainly, a backup solution would need to offer AD‐specific functionality for restores, or might even offer an AD management console extension that made AD recovery capabilities available right from that console. But when you start using block‐level backups, AD suddenly isn't so difficult to back up. You can get real‐time backups of every change, automatically, with no extra effort.

Infrastructure Servers

Infrastructure servers are perfect for Backup 2.0: You'll capture every DNS change, every DHCP change, and more—on servers that you might only back up once a week in the Backup 1.0 world. Why bother with real‐time backups on an infrastructure server?

  • If you have to do a bare‐metal recovery, things like DHCP leases will still be intact—your network won't have to go through a period of recovery and readjustment.
  • Both static and dynamic DNS records will be maintained—again eliminating the period of confusion that normally occurs when a DNS server crashes and is brought back online.
  • Still using WINS? Even that database can be backed up in real time and recovered to a specific point in time—helping eliminate the need for your network to start massive broadcasts and re‐registrations, as would happen when an empty or out‐of‐date WINS database was recovered.
  • Infrastructure servers can be easily restored to a virtual machine in the event of a complete disaster—even an off‐site virtual machine, making it easier to re‐create your exact production infrastructure, virtually, in a disaster recovery mode.

Web Servers

Web servers are just files and folders, right? Why not capture every change, and make it easier to roll back an entire Web site to a known‐good point in time, including Web server configuration files and metadata? It doesn't matter whether you're using IIS or Apache or something else as your Web server; it's all blocks on disk, and a Backup 2.0 solution can grab it all.

Backup 2.0 can even help make Web farm management easier. Designate one Web server as your "master"—perhaps it's even a protected server that doesn't accept live traffic. Back that up in real time using a Backup 2.0‐style solution, capturing newly‐uploaded files from your Web developers as well as configuration changes from Webmasters. Restore those changes to multiple "hot spares" in your Web farm—with no effort on your part, every member of your Web farm now has identical Web content and server configuration files! Your master content would remain protected, so even if one of your Web servers is compromised, you can easily—and quickly—restore it to the most recent, correct version and bring it back into production with little effort.

Want to rebuild your Web farm as virtualized servers? No problem: Just use your Backup 2.0 solution to "restore" your master Web server to one or more virtual machines. Reconfigure a few settings such as computer names, reconfigure your load balancer to point to the new servers, and your Web farm is moved. With the right management tools on top of the basic disk block‐based backup system, it's all possible—and it brings value to your backup solution that extends beyond mere backup and recovery.

Public Key Infrastructure

If you operate a Public Key Infrastructure (PKI), you know how critical those Certification Authority (CA) servers are. With disk block‐based backups, you'll always have a reliable backup of every CA—including every certificate, every public key, every revocation, and every outstanding certificate request. In the worst‐case scenario of a complete CA failure, you can still bring the PKI back up quickly by restoring the most recent disk blocks to the original hardware, to dissimilar hardware, or even to a virtual server.

Coming Up Next…

In the next chapter, I'll be taking the same approach as this one but focusing exclusively on Microsoft Exchange Server. There's no question that Exchange forms a big part of most organizations' "mission critical" list, and having solid, reliable Exchange backups is equally critical. But Exchange certainly makes backup and recovery a bit more challenging, as you need to support a high level of granularity from single‐message recovery to bare‐metal server restores. You've got to do all that despite the fact that Exchange's database is essentially a big black box, not a bunch of more easily‐manipulated little files. It's a real test of the Backup 2.0 methodology.

Exchange Server Backups

Ask anyone in the organization what their most mission‐critical piece of infrastructure is, and you'll probably hear "email" as a common answer. Or you might not: Many folks take email for granted, although they expect it to be as available and reliable as a telephone dial tone. Users who have never suffered an email outage almost can't imagine doing so; once they do experience an outage, they make sure everyone knows how much they're suffering. As one of the most popular solutions for corporate email, Exchange Server occupies a special place in your infrastructure. It's expected to be "always on," always available, and always reliable. Disasters simply can't be tolerated. What's more, users' own mistakes and negligence become very much your problem, meaning you have to offer recovery services that are quick and effective, even when you're recovering something that a user mistakenly deleted on their own.

Native Solutions

Exchange Server's native backup and restore capabilities are tied in part to the underlying Windows operating system's (OS's) capabilities—which isn't always a good thing. Part of Exchange's recovery capability comes from the fact that deleted messages aren't actually deleted from the system. Email clients such as Microsoft Outlook automatically move deleted messages into a Recycle Bin, where they stay for a configurable period of time or until the user manually empties the Recycle Bin. When using other email clients, such as a generic IMAP client, deleted messages are retained on the Exchange Server computer even if they're not actually moved to the Recycle Bin; deleted messages are simply left in their original folder and hidden from the user's view until a configurable amount of time has passed, or until the user specifically purges deleted messages as part of a "cleanup" operation. As Figure 4.1 shows, Outlook actually displays these deleted‐in‐place IMAP messages in a special font rather than hiding them completely.

Figure 4.1: Deleted messages in Microsoft Outlook.

All of this functionality is designed to provide users with a self‐service recovery option: If they accidentally delete a message, they can either undelete it or retrieve it from the Recycle Bin.

Once a message has passed beyond the realm of the Recycle Bin or other undelete options, recovering a message becomes your problem. Unfortunately, Exchange Server doesn't include a complete built‐in backup and recovery mechanism; what it does include is an application programming interface (API) that allows other applications to interact with Exchange Server to make backups and perform restores. Exchange also provides support for Windows' Volume Shadow Copy Service (which I discussed in the previous chapter) for backups—the Windows‐native Backup utility included with Windows Server 2003 can use that interface to make backups of Exchange.

Note: In what must have been a miscommunication between product teams, the new Windows Server Backup utility included in Windows Server 2008 initially offered no support for Exchange Server backups, meaning you had no native option for producing backups of your Exchange data. Microsoft does offer an additional‐cost backup solution that supports Exchange backups, and obviously numerous third parties provide varying levels of support for Exchange backups. As of Exchange Server 2007 Service Pack 2, Exchange includes a plug‐in that enables Windows Server Backup to natively back up and restore Exchange databases.

As of Exchange Server 2007, Exchange includes a feature called Cluster Continuous Replication (CCR). CCR is designed to replicate Exchange database transactions—the individual changes that are made to the database—to a separate Exchange Server computer. There are specific hardware, software, and environmental requirements to make CCR work, and it does require that you have additional Exchange Server computers in the environment. CCR is Microsoft's preferred solution for whole‐server recovery because it essentially keeps a spare copy of the Exchange database on a separate machine. The costs involved in CCR can be quite high, however, because you're basically maintaining a complete, spare Exchange Server machine—hardware and all, unless your spare is virtualized—just sitting around waiting for the first server to fail.

Figure 4.2 shows what CCR looks like; notice that a third component, the witness, exists here. The witness's job is to make sure that the primary active node is working; if it isn't, the witness is key to making an automatic failover to the passive node happen.

Figure 4.2: CCR.

A variation of CCR is Local Continuous Replication; LCR differs in that it uses transaction replication to create a copy of the Exchange database on the local server, on a separate set of disks. This gives you a copy of the database without the need for a separate server, although your Exchange Server hardware obviously remains a single point of failure in that scenario. LCR is less expensive than CCR but does require extra locally‐attached storage on the Exchange Server computer.

CCR is primarily useful for whole‐server failures—a full‐on disaster, in other words (LCR is only useful for protecting against a database failure—if the entire server fails, you lose the LCR replica, too). Neither of these replication techniques is especially useful for recovering single messages. In fact, neither Windows nor Exchange offers a particularly effective solution for single‐message recovery. Instead, they rely entirely on the Recycle Bin functionality implemented in Outlook, the most commonly‐used Exchange client application.

Problems and Challenges

Exchange offers unique problems and challenges in the backup arena. I'll discuss many of these in more detail later in this chapter, but for right now, I want to briefly introduce them from a business, rather than a technology, perspective.

A Bit About How Exchange Server Works

Most of these derive from Exchange's architecture, so it's worth talking a bit about how that architecture works. Exchange is built around a transactional database. In this regard, Exchange is similar to SQL Server, although the underlying structure of Exchange databases is very different from the relational ones found in SQL Server. Transactional means some important things:

  • When changes are made to the database, a note of those changes is first made in the transaction log, a special file managed by Exchange Server. Changes might include new messages, incoming messages, or even changes to messages—such as marking a message as having been read or replied to. The transaction log is basically a "to do" list—"mark message #164783 as read" or "insert new message #7847829 with the following contents."
  • After noting the change in the transaction log, Exchange searches its database for the bits of data that actually need to be changed. Those bits are loaded into the server's memory, and the changes are applied to the data in‐memory.
  • After a certain amount of time, the changed data is written back to the database on disk. This happens quickly, but often not immediately; Exchange may choose to cache data in‐memory so that multiple changes can be applied in succession.
  • Once changed data is successfully written back to the database on disk, Exchange goes to the transaction log and "checks off" the change as having been completed and committed.

If the Exchange Server computer crashes or loses power, no work is lost provided the transaction log is intact. Even though changes existing only in memory may not have been safely written to disk, Exchange can go back to the transaction log and simply re‐do, or replay, any transactions not "checked off" as committed. This transaction log also provides the basis for LCR and CCR: Individual transactions are replicated from the Exchange Server log and replayed, creating an exact duplicate database.
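If it helps to see that write‐ahead sequence spelled out, the following Python sketch models it in a few dozen lines. The class and method names are invented for illustration only; Exchange's actual database engine is vastly more sophisticated, but the log‐first, apply‐in‐memory, flush‐later, check‐off pattern is the one described above.

```python
class ToyMailStore:
    """Models the write-ahead pattern: note the change in a log, apply it in
    memory, flush to disk later, then check the log entry off as committed."""

    def __init__(self):
        self.log = []        # transaction log entries (these survive a "crash")
        self.memory = {}     # in-memory cache of changed data (lost in a crash)
        self.disk = {}       # data actually written to the database file

    def apply(self, key, value):
        # 1. Note the change in the transaction log first
        self.log.append({"key": key, "value": value, "committed": False})
        # 2. Apply the change to the data in memory
        self.memory[key] = value

    def checkpoint(self):
        # 3. Write the cached changes to the database file on disk...
        self.disk.update(self.memory)
        # 4. ...then "check off" the logged changes as committed
        for entry in self.log:
            entry["committed"] = True

    def recover(self):
        """Replay any logged change that was never checked off as committed."""
        for entry in self.log:
            if not entry["committed"]:
                self.disk[entry["key"]] = entry["value"]
                entry["committed"] = True

store = ToyMailStore()
store.apply("msg#7847829", "insert new message with the following contents")
store.memory.clear()     # simulate a crash before any checkpoint: memory is gone
store.recover()          # the surviving log lets the work be replayed
print(store.disk)        # the logged change made it to disk, so no work was lost
```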

Why Exchange Server Backups Can Be Tricky

So the Exchange database files are actually the result of millions of transactions being applied in‐memory and then saved to disk. The database is obviously indexed so that individual messages can be found easily, and a great deal of structured data—such as which messages belong to which users—is stored in the database along with actual message data. Normally, while Exchange Server is running, only its processes have access to the database. All of this conspires to make certain backup and recovery tasks a bit more complicated:

Individual Item Recovery. Because individual messages are stored in a monolithic database rather than as individual files on disk (as is the case with most Unix‐based mail applications), recovering an individual message can be tough. You might, for example, have to retrieve the right whole‐database backup, then dive into it as Exchange would to find the message or messages you're after.

Data Corruption. The slightest data corruption on a backup can render the entire database useless—meaning whatever you're using to make backups has to be 100% reliable and capable of detecting and repairing data corruption.

Data De‐Duplication. Exchange actually has built‐in data de‐duplication of a sort. Individual messages (and, if configured, attachments) addressed to multiple users are stored only once in the database, provided all the users' mailboxes are in the same message store. Thus, backup software gets the advantage of this de‐duplication, creating smaller backups than if each message were extracted individually.

However, in order to facilitate single‐message recovery, many backup solutions not only back up the entire database but also extract individual messages and back those up independently. This lets them index, search, and recover individual messages—but, with some solutions, it loses the built‐in de‐duplication and increases the size of the backup data.

Search and e‐Discovery. Exchange doesn't have good built‐in capabilities for searching across the entire database, which is often needed when you need to recover a message or perform e‐Discovery for legal purposes. Because many backup solutions grab the entire Exchange database, they're often incapable of searching through that database either.

Obviously, some of these special concerns are ones we'll have to address in a Backup 2.0 solution.

In the Old Days

I want to look at how old‐style Backup 1.0 solutions address both the basic and special needs of Exchange Server recovery. It's important to recognize what works and what doesn't in these traditional solutions so that we can identify areas for improvement as well as areas that should be retained in a new‐style, Backup 2.0 solution.

Backup Techniques

Exchange Server backups can be complicated from a process viewpoint. Consider Figure 4.3, which is excerpted from a blog entry at http://blog.thejpages.com/2008/03/18/why‐are‐we‐still‐backing‐up‐exchange.aspx. The author proposes that Exchange Server not be backed up at all. Instead, he suggests enabling circular logging—meaning Exchange's transaction log will automatically overwrite older entries as needed to write new ones. Using CCR, the Exchange database is replicated—in this proposal—to two passive nodes, making two complete copies of the database. One passive node sits in the data center, ready to take over in the event of a failure in the production Exchange computer. The other passive node is in an offsite location and delays replaying incoming transactions by 7 days—meaning its copy of the database is always 7 days old.

Figure 4.3: Proposal to not back up Exchange.

There are some downsides to this approach. Although it provides almost instant recovery with very little lost data, it would be very expensive to implement. You're looking at two additional Exchange Server licenses, even if they're on virtual machines. If they're not virtualized, that's two additional Exchange Server machines, too, and a great deal of network bandwidth. You'd still have to have some kind of backup running to support users who delete messages and want them back: in the proposal, Exchange retains deleted items for just 30 days, and many organizations must, due to legal or industry regulations, retain messages for a far longer period of time.

Traditional Exchange backups typically seek to grab the entire database, usually by connecting to Exchange Server's Volume Shadow Copy Service support. As mentioned earlier, these solutions may also extract individual messages through other APIs, giving them not only a complete copy of the database—used for disaster recovery—but also access to individual messages.

Restore Scenarios

The most common restore scenarios in Exchange are single‐message recovery or single‐mailbox (including all of its messages) recovery. An article at http://www.msexchange.org/tutorials/ExMerge‐Recover‐Mailbox.html details a fairly common way of doing this in Exchange Server 2003: Start by installing the free ExMerge utility (available from http://microsoft.com/download). Restore your database backup from tape or wherever it lives—you may wind up restoring it to a different Exchange Server so that you're not affecting your production server. As Figure 4.4 shows, you'll be able to use ExMerge to export the desired mailbox to a PST file, which can be opened with Microsoft Outlook. If you want to recover a single message, you attach that PST to an Outlook client and go hunting for the message you want. Messages can be "dragged" out of the PST file, via Outlook, and "dropped" into an active Exchange mailbox to get the message back onto the server.

Figure 4.4: Recovering a single mailbox in ExMerge.

Newer versions of Exchange offer a Mailbox Recovery Center that performs essentially the same task without the need for ExMerge. You still have to add a storage group to the Recovery Center, restore an Exchange database, mount the database into the Recovery Center storage group, then go browsing for the mailbox you want to recover. Figure 4.5 shows what the Recovery Center looks like. It's still time‐consuming, but perhaps prettier than ExMerge.

Figure 4.5: Mailbox Recovery Center.

Frankly, this whole process is nuts. It's slow, cumbersome, and incredibly labor‐intensive for administrators. This is why I've never been in a single environment that doesn't have some kind of third‐party backup solution—often just to provide more efficient single‐message and single‐mailbox recovery.

Third‐party backup solutions work more quickly but involve substantially the same process. You go get your backup data off of tape—which will take some time because you might have to restore a full backup and multiple incremental or differential backups to recreate the point in time you want to restore from. The backup software usually maintains its own indexes of available messages and mailboxes, so you browse or search through that until you find what you want. What the solution typically automates is the pain of extracting the mailbox or message from the database and putting it back into Exchange Server—so the process is less labor‐intensive for the administrator but still pretty awkward and not necessarily very fast.

Disaster Recovery

Disaster recovery in Exchange is straightforward—and usually time consuming. You restore the most recent full backup. You restore any incremental or differential backups made since then. You restore any transaction log backups made since then. Finally, you stand back and let Exchange sort it all out—and be prepared to wait because the process can take hours. The development of CCR and LCR technologies was driven in large part by the time‐consuming nature of more traditional backups; with CCR or LCR, you've got a spare database sitting right there, ready to be used—provided what you're after is the most recent database. In true disaster recovery situations—a total hardware failure or even a data center disaster—you usually do want to get the latest version of the database back up and running quickly.

Where CCR and LCR fail is if something goes wrong that gets replicated—in those cases, the copy of the database will also contain whatever went wrong, and you'll be back to time‐consuming tape restores to rebuild your databases.

Backup Management

Depending on the backup solution you're using, managing traditional backups can be quite a science. Because Exchange databases can grow quite large, some organizations don't have the time to grab a full database backup as often as they'd like. That means you're stuck managing full backups, incremental backups, differential backups, and in many cases, log backups—backups of the Exchange transaction log.

In fact, managing transaction log backups is crucial to minimizing data loss in the event of a failure. The transaction log literally contains every single piece of work that Exchange needs to do; Exchange's ability to replay that log to re‐create work is an effective recovery technique. Some organizations will grab transaction log backups throughout the day; many third‐party Exchange backup solutions will bundle transactions from the log into 15‐minute "recovery points." Of course, losing 15 minutes of email traffic is still a pretty big disaster.
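As a rough sketch of what a 15‐minute "recovery point" amounts to, the snippet below groups time‐stamped transactions into fixed intervals; anything sitting in the current, not‐yet‐shipped interval is what you stand to lose. The function name and sample data are invented for illustration.

```python
from collections import defaultdict

def bundle_into_recovery_points(transactions, interval_minutes=15):
    """Group time-stamped transactions into fixed-size recovery points.
    transactions: list of (minutes_since_midnight, description)."""
    points = defaultdict(list)
    for minute, description in transactions:
        point_start = (minute // interval_minutes) * interval_minutes
        points[point_start].append(description)
    return dict(points)

tx = [(0, "msg 1 delivered"), (7, "msg 2 read"), (16, "msg 3 deleted")]
print(bundle_into_recovery_points(tx))
# {0: ['msg 1 delivered', 'msg 2 read'], 15: ['msg 3 deleted']}
```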

The downside to all this is simply management overhead and storage space. Exchange backups can occupy a lot of disk and tape space, and keeping track of all the files can be complex. In fact, most third‐party Exchange backup solutions are primarily solutions for managing backup files—after all, the actual backup functionality comes from Exchange's APIs.

Rethinking Server Backups: A Wish List

Let's revisit our Backup 2.0 "mission statement" from Chapter 1:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

This is a tricky statement to evaluate when it comes to Exchange. Certainly, with CCR or LCR, we can achieve backups that offer very little downtime; in the case of CCR, downtime might amount to a few seconds. We would certainly lose very little data, although some data loss is possible because both CCR and LCR utilize asynchronous replication, meaning it's possible for a few minutes' worth of transactions to occur on the source, yet not replicate to the mirror when a failure occurs.

But CCR and LCR have two distinct problems: First, they only maintain a current, working copy of the database; they don't provide a long‐term archive, and they don't allow you to restore to a particular point in time. If you accidentally delete a mailbox, that deletion is replicated; after the deleted mailbox retention period elapses, the mailbox is gone forever, no matter how many CCR or LCR replicas you have. Second, both CCR and LCR are expensive—in terms of storage resources for both, and in terms of additional hardware and software licenses for CCR. Neither LCR nor CCR is designed to provide single‐message recovery beyond the deleted item retention period in Exchange.

Note: It's been suggested to me by some administrators that they could simply set the deleted item retention period very high—as high as 5 years in one case I saw. I don't recommend it; the Exchange database will get huge, and it simply isn't intended to be a permanent archive. Performance will suffer, and it'll become harder and harder to get real backups as the database bloats.

Traditional third‐party backup solutions, which rely either on the Exchange transaction log or the Volume Shadow Copy Service interface, have all the failings of any Backup 1.0 solution—which I explained at length in the previous chapter. Backup file management, backup windows, time‐consuming restores, and other disadvantages have been frustrating Exchange administrators for more than a decade.

Let me clue you in to a little secret: The reason traditional Exchange backups can be frustrating or incomplete—even in the case of CCR and LCR—is that they rely on catching transactions as they occur or reading information from Exchange's own APIs. In other words, everything depends on how Exchange works. For an up‐to‐date backup, like what CCR and LCR offer, you have to replicate transactions. For a full point‐in‐time backup, you have to talk to Exchange and deal with the data the way Exchange gives it to you. When it comes to Exchange Backups 1.0, that's the ultimate problem. The solution? Stop dealing with Exchange for your backups.

New and Better Techniques

In the previous chapter, I posited a new type of backup solution that focused on disk blocks. Figure 4.6 illustrates my proposal: Forget about talking to APIs and just grab the information as it is written to disk.

Figure 4.6: A proposal for Backup 2.0.

Think about it: Software developers—like the ones who wrote Exchange Server—know that server memory isn't reliable. A loss of power, a software crash, whatever, and memory is lost. Disk, however, is much more reliable and is persistent. Exchange's transaction log is designed to help provide a cover for unreliable memory: In the event of a memory failure, the transaction log allows work to be replayed. For that reason, the log itself sits on persistent disk storage.

That means a Backup 2.0 solution can simply grab blocks of disk space as they are changed on disk. That will grab any changes to the transaction log as well as changes to the main Exchange database files (there are a few files for every mail store). By immediately shipping copies of those disk blocks to a separate server, you have a continuous backup that doesn't rely on talking to Exchange APIs, replicating transactions, or any other complexity. Now, that's all well and good for simple files on disk: If you want to restore something, just track down the disk blocks that comprise the file and put those blocks back on the server. Poof, file restored. For Exchange disaster recovery, as I'll explain in a moment, it's a fast way to get an entire server back online. However, how does it help with the more common restore scenarios like single mailbox recovery or single message recovery?

Better Restore Scenarios

This is where we put Backup 2.0 to the test: Does it meet the mission statement when it comes to Exchange?

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

I think it does. We've got a nearly real‐time backup of everything that gets written to Exchange's disks—including databases and transaction logs. That means we don't lose any data, and restoring data doesn't need to include taking Exchange offline (unless we're talking about a complete disaster recovery scenario—which I'll cover next).

With the right tools and interfaces, Backup 2.0 enables single‐item recovery—something I'll outline a bit later in this chapter. That's really the key restore scenario, although you might also, from time to time, need to recover an entire Exchange database and mount it elsewhere for testing purposes.

For example, I have one client who routinely restores Exchange databases to a disconnected Exchange Server, where they perform vulnerability scanning and anti‐spam testing. Backup 2.0 excels at this, as it can bring back an individual file—even one as large as an Exchange database—very quickly. But that's kind of a Backup 1.0 mentality: With Backup 2.0, you could make that testing operation even faster by simply exporting the entire server's disk image into a bootable virtual machine. That virtual machine can be easily segregated for testing so that it wouldn't interfere with the production network (that kind of testing is where virtualization saw its first widespread uses, in fact).

Note: Backup 2.0 is all based on disk blocks—raw, disk‐level data. Where that data sits doesn't matter, meaning Backup 2.0 is also a great way to do physical‐to‐virtual, physical‐to‐physical, virtual‐to‐virtual, and virtual‐to‐physical moves and migrations. I've used Backup 2.0‐style solutions in many cases to move physical Exchange Server computers into virtual machines as part of a larger enterprise virtualization project.

Better Disaster Recovery

Disaster recovery is what Backup 2.0 is all about, and Exchange Server is no exception. With a disk block‐based backup image, you can quickly restore your entire Exchange Server to not just the most recent backup but also to any given point in time. You can even restore your Exchange Server to a virtual machine, which is great for huge disaster recovery scenarios where you might be hosting those virtual machines at a recovery facility or even in some online hosting provider.

So how does Backup 2.0 complement or interfere with CCR, Exchange's built‐in recovery solution? Keep in mind that CCR requires a passive, standby server. That means the expense of additional Windows and Exchange licenses, and possibly the expense of dedicated hardware, all sitting idle until a failure occurs. That's an expense some organizations are happy to bear, but it's not for everyone. In some cases, your CCR passive node might be a less‐capable machine (that might be the case if it was running in a virtual machine, for example) designed to get you through a tough time with less‐than‐normal performance. In those instances, Backup 2.0 can help by getting Exchange up and running more quickly on your original, full‐powered hardware. For organizations that can't afford CCR, Backup 2.0 provides what is perhaps the next best thing: A fast way to bring Exchange back online in a bare‐metal recovery situation.

Backup 2.0 complements CCR in that Backup 2.0 provides the ability to roll back to a previous point in time; CCR does not. CCR's goal, remember, is to create an exact replica of your Exchange databases with as little latency as possible. That means if you do something wrong, that something is going to replicate via CCR very quickly, meaning your "backup" is also messed up. CCR can't undelete a mailbox or a message, and it can't help recover from accidental or malicious actions. Backup 2.0, however, can do so—and it can protect your passive CCR nodes at the same time it protects your active Exchange Server computers.

What I like best about Backup 2.0 is that the backups can be restored almost anywhere. Lose an Exchange Server computer and don't have a spare handy? No problem: Restore to a virtual machine (I tell clients to always have at least one virtualization host hanging around that has some spare capacity—for just these emergencies). No reinstalling Windows, reinstalling Exchange, reconfiguring Exchange, and waiting on tape backups to unspool your backed‐up databases—just dump the entire server disk image into a virtual machine. It's a fast process and the result is a completely‐configured computer that is the computer you lost. Clients just reconnect and start working.

Easier Management

My Backup 2.0 idea does need to have a few more Exchange‐specific capabilities. For example, Exchange is designed so that transactions remain in its log until a backup of the database is made; that way, you're assured of the log serving as a backup for transactions until the related data is safely on tape. In a Backup 2.0 world, traditional backups don't occur—so the backup agent running on the Exchange Server computer needs to be smart enough to truncate the Exchange transaction log after transmitting the related disk blocks off to the backup server.
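Here is a hedged sketch of that truncation rule: the agent discards only those log entries whose changes the backup server has already acknowledged. The sequence numbers and function name are hypothetical; a real agent would use Exchange's supported backup interfaces to signal that a backup has completed.

```python
def truncate_log(log_entries, last_acknowledged_seq):
    """Keep only the log entries the backup server has NOT yet confirmed.

    log_entries: list of (sequence_number, payload) in ascending order
    last_acknowledged_seq: highest sequence number the backup server confirmed
    """
    return [(seq, payload) for seq, payload in log_entries
            if seq > last_acknowledged_seq]

log = [(1, "change A"), (2, "change B"), (3, "change C")]
# The backup server has confirmed it holds the blocks for changes 1 and 2...
log = truncate_log(log, last_acknowledged_seq=2)
print(log)   # ...so only change C must remain in the local log: [(3, 'change C')]
```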

From there, you're left without much to manage. No log backups, no full backups, no differential backups—just the ability to restore, from an image, any bit of Exchange you need to, at any time. As you'll see in the next few sections, though, Backup 2.0 can enable some pretty impressive new management scenarios.

What About Performance?

Doesn't all this Backup 2.0 magic place a serious burden on your Exchange Server computers? In my experience, no. The majority of the overhead comes from de‐duplicating, compressing, indexing, and saving data—work that is usually offloaded to a centralized "backup server." All the Exchange Server‐based agent has to do is transmit disk‐block deltas over the network; you might see low single‐digit increases in things like processor utilization, but that's it. Backup 1.0 solutions tend to hit Exchange Server harder because they're not grabbing data in small chunks all day; they're trying to cram all their backup activity into a small evening window.

Exchange‐Specific Concerns

So how will Backup 2.0 work with Exchange? I've already pointed out how specialized Exchange is in the way it works; will Backup 2.0 be able to work with it and still provide a better backup solution than Backup 1.0 does? Much of Exchange's functionality and architecture are specifically designed to accommodate and work within a Backup 1.0 world—will turning that world into Backup 2.0 break everything?

CCR

A Backup 2.0 solution needs to be CCR‐aware. After all, CCR is still a valuable high‐availability tactic, giving you near‐instantaneous failover in the event of a complete server failure. CCR even supports geographically‐dispersed clustering. So a backup solution needs to understand CCR and work to truncate the active CCR node's log based on the passive CCR node's replication of transactions.

Individual Item Recovery

Recovering mailboxes or individual messages to a PST file may be useful—but you shouldn't be stuck with that as your only option. Honestly, giving a user a PST file and telling them to drag and drop messages in Outlook is insanely primitive. A Backup 2.0 solution should eliminate that overhead and let you restore directly to a live Exchange Server computer.

Further, a Backup 2.0 solution should be able to recover individual messages from the same backup that would be used for disaster recovery. In other words, you shouldn't have to extract individual messages out of Exchange—you should be able to recover from the backup image of the actual Exchange database. How?

Practically speaking, it would probably be a two‐step process. Assume you have an image‐level backup of an Exchange Server computer. That means you've got every block of data from disk, so you can reconstruct the entire server. With the right tool set, you would be able to "mount" that backup as a file system—in effect browsing the backed‐up file system from a specific point in time as if it were a network drive. But you're just looking at the backed‐up files—Exchange isn't running, and its database files aren't opened and in use; you're looking at a static, point‐in‐time copy of those files. From there, the right tool would let you mount and browse the Exchange database—giving you access to individual mailboxes, messages, and other data. You wouldn't need to do the usual Exchange escapades of restoring the database file from backup, mounting the database to a Recovery Storage Group, and running ExMerge or other utilities against the mounted database. Instead, you'd just dive into the database using a utility that understands the database structure, and get what you need. It might look something like Figure 4.7.

Figure 4.7: Restoring messages from a mounted backup image.

De‐Duplication

Exchange's single‐instance storage allows each message to be stored only once in the message store—it's a form of data de‐duplication. But it doesn't help when a message is sent to users whose mailboxes are on different servers, or even users whose mailboxes are in different stores on the same server. In those cases, the message will be duplicated at least once per store.

But Backup 2.0 can do a better job of de‐duplicating data—at least data from a single server—because it's examining disk blocks. The same message stored on disk looks the same, even though that message might have to be duplicated across multiple mail databases. So the message will be duplicated in Exchange, but once it's backed up by the Backup 2.0 solution, that solution can detect the duplicate disk blocks and store only a single instance—meaning the backup can be smaller than the original database. Add in compression, something even Backup 1.0 solutions offer, and the backup can be many times smaller than the original data.

Data Corruption

Because data is being backed up on a block‐by‐block basis, it's easier for a Backup 2.0 solution to detect and correct errors. A single block of disk data might be as small as 4 kilobytes—detecting a transmission error in that small a piece of data is easy.
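A quick sketch shows how cheap per‐block error detection can be: attach a digest to each block before it leaves the protected server, and recompute it on arrival. The choice of SHA‐256 here is just an example; any strong checksum serves the purpose.

```python
import hashlib

BLOCK_SIZE = 4096   # the 4-kilobyte block size mentioned above

def package(block):
    """Sender side: ship the block along with a digest of its contents."""
    return block, hashlib.sha256(block).hexdigest()

def verify(block, digest):
    """Receiver side: recompute the digest; a mismatch means the block was
    corrupted in transit and should be re-requested rather than stored."""
    return hashlib.sha256(block).hexdigest() == digest

block, digest = package(b"\x00" * BLOCK_SIZE)
print(verify(block, digest))                    # True: the block arrived intact
print(verify(block[:-1] + b"\x01", digest))     # False: a single changed byte is caught
```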

Search and e‐Discovery

Search and e‐Discovery are rapidly becoming key components for many organizations. The US Federal Court System, for example, has imposed strict rules that require pretty rapid responses to e‐Discovery requests during court proceedings; failure to meet these requirements can lead to fines and even summary judgments. Knowing that the Exchange database doesn't provide solid search capabilities natively, many companies rely on dedicated message archival and retrieval tools—an additional expense and yet another mass of storage resources to manage. A Backup 2.0 solution, however, can provide solid search and e‐Discovery capabilities built right in.

Consider the ability to mount a point‐in‐time backup image as a browse‐able file system, as I've described earlier. Also consider the ability to browse an offline Exchange database from that file system. Given those capabilities—which a Backup 2.0 solution might well offer as a means of performing single‐message recovery—you could easily implement a powerful message‐search function that makes message searching and e‐Discovery possible. Figure 4.8 shows what it might look like.

Figure 4.8: Searching for messages in a backup image.

Importantly, this tool would need to be able to attach to multiple Exchange databases to search; you can never tell ahead of time which database will contain the messages you're after, and you don't want to have to search each one individually.

In an e‐Discovery scenario, you'll typically want to restore messages not to a live Exchange Server computer but rather to a PST—which will often be delivered to legal counsel. Figure 4.9 shows how a Backup 2.0 toolset might implement that export.

Figure 4.9: Exporting search results to a PST file.

It's important that you not over‐build your expectations for a Backup 2.0 solution, though. Message search and recovery is only one aspect of e‐Discovery; many organizations that routinely deal with legal summonses prefer to tag and categorize messages as they're sent. This makes it easier to locate messages on demand (and helps categorize messages for security and monitoring purposes, too), but it obviously goes well beyond what you'd expect from a backup solution. So there will still be a market for dedicated e‐Discovery solutions, especially for very large companies that routinely have to perform e‐Discovery tasks.

Coming Up Next…

If you thought Exchange Server had some specialized needs, wait until I get into SQL Server in the next chapter. Microsoft's relational database management system is one of the few Microsoft products that has had a well‐understood backup and restore system for many years—but once again, I'll try to turn Backup 1.0 on its head and show you where our long‐used routines and techniques just don't meet modern business needs.

SQL Server Backups

More and more companies are using Microsoft SQL Server these days—and in many cases, they don't even realize it. While plenty of organizations deliberately install SQL Server, many businesses find themselves using SQL Server as a side effect, because SQL Server is the data store for some line‐of‐business application, technology solution, and so on. In fact, "SQL sprawl" makes SQL Server one of the most challenging server products from a backup perspective: Not only is SQL Server challenging in and of itself, but you wind up with tons of instances!

Here's what I see happening in many organizations: The company has one or more "official" SQL Server installations, and the IT team is aware of the need to back up these instances on a regular basis. But there are also numerous "stealth" installations of SQL Server, often running on the "Express" edition of SQL Server, that the IT team is unaware of. The data stored in these "stealth" installations is no less mission critical than the data in the "official" installations, but in many cases, that data isn't being protected properly. Dealing with this "sprawl" is just one of the unique challenges that Backup 2.0 faces in SQL Server.

Native Solutions

SQL Server has always offered a native application programming interface (API) for backing up databases. In fact, SQL Server has long been one of the few Microsoft server applications that natively supports tape backup, without using Windows' own backup utility. The native backup toolset is actually quite robust, supporting features like compression (highlighted in Figure 5.1), encryption, and so forth.

Figure 5.1: SQL Server's native backup interface.

To understand SQL Server's native backup technology, you need to first know a bit about how SQL Server works under the hood.

How SQL Server Works

SQL Server stores things on disk in 8KB chunks called pages. It also manipulates those same 8KB chunks in memory, meaning the smallest unit of data SQL Server works with is 8KB.

When data is written to disk, an entire row of data must fit within that 8KB page. It's possible for multiple rows to share a page, but a row cannot span multiple pages. So, if a Customers table has columns for Name, Address, City, State, and Phone, then all that data combined must be less than 8KB. An exception is made for certain data types—such as binary data like photos, or large gobs of text—where the actual page only contains a pointer to the real data. The real data can then be spread across multiple pages, or even stored in a file. SQL Server gathers all these 8KB pages into a simple file on disk, which usually has either an .MDF or an .NDF filename extension.
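To put rough numbers on that constraint, here is a tiny sketch that checks whether a row's columns fit on a single page. The 8,060‐byte figure is SQL Server's documented maximum in‐row size for a data row (the rest of the 8,192‐byte page goes to headers and row bookkeeping); the Customers column sizes are made‐up examples.

```python
MAX_IN_ROW_BYTES = 8060   # documented per-row limit on SQL Server's 8KB (8,192-byte) page

def row_fits(column_sizes_in_bytes):
    """Return True if a row whose columns total the given sizes can be stored
    in-row on a single data page (ignoring small per-row overhead)."""
    return sum(column_sizes_in_bytes) <= MAX_IN_ROW_BYTES

# Hypothetical Customers table: Name, Address, City, State, Phone
print(row_fits([200, 400, 100, 2, 20]))   # True: easily fits on one page
print(row_fits([5000, 4000]))             # False: too wide; large values would have to be
                                          # stored off-row, as with big text or binary data
```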

When SQL Server is told to do something, it's by means of a query, written in the Structured Query Language (SQL) syntax. In the case of a "modification" query, SQL Server modifies the pages of data in memory. But it doesn't write those modifications back out to disk yet, as there might be additional changes coming along for those pages and the system load might not offer a good disk‐writing opportunity right then. What SQL Server does do, however, is make a copy of the modification query in a special log file called the transaction log. This file, which has an .LDF filename extension, keeps a record of every transaction SQL Server has executed.

Eventually—maybe a few seconds later—SQL Server will decide to write the modified pages out to disk. When it does so, it goes back to the transaction log and "checks off" the transaction that made the modifications—essentially saying, "Okay, I made that change and it's been written to disk." That way SQL Server knows that the change is safe on disk.

In the event that SQL Server crashes, it has an automated recovery mode that kicks in when it starts back up. It goes straight to the transaction log and looks for uncommitted transactions—those that have not yet been "checked off." It knows that the "checked off" transactions are safe on disk; anything else had not been written to disk and was still floating around in memory when the server crashed. So SQL Server reads those transactions out of the log, re‐executes them, and immediately writes the affected pages to disk. This process allows SQL Server to "catch up" with any in‐progress work, and ensures that you never lose any data—provided your disk files are okay, of course.

Think about this important fact: EVERYTHING that happens in SQL Server happens only through the transaction log, and SQL Server can re‐read the log to repeat whatever has happened. That write‐ahead pattern is the foundation for nearly everything else SQL Server does, including its backup capabilities and the high‐availability features covered later in this chapter.

How SQL Server Native Backup Works

SQL Server's native backup system works in conjunction with the transaction log. Essentially, there are two types of backup SQL Server can make: data backups and log backups. Data backups are, as you might suspect, copies of the database itself. These are done in a Backup 1.0‐style manner, grabbing a snapshot of the data as it sits during the backup. Log backups grab the contents of the transaction log.

SQL Server's native backup capabilities include the ability to back up a database while it's in use, although database performance can slow slightly while a backup operation is underway. The ability to back up an in‐use database means that SQL Server is less impacted by "backup windows" than many other server products, and it means that you're a bit less tied to the Backup 1.0‐model of only grabbing backups while the data isn't being used. But that doesn't mean SQL Server is entirely free of backup problems and challenges.

Problems and Challenges

There are a few distinct challenges presented by traditional SQL Server backup techniques:

  • Sprawl. As I mentioned earlier, most organizations have a lot more SQL Server installations than they realize, and backing them all up can be painful. In some cases, particularly with the "Express" editions often embedded into line‐of‐business applications and IT tools, SQL Server is running on a client computer that isn't being treated like a server in terms of backup and recovery.
  • Snapshots. Just like any Backup 1.0 scenario, SQL Server backups are built around the idea of point‐in‐time snapshots. As I'll describe in a bit, SQL Server does offer some unique abilities that let you take more snapshots more frequently, but you'll always have a certain amount of data at risk.
  • Recovery times. Although SQL Server can be pretty flexible in how it makes you do backups, restoring is still a time‐consuming operation. So time consuming, in fact, that some companies have created tools that can "attach" a database backup to SQL Server, allowing the backup data to be queried without actually having to restore the database. This trick is useful for things like change control, but it doesn't help from a backup and recovery perspective simply because the attached backup is read‐only.
  • Transaction logs. In SQL Server, backups are intimately tied to the transaction log, and backups are required in order to keep the transaction log from growing larger and larger. Any backup plan that doesn't use the native APIs needs to deal with this fact.

Any proposed backup solution that does not use SQL Server's native APIs will be challenging. In fact, most third‐party backup solutions are simply agents that sit on top of SQL Server's native APIs! This setup ensures that SQL Server's internal needs—like the transaction log—are taken care of, but it also has historically limited third‐party solutions to the same basic feature set as SQL Server's native capabilities. Most third‐party SQL Server backup solutions are really little more than an agent that takes data from SQL Server's native APIs, and transmits that data across the network.

In the Old Days

So how has SQL Server traditionally been included in a backup and recovery plan? Let's consider some of the techniques, scenarios, and tools that are common in the Backup 1.0 world.

Backup Techniques

SQL Server natively offers three types of backup. I know I said two earlier, but hear me out:

  • Full Backup. This is a complete backup of the entire database. Despite a common misconception, a full backup does not truncate the transaction log; it simply copies all the data, plus just enough of the log to make the backup consistent.
  • Differential Backup. This is also a backup of the database, but only the data that has changed since the last full backup is included. Like a full backup, it leaves the transaction log alone.
  • Transaction Log Backup. This doesn't grab any of the actual data; it simply grabs the current contents of the transaction log—and then truncates that log. This truncation is what keeps transaction logs from growing forever.

So two kinds of data backup and a log backup. Although SQL Server can back up an active database, it's not something you'd do during peak database usage due to performance concerns, so full and even differential backups are still usually done during off‐peak periods or during an evening or weekend maintenance window. Because it can be difficult to get a nightly full backup of large databases in that window, administrators typically resort to a tiered backup plan—grabbing full backups on the weekends, for example, and differentials each evening. To help reduce the amount of at‐risk data, transaction log backups can be made periodically throughout the day. These backups are very fast and offer little performance impact, so a practical backup plan might look something like the one in Figure 5.2.

Figure 5.2: Typical SQL Server backup plan.

With this plan, the maximum amount of at‐risk data is about an hour, as that's the interval between transaction log backups. Of course, in a busy database, an hour can be a lot of data!

Reviewing our manifesto for Backup 2.0:

Backups should prevent us from losing any data or losing any work, and ensure that we always have access to our data with as little downtime as possible.

An hour of at‐risk data certainly doesn't prevent us from "losing any data or losing any work." In addition, the restore scenario associated with this kind of backup plan is, as you shall see, hardly conducive to "as little downtime as possible."

Restore Scenarios

SQL Server recovery can be a time‐consuming thing. Essentially, you have to start with your most recent full database backup, then add on the most recent differential and every transaction log backup made since then.

In fact, you have to be very specific about what you're doing when you conduct a restore—an aspect of SQL Server that I've frankly seen a lot of administrators mess up pretty badly. If you conduct a normal, full database restore, SQL Server will by default put the recovered database online as soon as it's done with the restore operation. If you still have a differential or some log backups to apply, you're out of luck; you have to start the restore over. The trick is to tell SQL Server, as you're restoring the full backup, that you have more files to restore (in T‐SQL terms, restoring WITH NORECOVERY). You continue telling it that until you restore the last transaction log backup, at which time you tell SQL Server that it's safe to start recovery. Then SQL Server will apply the differential, then the transaction log backups, and then your database will be ready to use. "As little downtime as possible" isn't very little, in most cases, and you'll still be missing any changes that occurred after the most recent transaction log backup.

SQL Server Recovery

For a large database, SQL Server's recovery time can be quite lengthy. Let's say you use the backup plan shown in Figure 5.2, and something goes wrong at 4pm on Friday afternoon. You'll have a full backup from the prior weekend, Thursday night's differential—which may be quite large, since it contains all the changes from the full backup up to Thursday night—and hourly transaction log backups.

Not only do you have to wait for all those files to stream off tape or wherever you store them, you have to wait for SQL Server to work through them. It has to apply the differential backup to the full backup, then it has to replay each individual transaction from every single transaction log—in essence, it has to re‐perform all the work that was done all day Friday. For a large, busy database, it may be a long time before the database is ready to use.

SQL Server doesn't natively support single‐object restores. What you can do is restore a backup to a different database, then manually copy any objects you want restored from that backup. This lets you recover single stored procedures, tables, or even rows of data—provided you know how to do so manually.

SQL Server does support point-in-time recovery, with the obvious caveat that it can't restore to a point in time later than your most recent transaction log backup. Point-in-time recovery works only from transaction log backups, because each transaction in the log is time-stamped. If you discard Thursday's transaction log backups after making Thursday night's differential, you lose the ability to recover to any point during Thursday; the next point you can recover to after Wednesday's last log backup is the moment of that Thursday night differential. This actually makes backup management tricky, because to preserve maximum point-in-time flexibility you have to keep a lot of files hanging around: full backups, every night's differential, every hour's transaction log, and so forth.
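
Natively, point-in-time recovery is exposed through the STOPAT option on RESTORE LOG. Assuming the full and differential backups have already been restored WITH NORECOVERY, the final step might look like the following sketch (the timestamp and file name are, of course, placeholders):

    -- Roll the log forward only to the moment just before the mistake,
    -- then recover the database
    RESTORE LOG Sales
        FROM DISK = N'\\backupsrv\sql\Sales_Log_Wed_1400.trn'
        WITH STOPAT = N'2009-10-14 13:55:00', RECOVERY;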

Consider this scenario: You're using the example backup plan from Figure 5.2, which entails a weekly full, nightly differential, and hourly transaction log backups. Let's say you keep 3 weeks' worth of backups, and Week 3 is the most recent set. It's Friday afternoon, and you realize someone deleted a critical stored procedure. You need to recover the database to the previous Wednesday (Week 2) afternoon; Figure 5.3 illustrates the files that you have on‐hand and which ones you'll have to restore.

Figure 5.3: Sample recovery plan.

So, that's:

  • The beginning‐of‐Week‐2 full backup
  • The Tuesday night differential from Week 2
  • The Wednesday transaction log backups up to the point Wednesday afternoon you want to recover to

That's six or so files to recover (the full, the differential, and a handful of Wednesday's log backups), and then you wait for SQL Server to sort it all out. In total, you'll be keeping something like 140 files lying around: roughly one full backup, six differentials, and 40 transaction log backups per week (assuming a log backup eight times a day, once every hour during the normal working day), times three weeks.

Disaster Recovery

SQL Server doesn't offer any kind of native disaster recovery capabilities. Essentially, if you lose an entire server, you'll have to recover Windows, install SQL Server, and then start restoring SQL Server backups to bring your databases as up to date as possible. Traditional third‐party imaging software isn't effective because it's difficult to image an active SQL Server installation, and because imaging doesn't always work well with SQL Server's native backup capabilities—meaning it can be tricky to restore an image and then also restore normal SQL Server backups to bring your databases more up to date.

In short, let's hope you don't lose an entire SQL Server.

In fact, whole‐server disaster recovery for SQL Server is so unsatisfying that Microsoft has made a considerable investment in SQL Server high‐availability features that try to reduce the need to ever do a whole‐server recovery. Some options include:

  • Transaction Log Shipping. The idea here is to start with two servers that have an identical copy of a database, then "ship" the transaction log backups from the active server to the "hot spare" server. The spare replays the transactions to bring its copy of the database up to date; the theory is that if the main server dies, the hot spare can be brought in to replace it.
  • Database Mirroring. Essentially the same idea as transaction log shipping, only the "hot spare" is kept more closely up to date and can take over automatically if the main server dies (a minimal setup sketch follows this list).
  • Clustering. Built on Windows' native failover clustering, this provides a fully redundant server that can take ownership of the same shared database files as the "main" server if it fails.

All of these options require additional SQL Server installations and hardware (or virtual servers), and they're all designed to handle a complete-failure scenario; none of them actually provides point-in-time recovery, so they're used in addition to normal backup techniques, not instead of them. That gets expensive quickly, especially for smaller and midsize companies that may not be able to afford this level of recoverability—at least in a Backup 1.0 world.
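
To give a sense of what the setup involves, here is a minimal sketch of establishing database mirroring. It assumes the mirroring endpoints already exist on both servers, skips the usual step of also restoring recent log backups to the mirror, and uses placeholder server names:

    -- On the mirror server: restore a copy of the database and leave it restoring
    RESTORE DATABASE Sales
        FROM DISK = N'\\backupsrv\sql\Sales_Full.bak'
        WITH NORECOVERY;

    -- On the mirror: point the database at the principal's mirroring endpoint
    ALTER DATABASE Sales
        SET PARTNER = N'TCP://principal.example.com:5022';

    -- On the principal: point at the mirror's endpoint to start mirroring
    ALTER DATABASE Sales
        SET PARTNER = N'TCP://mirror.example.com:5022';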

Backup Management

I touched on this earlier, but the short message is that SQL Server backup management can be pretty painful unless you're only worried about restoring the database to its most recent state. In that case, you keep the most recent full backup, the most recent differential, and all transaction log backups since that differential; that's still a lot of files to maintain, but it's far fewer than trying to keep a few weeks' worth.

I once had a job where we needed to be able to restore the database to any point in time for 3 months. You can imagine the number of files we had to maintain; I think it was close to 600 backup files, all floating around on different tapes, some of which had to be rotated offsite—it was a nightmare, and just describing it is giving me unpleasant flashbacks. In fact, it was at that exact point that I started to realize that the Backup 1.0 way of doing things was not very efficient—especially because managing that many files still left us at risk for an hour or more of data and work.

Rethinking Server Backups: A Wish List

So how can Backup 1.0 be improved from a SQL Server perspective? There's certainly plenty of room for improvement based on the traditional techniques and approaches I just discussed.

New and Better Techniques

The whole idea of being able to make transaction log backups to keep less data at risk is wonderful, but it is ultimately a kludge: a workaround for the snapshot-oriented approach of Backup 1.0. I've said it before and I'll say it again here: Backups should be continuous. It's not practical to continually make transaction log backups, and that's the best SQL Server can offer; that means we have to move outside the native APIs. That's scary, I know, because so much of SQL Server depends on folks using those native APIs. But stick with me.
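
The closest the native tooling gets to "continuous" is a SQL Server Agent job that fires a log backup every few minutes; something like the following sketch (placeholder database and path), which only narrows the at-risk window rather than eliminating it:

    -- Build a unique file name for this run, then back up the log to it
    DECLARE @file nvarchar(260);
    SET @file = N'\\backupsrv\sql\Sales_Log_'
        + REPLACE(REPLACE(CONVERT(nvarchar(19), GETDATE(), 120), N':', N''), N' ', N'_')
        + N'.trn';

    BACKUP LOG Sales TO DISK = @file WITH INIT, CHECKSUM;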

If we acknowledge that SQL Server's native APIs aren't going to give us frequent‐enough backups, we need to look at other ways of getting to the data. Going through SQL Server is not the answer because SQL Server doesn't have the bandwidth to feed us any kind of continuous data stream. Instead, we need to grab that data directly from the operating system (OS), as the data hits the disk. Keep in mind: As complicated as SQL Server is, ultimately it's all just bits on disk. There's no reason we couldn't have a Backup 2.0‐style agent sitting on the SQL Server computer, grabbing disk blocks as SQL Server writes changes to the disk.

Clever readers will have spotted a problem with this theory, based on my earlier explanation of how SQL Server works:

When SQL Server is told to do something, it's by means of a query, written in the Structured Query Language (SQL) syntax. In the case of a "modification" query, SQL Server modifies the pages of data in memory. But it doesn't write those modifications back out to disk yet, as there might be additional changes coming along for those pages and the system load might not offer a good disk‐writing opportunity right then.

Oops. If SQL Server doesn't write the data to disk quickly, then that data is at risk, because all we're grabbing are the changes that actually make it onto the disk. But the answer to this potential problem also lies in the very way that SQL Server works:

What SQL Server does do, however, is make a copy of the modification query in a special log file called the transaction log. This file, which has an .LDF filename extension, keeps a record of every transaction SQL Server has executed.

The transaction log itself is just a file on disk; Microsoft knows perfectly well that any data living only in memory is always at risk, so transaction log entries are written to disk immediately; that's the "write-ahead" part of SQL Server's write-ahead logging. All our agent would need to do is also grab the changes to the transaction log. Then, in a failure, we'd simply restore the database files, restore the transaction log, and let SQL Server's nature take its course.
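
Under that model, the "restore" is really just putting the data and log files back where they were and letting SQL Server run its normal crash recovery. A hedged sketch, with hypothetical file paths, of re-attaching the recovered files:

    -- Once the agent has streamed the .MDF and .LDF files back to disk, attach
    -- them; SQL Server replays committed transactions and rolls back incomplete
    -- ones from the log, just as it would after an unexpected shutdown
    CREATE DATABASE Sales
        ON (FILENAME = N'D:\Data\Sales.mdf'),
           (FILENAME = N'D:\Logs\Sales_log.ldf')
        FOR ATTACH;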

Better Restore Scenarios

The restore scenarios in a Backup 2.0 world would be vastly improved. For one, the concept of Backup 2.0 involves continuously streaming changed disk blocks to some central repository; that being the case, you'd simply select the exact point in time you wanted to restore to, then stream those disk blocks right back to where they came from. You might have to shut down SQL Server (or at least take the affected database offline) while you did that, but you might not; programmers can get pretty clever at manipulating SQL Server, and SQL Server itself is fairly open to that kind of manipulation.

Suddenly, no more worrying about full backups, differentials, and transaction logs. You don't care about the files per se; you only care about the disk blocks from the active database and log files. You're not making backups in the SQL Server sense of the term; you're actually just putting the computer's disk back to the condition it was at a certain point in time. SQL Server never actually enters its "recovery mode," because you've not restored any files in the SQL Server fashion. SQL Server simply resumes working with the database and log files just as it normally works with them.

This entire idea, which is at the heart of Backup 2.0, took me a while to really sort out in my mind. In the end, everything we know about backups is wrong, which is why I chose to use the term "Backup 2.0." This is an entirely different way of looking at things.

What about single-object recovery? That would still be tricky. Backup 2.0 will certainly let us restore a single database, rather than an entire server, if desired. But keeping track of which disk blocks within a database file belong to a particular stored procedure, for example—that would probably be impossible; it certainly sounds difficult. What a Backup 2.0 solution should do is let us quickly restore a database to a different location—after all, a database is just a bunch of disk blocks, and those blocks don't care where they wind up. From there, we could use SQL Server's native tools to script out a stored procedure and run that script against our production database, or simply copy database objects such as users from one database to another.
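
For example, assuming the backup has already been restored (or mounted) as a database named Sales_Restore, and using a hypothetical procedure name, scripting out a single stored procedure is straightforward:

    -- Pull the stored procedure's definition out of the restored copy...
    USE Sales_Restore;
    SELECT definition
    FROM   sys.sql_modules
    WHERE  object_id = OBJECT_ID(N'dbo.usp_GetOrders');

    -- ...then run the resulting CREATE PROCEDURE script against the
    -- production database to put the object back.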

Single SQL Objects: Tricky, Tricky

Part of the reason it's so tricky to recover a single SQL Server database object is that SQL Server stores objects—like stored procedures—as text definitions within a set of system tables inside the database. In other words, objects are externally indistinguishable from data; in most regards, objects are data.

A useful tool to have handy, then, is some kind of SQL Server comparison tool; use your favorite search engine to look for "sql server diff" and you should find several. These tools compare two databases—such as a restored copy and the in-production version—and show you the differences. In most cases, individual differences can be "forwarded" into the other database, which makes it easy both to spot exactly what differs between a restored database and its live counterpart and to "restore" specific objects from the restored database into the live one.

Some Backup 2.0 toolsets might even include such comparison utilities, and might even incorporate them into the recovery process for a more seamless experience when you're just looking for a single object out of a backup.

Better Disaster Recovery

There's no doubt that Backup 2.0 can offer a better disaster recovery option than more traditional techniques. Just consider Figure 5.4, which compares the two philosophies in a practical disaster recovery timeline.

Figure 5.4: Comparing disaster recovery techniques.

Backup 1.0 is on the left, where you're spending a ton of time manually recovering software and letting SQL Server deal with its backup files. Backup 2.0 is on the right, where you're simply pushing all the server's disk blocks back to the server's disk—recovering the OS, SQL Server, data files, and everything all at once, and to a specific point in time.

Now, I do realize that many third‐party backup solutions of the Backup 1.0 variety do make backups of the entire server, and that many of them offer bootable CDs or DVDs that can kick‐start a whole‐server recovery. That's great—except that it still leaves you "recovered" to an old snapshot. Making a backup of an entire server is even more time consuming than backing up a single large database; you're less likely to have an up‐to‐the‐minute backup, meaning more data is at risk and more time will be spent during a restore.

Another advantage of the Backup 2.0 technique—as I've pointed out in previous chapters— is that disk blocks really don't care where they live. Disk blocks could be restored to a different server, if the original one's hardware was irrecoverable. Disk blocks could be written to a virtual server, giving you a fantastic option for off‐site recovery in the event of a data center disaster, like a flood or loss of utility power.

Easier Management

Management, for me, is where Backup 2.0 really has a chance to shine. Having managed the 600-ish files involved in a previous company's SQL Server backup plan, I love the way that Backup 2.0 doesn't focus on specific point-in-time snapshots. That means no managing backup files. Instead, you manage a single backup repository, where all the backed-up disk blocks live. Rather than juggling files and tapes, you use a centralized management console—like the one shown in Figure 5.5, for example—to manage the entire repository, which might well handle backups for many, many servers. You select the server you need to restore, select the disk volume that contains the files you want to restore, and indicate the point in time you want to restore to. The repository figures out which disk blocks are involved in that recovery operation and streams them to the server you designate. Anything from a single file to an entire server can be recovered in the same fashion.

Figure 5.5: Examining the Backup 2.0 repository.

Easier management—letting the software juggle the backup data—is one of the real advantages that can be realized when we start rethinking what backups are all about.

SQL Server‐Specific Concerns

So how will Backup 2.0 help address some of the concerns that are unique to SQL Server? Obviously, it depends on the exact Backup 2.0‐style solution you're talking about, but there are certainly ways in which solution vendors could handle SQL Server issues.

Sprawl

Sprawl isn't a problem with Exchange Server or SharePoint; those applications live in the data center. SQL Server, however, spreads throughout the organization in desktop-level installations where users might not even realize that their data is contained in SQL Server. Even with client-level backup agents, this SQL Server data often goes unprotected; client-level agents are usually designed for simple file-and-folder backup and don't typically include a SQL Server-specific agent. The "hidden" SQL Server instances run continuously, just like any installation of SQL Server, thwarting simple file-and-folder backup schemes at the client level.

Backup 2.0 can help. Because the whole idea of Backup 2.0 is based on capturing disk blocks, it can wedge itself into the file system at a very low level, using built-in Windows hooks designed for exactly this sort of activity. Figure 5.6 illustrates where a Backup 2.0 agent can work its way into the system while adding only 1 to 2 percent overhead to the client system. That low overhead makes this type of agent perfectly suitable for workstations running "Express" or desktop instances of SQL Server.

Figure 5.6: How Backup 2.0 fits in.

Here's how I envision it working: A Backup 2.0 agent, written as a file system "shim," registers itself with the OS. When SQL Server saves data to disk—whether to a database file or to a transaction log—the shim is notified by the file system. As the file system writes the data to physical storage, the shim can read the newly‐written blocks, compress them, and send them across the network to a central repository.

It's far more efficient, especially for client computers, than snapshot backups. Workstations running an "Express" edition of SQL Server typically have fairly low SQL Server workloads, since SQL Server is really serving only a single application in use by one or a few users. Rather than laboriously backing up the entire system or every database file every so often, Backup 2.0 just streams the few disk blocks that have changed. In a "sprawl" environment, it's the perfect way to create consolidated SQL Server backups—for all that important data that's living in all your "stealth" SQL Server installations.

Log Truncation

If Backup 2.0 is just capturing disk blocks, when does the SQL Server transaction log get truncated? Well, in some instances you might think you could just stop relying on the transaction log. As Figure 5.7 shows, a SQL Server database can be configured to use the "Simple" recovery model. Unlike the "Full" model, which operates as I described earlier in this chapter, the "Simple" model basically truncates the log each time a checkpoint writes a transaction's changes out to the data files. In other words, the transaction log still exists, but it's not a recovery option because it never contains very many transactions—only those whose data pages haven't yet been written to disk.

Figure 5.7: Configuring a database for simple recovery model.

At first glance, it might seem like this would be perfect in a Backup 2.0 world. After all, you still get transaction log recovery for an unexpected server power outage, but the transaction log is self‐maintaining and doesn't need to be truncated. Your Backup 2.0 solution is grabbing disk blocks almost in real time, and shipping them off to a backup repository—so what good is the transaction log?

In some scenarios, I'd say "go with Simple recovery!" But in others, I still like to have the peace of mind that the transaction log offers—and so I'd look for a Backup 2.0 solution that had the ability to truncate the log just as SQL Server does when it makes a successful log backup. In other words, if the backup solution isn't using the native SQL Server APIs—which do truncate the log after a successful backup—then the backup solution should fully replace those APIs, including the log truncation capability.
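
Both behaviors are visible in plain T-SQL (the database name and path are placeholders):

    -- Switch a database to the Simple recovery model: the log now truncates
    -- itself at each checkpoint, but log backups (and therefore point-in-time
    -- recovery) are no longer possible
    ALTER DATABASE Sales SET RECOVERY SIMPLE;

    -- Under the Full recovery model, it's the log backup itself that truncates
    -- the inactive portion of the log
    BACKUP LOG Sales
        TO DISK = N'\\backupsrv\sql\Sales_Log.trn'
        WITH INIT, CHECKSUM;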

Single‐Object Recovery, Corruption, and Off‐Location Restores

Single‐object recovery, corruption, and off‐location restores might seem like three pretty random topics to throw together, but they're all potential issues that are solved by the same thing. As I've described earlier in this chapter, single‐object recovery in SQL Server pretty much always involves restoring the database to a different location, then copying objects from the restored database to the production database. That's just inherent in the way SQL Server works. The problem is that restoring a backup, as we've discussed, can take a lot of time. Spending hours restoring files just to grab a single accidentally‐deleted stored procedure or view is painful and unrewarding in the extreme.

There are many other reasons to restore a database to a different location, which is also called an off-location restore or alternate-location restore. One reason is to compare a backed-up database to the current, live database, using some kind of comparison tool. Another is to access data that was deliberately purged from the live database. Typically, the database is restored from backup to a different server, or to the same server under a different database name—both of which require "scratch space," or temporary space to hold the entire restored database for however long it's needed. Then, of course, there's also the time it takes to restore all the database files and let SQL Server process them.

My third issue is one of backup data corruption, which is an insidious thing we've all run into: "The backup tape is corrupted!" Although Backup 2.0 relies less (or not at all) on tapes, data corruption is still a real concern, and any decent backup solution will offer ways to detect corruption.

All three of these issues—off‐location restores, single‐object recovery, and data corruption—can be handled by a single feature we can add to our Backup 2.0 spec: I propose calling it backup mounting. Think of it like this: Our backup repository contains a bunch of disk blocks, each one time‐stamped to let us know when it was captured. There's really no reason we couldn't select a bunch of disk blocks from a given point in time, feed them to a specialized file system driver, and "mount" them like a normal disk volume. Figure 5.8 shows what I mean.

Figure 5.8: Mounting backed-up disk blocks as a disk volume.

A file system driver's job is to take some form of storage and make it look like a disk volume, with files and folders. Microsoft does basically the same thing in Windows 7, where the OS lets you mount a virtual hard disk (VHD) image as a disk volume. I'm simply proposing that Backup 2.0 include a special file system driver that lets Windows "see" selected portions of the backup repository as if they were a real, live, read-only disk drive.

Once that's accomplished, the backup solution can make SQL Server database and log files available without performing a restore. So in essentially zero time, your database files could "appear"—in read-only form, of course—and be attached to a live instance of SQL Server. This would enable the backup solution to conduct "attachability" tests, to determine whether the backed-up image could be operated as a full database; if it couldn't, corruption would be suspected. You could also attach backed-up databases on demand for comparison purposes or for single-object recovery—without ever having to restore anything. You stay more productive, you don't need "scratch space," and you can "attach" a version of the database from any point in time. Truly a remarkable set of capabilities from a fairly simple notion—which is really what Backup 2.0 is all about: simple, new notions that radically change the way we work, for the better.
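
If the mounted volume shows up as an ordinary read-only drive letter, attaching the exposed files is just a normal attach operation. The sketch below uses hypothetical paths and names; a real solution would also need to handle the case where the files still require recovery before they can be attached read-only:

    -- The repository's point-in-time view is mounted as drive B:
    -- Attach the exposed files to a live instance for an attachability test,
    -- a comparison, or single-object recovery, with no restore and no scratch space
    CREATE DATABASE Sales_AsOf_Wed_2pm
        ON (FILENAME = N'B:\Data\Sales.mdf'),
           (FILENAME = N'B:\Logs\Sales_log.ldf')
        FOR ATTACH;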

Coming Up Next…

The last major server product I have to cover is SharePoint, which offers its own unique challenges. In fact, SharePoint—which has rapidly grown in popularity in the past few years—may be the biggest challenge that Backup 2.0 has to face. As I have in this and the previous chapters, I'll look at native solutions, cover problems and challenges, and compare the "1.0" way of doing things with a more enlightened "2.0" approach.