[Beowulf] Surviving a double disk failure

Mon Apr 13 00:56:28 PDT 2009

On Friday 10 April 2009 23:15:54 David Mathog wrote:
> Billy Crook <billycrook at gmail.com> wrote:
> > As a very,
> > very, general rule, you might put no more than 8TB in a raid5, and no
> > more than 16TB in a raid6, including what's used for parity, and
> > assuming magnetic, enterprise/raid drives.  YMMV, Test all new drives,
> > keep good backups, etc...
>
> Thankfully I don't have to do this myself, not having data anywhere near
> that size to cope with, but it seems to me that backing up a nearly full
> 16TB RAID is likely to be a painful, expensive, exercise.
>
> Going with tape first...
>
> The fastest tape drives that I know of are Ultrium 4's at 120 MB/s.  In
> theory that could copy 1GB every 8.3 seconds, 1TB every 8300 seconds
> (AKA 138 minutes, or a bit over 2 hours), and for that 16 TB data set,
> something over 32 hours.  Except that there is no tape with that
> capacity, Max listed is still 800 GB, so it would take 20 tapes.  And
> really obtaining a sustained 120MB/s from the RAID to the tape is likely
> extremely challenging.  In any case, it looks like this calls for a tape
> robot of some sort, with many drives in it.  Not cheap.  On the plus
> side, transporting a box of 20 tape cartridges to "far away" is not
> particularly difficult, and they are fairly impervious to abuse during
> shipment.
>
> The other obvious option is to replicate the RAID.  Now if the duplicate
> RAID is on site, connected by a 1000baseT network, one could obtain a
> very similar transfer rate - and a full backup would take just as long
> as for the single tape drive (neglecting rewind and cartridge change
> times).  This at the expense of still losing all the data in some sort
> of sitewide disaster.  I can imagine implementing (and suspect somebody
> already has) a specialized disk->disk connect, such that one
> would plug Raid A into Raid B, and all N disks in A could copy
> themselves in parallel onto all N disks in B at full speed.  Assuming
> 1TB disks and a sustained 75MB/sec read from A and write to B, the whole
> copy would be done in about 222 minutes.  Not exactly the blink of an
> eye, but a heck of a lot better than 32 hours.   Placing the backup RAID
> physically offsite would improve the odds of the data surviving, but
> reduce the bandwidth available, and moving the copied RAID physically
> offsite after each backup is a recipe for short disk lives.
>
> Since all of the obvious options are so slow, I expect most sites are
> doing incremental backups.  Which is fine, until the day comes when one
> has to restore the entire data array from two years' worth of
> incremental backups.  Or maybe folks carry the tape incremental backups
> to the offsite backup RAID and apply them there?
>
> Is there an easier/faster/cheaper way to do all of this?
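
The arithmetic in the quoted message is easy to re-run for other sizes and rates. Below is a minimal Python sketch using only the figures quoted above (16TB array, 120 MB/s tape drive, 800 GB tapes, 1TB disks at 75 MB/s). It assumes decimal units, ignores rewind and cartridge-change time, and treats the gigabit link as running at its raw line rate with no protocol overhead. The helper names are made up for illustration.

```python
TB = 10**12  # decimal units, as drive and tape vendors count them
GB = 10**9
MB = 10**6


def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def tapes_needed(size_bytes, tape_capacity_bytes):
    """Number of cartridges, rounded up to a whole tape."""
    return -(-size_bytes // tape_capacity_bytes)


# Single Ultrium 4 drive streaming at its rated 120 MB/s.
tape_hours = transfer_hours(16 * TB, 120 * MB)
tape_count = tapes_needed(16 * TB, 800 * GB)

# Replica over 1000BASE-T, assuming the full 125 MB/s line rate.
gige_hours = transfer_hours(16 * TB, 125 * MB)

# N disks copied in parallel: the time is that of one 1TB disk at 75 MB/s.
disk_minutes = transfer_hours(1 * TB, 75 * MB) * 60

print(f"tape:      {tape_hours:.1f} hours on {tape_count} tapes")
print(f"gigabit:   {gige_hours:.1f} hours")
print(f"disk-disk: {disk_minutes:.0f} minutes")
```

With these inputs the single tape drive comes out at roughly 37 hours on 20 tapes, the gigabit copy at a little under 36 hours, and the parallel disk-to-disk copy at about 222 minutes.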

I had a client for whom we set up 2 servers in 2 different physical locations,
with a good interconnect between them (1Gbit/s).

Both servers had an identical hardware setup (RAID5 with 8x 1TB disks, 1 hot
spare, and 2 NICs: one dedicated to backup and one for system usage).

What I did was set up a DRBD device between the two machines, so that if there
is a power outage or a disaster at the first location, they have another server
20km away serving their data (this includes MySQL, PostgreSQL and files).

This setup serves as both backup (DR) and failover.
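
For a sense of scale, the first full synchronisation of such a device over the 1Gbit/s link can be estimated in a few lines of Python. This is only a rough sketch: the 7TB usable size assumes all 8 disks are RAID5 members (it is not stated above whether the hot spare is one of the 8), and the 0.8 efficiency factor is an illustrative guess, not a measurement from this setup.

```python
TB = 10**12  # decimal terabyte


def full_sync_hours(usable_bytes, link_bits_per_s, efficiency=1.0):
    """Hours to push usable_bytes over a link.

    efficiency is the fraction of the line rate actually achieved.
    """
    bytes_per_s = link_bits_per_s / 8 * efficiency
    return usable_bytes / bytes_per_s / 3600


usable = 7 * TB   # assumption: 8-disk RAID5 of 1TB drives, one disk's worth of parity
link = 10**9      # the 1Gbit/s interconnect

best_case = full_sync_hours(usable, link)                  # full line rate
derated = full_sync_hours(usable, link, efficiency=0.8)    # 0.8 is a guess

print(f"initial sync: {best_case:.1f} to {derated:.1f} hours")
```

Under those assumptions the initial sync takes somewhere between about 15.6 and 19.4 hours. After that, DRBD replicates writes as they happen, so the link only has to keep up with the rate of change rather than the size of the array.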

Regards
Marian Marinov
Head of System Operations at Siteground.com