[Beowulf] Big storage

Loic Tortay tortay at cc.in2p3.fr
Tue Sep 4 06:35:12 PDT 2007


According to Bruce Allen:
[...]
>
> In a system with 24 x 500 GB disks, I would like to have usable storage of
> 20 x 500 GB and use the remaining disks for redundancy. What do you
> recommend?  If I understand correctly I can't boot from ZFS so one or more
> of the remaining 4 disks might be needed for the OS.
>
This is (in my opinion) probably the only real issue with the X4500.
The system disk(s) must sit among the data disks (since there are "only"
48 disk slots) and the two bootable disks are on the same controller,
which effectively makes this controller a single point of failure (there
are easy ways to move the second system disk to another controller, but
you still need a working "first" controller to boot).

Using ZFS for "/" is not easily done yet (as far as I know it's only
available in OpenSolaris at the moment and it's not even available at
installation time), so you need to use SVM (Solaris Volume Manager) if
you want to mirror the system disk.
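
For what it's worth, here is a rough sketch of the usual SVM procedure
for mirroring the root disk, assuming the two bootable slots from the
layout below (c5t0d0 and c5t4d0) and a traditional slice layout (s0 for
"/", s7 for the state database replicas); the metadevice names (d10,
d11, d12) and the slices are only examples, adapt them to your own
partitioning:

   # State database replicas on both system disks
   metadb -a -f -c 3 c5t0d0s7
   metadb -a -c 3 c5t4d0s7
   # One-disk concatenations for each half of the root mirror
   metainit -f d11 1 1 c5t0d0s0
   metainit d12 1 1 c5t4d0s0
   # Mirror on top of the current root slice, update vfstab/system
   metainit d10 -m d11
   metaroot d10
   # After a reboot, attach the second half of the mirror
   metattach d10 d12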

The ZFS configurations we use minimize the impact of a single failing
controller (a failure which becomes more likely simply because there are
six of them).

That said, in our experience controller failures are rare on the X4500
(one failure in over a year across a few tens of X4500s).
The controllers are "simple" SATA controllers, so they are probably less
likely to fail than more advanced RAID controllers.

The most frequent failures are (obviously) disk failures (about 3 per week).


Below is an X4500 disk tray as seen from "above"; the columns are the
controllers (they are physically laid out that way), the rows are the
SCSI targets (for instance the "Sys1" cell, which is the first bootable
device, is -- in Solaris lingo -- c5t0 aka c5t0d0).

The "vX" marks in the cells indicate membership in a "vdev" (a ZFS
"virtual device", which can be a single disk or metadevice, an n-way
mirror or a "raidz" volume).  Here the vdevs are all raidz.

This is the default configuration provided by Sun, which is overall
pretty good (in terms of redundancy, reliability and performance):
                +-----------------------------------------------+
                |                  Controllers                  |
                +-----------------------------------------------+
                |   c5     c4      c7      c6      c1      c0   |
  +-------------+-----------------------------------------------+
    ^       7   |  v1   |  v1   |  v1   |  v1   |  v1   |  v1   |
    |    -------+-----------------------------------------------+
    |       6   |  v2   |  v2   |  v2   |  v2   |  v2   |  v2   |
    |    -------+-----------------------------------------------+
    |       5   |  v3   |  v3   |  v3   |  v3   |  v3   |  v3   |
    |    -------+-----------------------------------------------+
    D       4   |  Sys2 |  v4   |  v4   |  v4   |  v4   |  v4   |
    i    -------+-----------------------------------------------+
    s       3   |  v5   |  v5   |  v5   |  v5   |  v5   |  v5   |
    k    -------+-----------------------------------------------+
    s       2   |  v6   |  v6   |  v6   |  v6   |  v6   |  v6   |
    |    -------+-----------------------------------------------+
    |       1   |  v7   |  v7   |  v7   |  v7   |  v7   |  v7   |
    |    -------+-----------------------------------------------+
    |       0   |  Sys1 |  v8   |  v8   |  v8   |  v8   |  v8   |
  +-------------------------------------------------------------+
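
For illustration, this layout corresponds to a "zpool create" along
these lines (the pool name "tank" is just an example; since "v4" and
"v8" are one disk short because of the system disks, the '-f' flag
mentioned further down may be needed):

   zpool create tank \
       raidz c5t7d0 c4t7d0 c7t7d0 c6t7d0 c1t7d0 c0t7d0 \
       raidz c5t6d0 c4t6d0 c7t6d0 c6t6d0 c1t6d0 c0t6d0 \
       raidz c5t5d0 c4t5d0 c7t5d0 c6t5d0 c1t5d0 c0t5d0 \
       raidz c4t4d0 c7t4d0 c6t4d0 c1t4d0 c0t4d0 \
       raidz c5t3d0 c4t3d0 c7t3d0 c6t3d0 c1t3d0 c0t3d0 \
       raidz c5t2d0 c4t2d0 c7t2d0 c6t2d0 c1t2d0 c0t2d0 \
       raidz c5t1d0 c4t1d0 c7t1d0 c6t1d0 c1t1d0 c0t1d0 \
       raidz c4t0d0 c7t0d0 c6t0d0 c1t0d0 c0t0d0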

All our machines have 48 disks but we tested a beta version of the
X4500 with only 24 disks a bit more than one year ago.
Only the "lower" 24 disk slots were populated (disks on rows 0
to 3).

I'm not sure how the system handles system disk failure in such a case,
since the second bootable disk slot is empty, but you could use
something like this:
                +-----------------------------------------------+
                |                  Controllers                  |
                +-----------------------------------------------+
                |   c5     c4      c7      c6      c1      c0   |
  +-------------+-----------------------------------------------+
    ^       7   | empty | empty | empty | empty | empty | empty |
    |    -------+-----------------------------------------------+
		[...]
    |    -------+-----------------------------------------------+
    D       4   | empty | empty | empty | empty | empty | empty |
    i    -------+-----------------------------------------------+
    s       3   |  v1   |  v1   |  v1   |  v1   |  v1   |  v1   |
    k    -------+-----------------------------------------------+
    s       2   |  v2   |  v2   |  v2   |  v2   |  v2   |  v2   |
    |    -------+-----------------------------------------------+
    |       1   |  v3   |  v3   |  v3   |  v3   |  v3   |  v3   |
    |    -------+-----------------------------------------------+
    |       0   |  Sys1 | spare | spare | spare | spare |  Sys2 |
  +-------------------------------------------------------------+
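
Again for illustration, something like this would create it ("tank" is
still just an example name; the spares are declared at pool creation
time):

   zpool create tank \
       raidz c5t3d0 c4t3d0 c7t3d0 c6t3d0 c1t3d0 c0t3d0 \
       raidz c5t2d0 c4t2d0 c7t2d0 c6t2d0 c1t2d0 c0t2d0 \
       raidz c5t1d0 c4t1d0 c7t1d0 c6t1d0 c1t1d0 c0t1d0 \
       spare c4t0d0 c7t0d0 c6t0d0 c1t0d0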

That is only 15x500 GB of usable space, but with plenty of redundancy.

To match your usable space requirement, you can use two or three (three
would be better) of the spare disks in the vdevs, but this makes the
machine as a whole less resilient to controller failures.

You can also skip the second system disk altogether and use the last 5
disks on row 0 as a fourth vdev (ZFS allows vdevs of different sizes,
even though this requires a '-f' -- "force" -- flag to the "zpool
create" call), which would yield 19x500 GB of usable space, as sketched
below.

It's of course better to have vdevs of similar size but the available
space is not limited by the smallest vdev (unlike most RAID-5/RAID-6
implementations).
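
Roughly, the previous command would become (arbitrary pool name again;
the '-f' overrides the complaint about the vdevs not all having the
same size, as mentioned above):

   zpool create -f tank \
       raidz c5t3d0 c4t3d0 c7t3d0 c6t3d0 c1t3d0 c0t3d0 \
       raidz c5t2d0 c4t2d0 c7t2d0 c6t2d0 c1t2d0 c0t2d0 \
       raidz c5t1d0 c4t1d0 c7t1d0 c6t1d0 c1t1d0 c0t1d0 \
       raidz c4t0d0 c7t0d0 c6t0d0 c1t0d0 c0t0d0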

The size difference has a small (but visible) impact on performance but,
depending on your I/O workload, you can still get more throughput from
the disks than the four on-board Gigabit Ethernet interfaces can handle.

According to several Sun engineers, it is also highly recommended to
keep (raidz) vdevs small (fewer than 10 disks).

A more interesting configuration with 24 disks would be:
                +-----------------------------------------------+
                |                  Controllers                  |
                +-----------------------------------------------+
                |   c5     c4      c7      c6      c1      c0   |
  +-------------+-----------------------------------------------+
				[Empty slots]
    D    -------+-----------------------------------------------+
    i       3   |  v1   |  v1   |  v2   |  v2   |  v3   |  v3   |
    s    -------+-----------------------------------------------+
    k       2   |  v1   |  v1   |  v2   |  v2   |  v3   |  v3   |
    s    -------+-----------------------------------------------+
    |       1   |  v1   |  v1   |  v2   |  v2   |  v3   |  v3   |
    |    -------+-----------------------------------------------+
    |       0   |  Sys1 |  v1   |  v2   | spare |  v3   |  Sys2 |
  +-------------------------------------------------------------+

ZFS hot spares are global to a "pool", so the disk in "c6t0" can replace
any data disk.

This gives 4 redundancy disks (3 parity + 1 spare) and 3 identically
sized vdevs, with a usable space of 18x500 GB.
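
As a sketch (arbitrary pool name; each vdev spans a pair of controller
columns plus one extra disk from row 0, and the single hot spare is
shared by the whole pool):

   zpool create tank \
       raidz c5t3d0 c5t2d0 c5t1d0 c4t3d0 c4t2d0 c4t1d0 c4t0d0 \
       raidz c7t3d0 c7t2d0 c7t1d0 c7t0d0 c6t3d0 c6t2d0 c6t1d0 \
       raidz c1t3d0 c1t2d0 c1t1d0 c1t0d0 c0t3d0 c0t2d0 c0t1d0 \
       spare c6t0d0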

A bolder configuration would be, starting from the previous one, to use
the "spare" and "Sys2" disks in "v2" and "v3" respectively, to get
20x500 GB of usable space but at the expense of having no hot spare and
no redundancy for the system disk.


If I'm not mistaken, Sun now sells (again) X4500s with 250 GB disks.

In order to get 10 TB/server, it's probably better (in terms of
performance and data security) to have 48x250 GB, although I guess that
you plan to buy a "half-full" machine to be able to add 24 larger (and
cheaper) disks later.


There are several very interesting blogs from Sun engineers about ZFS
(linked from <http://blogs.sun.com/main/tags/zfs>).
For instance this entry deals with the balance between data security 
and available (raw) space using some hard disk drives reliability 
figures:
 <http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl>.


Loïc.
-- 
| Loïc Tortay <tortay at cc.in2p3.fr> -     IN2P3 Computing Centre     |


