[Beowulf] Big storage

Tue Sep 11 08:54:01 PDT 2007

Loic Tortay wrote:

> We specifically use a ZFS configuration that is resistant to a single
> controller failure.
> 
> With one controller failed (8 disks unavailable) no data is lost.
> Of course the machine is then very fragile since a single disk failure
> on another controller will lead to data loss.

Ahh...  Ok.  We can use 4 controllers, and set it up so that we can lose
one w/o loss of data (12 drives/controller as compared to 8), though you
will see a corresponding decrease in storage capacity.
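
A rough sketch of that capacity trade-off, assuming each RAID group takes
exactly one disk from every controller and carries single parity.  The
function name and the layout are illustrative assumptions, not the actual
configuration of either machine.  The 6-controller case is inferred from the
quoted 48 disks at 8 per controller.

```python
def usable_fraction(controllers: int, parity_per_group: int = 1) -> float:
    """Fraction of raw capacity left when every RAID group spans all
    controllers with one disk per controller (ASSUMED layout)."""
    if parity_per_group >= controllers:
        raise ValueError("need more controllers than parity disks per group")
    return (controllers - parity_per_group) / controllers


# Both layouts hold 48 disks; losing one controller removes one disk from
# every group, which single parity can absorb.
for controllers, disks_each in [(6, 8), (4, 12)]:
    frac = usable_fraction(controllers)
    print(f"{controllers} controllers x {disks_each} disks: "
          f"{frac:.1%} of raw capacity usable")
```

Under these assumptions, going from 6 controllers to 4 drops the usable share
from 5/6 to 3/4 of raw capacity.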

> I think Bruce's initial implied question was, "did you experience
> another hardware failure on that machine before the repair that
> ultimately led to data loss ?" The answer to that question is no.
> 
> My point regarding the two controllers in your machine, was that with
> two controllers you can't have a configuration resistant to a single
> controller failure unless you mirror the data (or add optional
> controllers).

See above.  We usually recommend RAIN for this anyway ... the cost of
internal / external replication is comparable to RAIN, and RAIN is more
resilient by design.  In a good RAIN design you isolate failures within
hierarchies.

> Replacing the mainboard in a X4500 is actually easier than replacing a
> PCI-e card.

???  That would take the unit offline.  PCIe cards can be hot-swapped.  Hot
swapping mainboards???

> You can change the "control module" without taking the machine out of
> its rack and there's no (internal) cable to unplug.

Ahhh....

> 
> But in this case I happen to be plain wrong.  As I've been told by one
> of my coworkers in charge of the X4500 operations, the SATA controllers
> of the X4500 are not on the mainboard but on the backplane.  Changing
> the backplane requires more work than changing a PCI-e card.

That's what I had thought.  It requires lifting the drives off of the
backplane, as I remember.

>>> The density of the X4500 is also slightly better (48 disks in 4U
>>> instead of 5U).
> Sorry, you're right.
> 
> I was referring to density in terms of disk slots per rack unit but
> forgot to mention it.
> 
> [...]
>>> As of today we have 112 X4500, 112U are almost 3 racks which is quite
>>> a lot due to our floor space constraints.
>> Ok, I am not trying to convert you.  You like your Sun boxen, and that
>> is great.
>>
>> I will do a little math.  BTW:  that's a fairly impressive floor size you
>> have there.  112U of x4500 or 112 x4500?
>>
> We have 112 X4500 in 14 racks.  That's almost 2.7 PBytes raw, 1.9
> PBytes usable space.

Wow...  color me impressed.  That's quite a bit of disk.
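
A quick sketch of the arithmetic behind those figures.  The machine, disk,
rack and usable-space numbers come from the thread; the 500 GB disk size is an
assumption (it is not stated above), chosen because it reproduces the quoted
raw capacity.

```python
MACHINES = 112            # from the thread
DISKS_PER_MACHINE = 48    # from the thread
UNITS_PER_MACHINE = 4     # 4U chassis, from the thread
RACKS = 14                # from the thread
USABLE_PB = 1.9           # from the thread
DISK_TB = 0.5             # ASSUMPTION: 500 GB disks, not stated in the thread


def fleet_summary():
    """Derive totals and densities from the figures above."""
    disks = MACHINES * DISKS_PER_MACHINE
    raw_pb = disks * DISK_TB / 1000          # decimal units: 1 PB = 1000 TB
    return {
        "disks": disks,
        "raw_pb": raw_pb,
        "usable_ratio": USABLE_PB / raw_pb,
        "rack_units": MACHINES * UNITS_PER_MACHINE,
        "machines_per_rack": MACHINES / RACKS,
        "slots_per_u_4u": DISKS_PER_MACHINE / 4,   # X4500 density
        "slots_per_u_5u": DISKS_PER_MACHINE / 5,   # same disks in 5U
    }


for key, value in fleet_summary().items():
    print(f"{key:>18}: {value:g}")
```

With that disk size the raw figure comes out at 2.688 PB, consistent with
"almost 2.7 PBytes", and 448U over 14 racks works out to 8 machines per rack.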

> According to Sun, we are the largest X4500 user in the world.
> We already were last year, since we had one machine more than the Tokyo
> Institute of Technology (featured as an "X4500 success story" on Sun's
> website).

Heh ... cool!

> 
> 
> [my benchmark is larger than yours :-)]

Quite possibly.

>> What I like are real application tests.  We don't see many (enough) of
>> them.  I think I have seen one customer benchmark over the last 6 years
>> that was both real (as in real operating code) and actually stressed an
>> IO system to any significant degree.
>>
> We stopped using IOzone for our tenders a few years ago and moved to a
> "model based I/O benchmark" simulating application I/O workloads.
> It's similar to "filebench" from Sun (but simpler) and is used to
> test more useful I/O workloads (for instance threads with different
> concurrent workloads and a few things that "filebench" does not, like
> accessing raw devices -- useful for disk procurements for our HSM or
> Oracle cluster).

:)

> My pointless result was of course mostly due to cache, with 4 threads
> each writing 1 GByte to 4 existing 2 GByte files (one file per
> thread).  The block size used was 128 kBytes, and all (random) accesses
> are block aligned.  The value is the average aggregated throughput of
> all threads for a 20-minute run.

I seem to remember being told in a matter-of-fact manner by someone some
time ago that only 2GB of IO mattered to them (which was entirely
cached BTW), so that's how they measured.  Caused me some head
scratching, but, well, ok.

My (large) concern with IOzone and related tools is that they spend most
of their time *in cache*.  It's funny: if you go look at the disks during
the smaller tests, the blinkenlights don't blinken all that often ...
(certainly not below 2GB or so).
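
For illustration, a minimal sketch of the kind of run described in the quoted
text: several threads, each overwriting random block-aligned offsets in its
own pre-existing file, with the aggregate throughput reported at the end.
This is not the benchmark used for the tender.  The names, the scaled-down
default sizes, and the single timed pass (instead of a 20-minute average) are
all assumptions, and it needs a Unix-like system for os.pwrite.

```python
import os
import random
import tempfile
import threading
import time

BLOCK = 128 * 1024  # 128 kByte blocks, as in the quoted run


def random_writer(path, file_size, bytes_to_write, totals, idx):
    """Overwrite random block-aligned offsets of an existing file."""
    buf = os.urandom(BLOCK)
    blocks_in_file = file_size // BLOCK
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < bytes_to_write:
            offset = random.randrange(blocks_in_file) * BLOCK
            written += os.pwrite(fd, buf, offset)
    finally:
        os.close(fd)
    totals[idx] = written


def run(threads=4, file_size=8 * 2**20, bytes_per_thread=4 * 2**20):
    """Return (total bytes written, elapsed seconds) for one pass."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        for i in range(threads):
            path = os.path.join(workdir, f"bench{i}.dat")
            with open(path, "wb") as f:
                f.write(b"\0" * file_size)  # files exist before the timed part
            paths.append(path)

        totals = [0] * threads
        workers = [
            threading.Thread(target=random_writer,
                             args=(p, file_size, bytes_per_thread, totals, i))
            for i, p in enumerate(paths)
        ]
        start = time.perf_counter()
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        elapsed = time.perf_counter() - start
    return sum(totals), elapsed


if __name__ == "__main__":
    total, secs = run()
    print(f"{total / 2**20:.0f} MiB in {secs:.3f} s "
          f"= {total / 2**20 / secs:.1f} MiB/s aggregate")
```

Note that nothing here calls fsync or bypasses the page cache, so at these
sizes the number it prints mostly reflects memory speed, which is exactly the
problem described above.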

Then again, maybe IOzone should be renamed "cache-zone" :)  More
seriously, I made some quick source changes to be able to run IOzone far
outside cache sizes (and main memory sizes) so I could see what impact
this has on the system.  It does have a noticeable impact, and I report
on it in the benchmark report.
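
As a rough illustration of "far outside cache sizes", here is a sketch that
picks a test file size as a multiple of physical RAM, rounded up to a whole
number of records.  The factor of 2 and the function names are assumptions
for the example, not the values or changes used for the report, and the RAM
query relies on sysconf names that may not exist on every platform.

```python
import os


def out_of_cache_size(ram_bytes, record_bytes=128 * 1024, factor=2):
    """Smallest multiple of record_bytes that is >= factor * ram_bytes."""
    target = ram_bytes * factor
    records = -(-target // record_bytes)   # ceiling division
    return records * record_bytes


def physical_ram_bytes():
    """Physical memory in bytes (ASSUMES SC_PHYS_PAGES is available)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    try:
        ram = physical_ram_bytes()
    except (ValueError, OSError):
        ram = 16 * 2**30   # fallback ASSUMPTION: 16 GiB
    size = out_of_cache_size(ram)
    print(f"RAM {ram / 2**30:.1f} GiB -> test file of {size / 2**30:.1f} GiB")
```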

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615