[Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo

Eugen Leitl eugen at leitl.org
Fri Sep 4 01:17:22 PDT 2009


http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html

Some perspective to this DIY storage server mentioned at Storagemojo

Thursday, September 3. 2009

I received some mails/tweets yesterday pointing to a "Thumper for the poor"
DIY chassis. They asked for my opinion on this piece of hardware and whether
it's competition for our X4500/X4540. These questions arose after Robin
Harris wrote his article "Cloud storage for $100 a terabyte", which referred
to the company Backblaze, which built a storage server on its own and
described it on their blog in the article "Petabytes on a budget: How to
build cheap cloud storage". Sorry that this article took so long, and there
may be a higher rate of typos than usual, as my sinusitis came back with a
vengeance ... right in the second week of my vacation. But now this rather
long article is ready :-)

First of all: No, it isn't a system comparable to an X4540 ... even leaving
aside the question of DIY versus a Tier-1 vendor. I have a rather long
opinion about it, but let me say one thing up front: I see several problems,
but I think it fits their needs, so it's an optimal design for them, and they
designed it to be optimal for them. I assume many of the problems are
addressed in the application logic. The nice thing about a custom build is
that you can build a system exactly for your needs. And the Backblaze system
is a system reduced to the minimum.

This device is that cheap because it cuts several corners. That's okay for
them. But for general-purpose use this creates problems. I want to share my
concerns just to show you that you can't compare this to an X4540.

And even more important: I have to disagree with the conclusions of the
Backblaze people. This isn't a good design, even when you just need cheap
storage, if you don't own a middleware layer that does in software a lot of
the work that ZFS, for example, would do in the filesystem. On the other hand
it supports my argument about the waning importance of RAID controllers: the
more intelligent your application is, the less intelligent your storage needs
to be.

So ... what are my objections to this DIY device:

    * The DIY Thumper has no power-distribution grid. So when one PSU fails,
all devices connected to that power supply fail with it. If PSU2 fails, the
system board is gone and thus the whole machine fails. Game over ... until
power comes back.

    * Connected to the last problem: given the disk layout, the power
distribution isn't right. They use RAID6, but RAID6 only protects you against
two failures. I don't see a sensible layout across three RAID6 groups that
would allow the system to lose 25 disks at once (see the failure-domain
sketch after this list). A more reasonable RAID level would be RAID10, but
even then you have 5 disks without a partner in the other PSU failure domain.

    * I don't know if I consider a foam sleeve around the disks and some
nylon screws enough vibration dampening, especially with that many hard disks
spinning in a single chassis. I'm looking forward to the next article they
announced, which is supposed to cover this topic. It will be even more
interesting to hear about the performance and the longevity of the disks in
such an environment over time. Just an example from the real world: we once
found out that disks close to a fan were a tad slower than the ones farther
away from it. This led to changes in the vibration handling of that system.

    * This baby cries out for ZFS. So much capacity, no battery-backed RAID
controller, and disks rated at only one unrecoverable error in 10^14 bits.
But I see the reason why this choice wasn't feasible for them: until a few
weeks ago, the OpenSolaris SATA framework had no support for port
multipliers. This was introduced with the putback of PSARC/2009/394 into
OpenSolaris. But now it's integrated. And given that this baby just speaks
HTTPS to the outside and the software relies on Tomcat, it should be a piece
of cake to move to OpenSolaris and ZFS now.

    * This design isn't really performance-oriented. As they use port
multipliers to attach their disks to cheap SATA PCIe/PCI controllers, one 3
GBit/s interface has to feed 5 disks. One ST31500341AS delivers roughly 120
MByte/s (I saw several benchmarks suggesting such a value). Five of them
deliver 600 MByte/s, which corresponds to roughly 6 GBit/s on the wire. So
each SATA channel is oversubscribed by a factor of two.

    * Even more important, three of the connections to the port multipliers
go through a standard PCI port. One conventional PCI 3.0 port (I didn't find
any information on what the board provides, so I assumed the fastest variant;
source is the German Wikipedia page about PCI) is capable of delivering
roughly 4 GBit/s (4.266 GBit/s, to be exact). So 18 GBit/s worth of hard
disks are connected to 4 GBit/s worth of connectivity.

    * I have similar objections to the PCIe connections of the SATA cards.
Those ports are PCIe 1x. One PCIe 1x port has a theoretical throughput of 250
MByte/s, so such a port would already be fully loaded by just two hard disks.
But this baby connects ten disks to a single lane of PCIe. The bandwidth
sketch after this list puts these three numbers side by side.

    * Of course those hard disks don't run at maximum speed all the time; I
assume the load pattern will be very random in the special use case of
Backblaze. But this leads to a high mechanical load on the disks and to some
additional objections. Based on the manual of the hard disk, I see several
problems here:

      o The ST31500341AS is a desktop disk, not even one of the nearline
disks like we use in the X4500/X4540. When you look in the disk manual, all
reliability calculations were done on the basis of 2400 hours of operation
per year, roughly a 27% duty cycle. But a year has 8760 hours.

      o The reliability considerations of Seagate assume a desktop usage
pattern, not a server usage pattern.

      o Seagate itself writes in the manual: "The AFR and MTBF will be
degraded if used in an enterprise application". But given the long credits
list at the end of their article, I assume they've read the manual and
considered this in their choice of hard disks.

      o There is another important point about the reliability of the disks:
the AFR and MTBF figures for the 7200.11 are valid for an ambient temperature
of 25 degrees Celsius. Running it above this temperature reduces the MTBF and
increases the AFR. Other hard disks built with enterprise usage in mind are
rated at a considerably higher nominal temperature.

    * But due to the use of RAID6, those disks will see a high throughput in
any case. RAID6 relies on a read/modify/write cycle for partial-stripe
updates, so you write vastly more than just the modified data to disk (the
failure-domain sketch after this list also shows the I/O cost of such a
write). This eats further into the already scarce throughput of the system.
We introduced RAIDZ, RAIDZ2 and RAIDZ3 to circumvent exactly this kind of
problem.

    * No battery backup for the caches, but RAID6 ... well ... "Warning ...
write holes ahead"

    * This system uses a desktop board, the DG43NB, so system resources are a
bit sparse: just one processor and just 4 GB of RAM. I find the latter
somewhat problematic. For general-purpose use a lot more memory would be
desirable; there are good reasons to have 32 GB or 64 GB in an X4540. Without
a large amount of cache you aren't able to shave off some of the IOPS load
and get back to a moderate load, so the choice of desktop disks gets even
more problematic here.
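
To make the bandwidth objections a bit more concrete, here is a small
back-of-the-envelope bandwidth sketch in Python. The per-disk throughput and
the link figures are the ones quoted in the list above (120 MByte/s per
ST31500341AS, one 3 GBit/s SATA port per 5-way multiplier, roughly 533
MByte/s for the fastest conventional PCI variant, 250 MByte/s for a single
PCIe 1.x lane); how the multipliers map to host ports follows the numbers
above, and the ratios are rough estimates, not measurements.

    # Back-of-the-envelope bandwidth estimate for the DIY chassis, using the
    # figures quoted in the objections above. Treat the ratios as rough
    # estimates, not measurements.

    DISK_MBYTE_S = 120   # rough sequential throughput of one ST31500341AS

    def serial_link_mbyte_s(gbit_line_rate):
        # SATA and PCIe 1.x use 8b/10b encoding: 10 line bits per data byte
        return gbit_line_rate * 1000 / 10

    links = [
        # (description, usable MByte/s, disks behind the link)
        ("SATA 3 GBit/s port, one 5-way multiplier", serial_link_mbyte_s(3.0), 5),
        ("Conventional PCI bus, three multipliers", 533, 15),  # 64 bit/66 MHz
        ("PCIe 1.x single lane, two multipliers", 250, 10),
    ]

    for name, link_mb, disks in links:
        demand = disks * DISK_MBYTE_S
        print(f"{name}: {disks} disks can stream ~{demand} MByte/s, "
              f"the link delivers ~{link_mb:.0f} MByte/s "
              f"-> oversubscribed ~{demand / link_mb:.1f}x")

Running it prints an oversubscription of roughly 2x per port multiplier,
3-4x for the PCI bus and almost 5x for a single PCIe 1x lane.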
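
The failure-domain sketch below illustrates the power-distribution and RAID6
objections. It assumes the layout Backblaze describes (45 drives in three
15-drive RAID6 groups) and the figure of roughly 25 drives hanging off one
PSU used above; how those drives are spread over the groups is my assumption,
while the tolerance of two lost drives per group is simply what RAID6 gives
you.

    # Failure-domain sketch: how many drives per RAID6 group disappear when
    # one PSU dies? Assumed layout: three 15-drive RAID6 groups and ~25
    # drives on the failing PSU, spread evenly -- the exact cabling is an
    # assumption.

    DRIVES_ON_FAILED_PSU = 25
    GROUPS = 3
    RAID6_TOLERANCE = 2      # a RAID6 group survives at most two missing drives

    lost_per_group = DRIVES_ON_FAILED_PSU / GROUPS
    print(f"~{lost_per_group:.1f} drives lost per RAID6 group, "
          f"tolerance is {RAID6_TOLERANCE} -> every group goes offline")

    # Small-write amplification: updating a single block in a RAID6 stripe
    # needs a read-modify-write of the old data block plus both parities.
    reads, writes = 3, 3     # read old data, P, Q; write new data, P, Q
    print(f"one logical block write costs {reads} reads + {writes} writes "
          f"= {reads + writes} disk I/Os")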

I think Robin Harris is correct with his comment that this system is a DC-3.
It flies, it can transport goods and passengers from A to B at a reasonable,
but not fast, speed ... but don't forget your parachutes ;-) It's the same
with this storage: the hardware needs its parachute in the form of the
software in front of the device.

But, and this is one of the key takeaways for you: even when other systems
are more expensive, they are not overpriced. First, don't compare the quoted
list prices with street prices for components. Second: of course you can save
a dollar here and there, but the Seagate disk costs you 100 Euro at a big
German computer online shop, while the HUA721010KLA330 (aka Hitachi Ultrastar
A7K1000 1TB) costs roughly 200 Euro after a quick Google search. Just using
other disks (in my opinion the correct ones for general-purpose use) would
double the price while offering less capacity. And even this price isn't
indicative, as there are most often special agreements between drive
manufacturers and system manufacturers regarding quality standards, quality
management and pricing conditions.

The technical differences of the UltraStar: 1 error in 10^15 bits instead of
1 in 10^14, qualified by the manufacturer for 24/7 operation, qualified for
an enterprise workload pattern (and even there only a lighter one), and 1.2
million hours MTBF rated at 40 degrees Celsius (AFAIK) instead of 0.7 million
hours at 25 degrees.
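
To put the "1 error in 10^14 bits" versus "1 error in 10^15 bits" numbers
into perspective, here is a rough estimate of the chance of reading a full
1.5 TB drive end to end without hitting an unrecoverable read error. It
treats the datasheet figure as an independent per-bit probability, which is a
simplification, but it is good enough to show the difference in class.

    import math

    # Chance of reading one full 1.5 TB drive without an unrecoverable read
    # error, treating the datasheet rate as an independent per-bit
    # probability (a simplification, but sufficient for a rough comparison).
    DRIVE_BITS = 1.5e12 * 8   # 1.5 TB in bits

    for name, ber in [("desktop class, 1 error in 10^14 bits", 1e-14),
                      ("nearline class, 1 error in 10^15 bits", 1e-15)]:
        p_clean = math.exp(-ber * DRIVE_BITS)   # ~ (1 - ber) ** DRIVE_BITS
        print(f"{name}: P(full read without error) ~ {p_clean:.2f}")

That is roughly a one-in-ten chance of hitting a read error per full pass
over the desktop drive, versus about one in a hundred for the nearline drive.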

Quality costs. Period. The same goes for using a desktop board in the
DIY "Thumper" instead of a custom-built board designed for optimal
performance (a SATA controller for each disk, or 8 PCIe lanes for 8 disks
instead of a single PCIe lane for 10 disks, for example). I'm pretty sure Sun
could build an equally priced system if you took the bare metal of the X4500
chassis and ripped out all the specialties of the X4500/X4540 systems. But a
system with so many corners cut wouldn't be a system you expect from Sun. And
yes, the X4540 has less capacity at the moment, but I think it's not
far-fetched that the X4540 will get 2 TB drives as soon as they reach the
same quality standards and qualification as the current drives, giving the
X4540 a capacity of 96 TB.

To close this article: it's about making decisions. Application and hardware
have to be seen as one. When your application is capable of overcoming the
limitations and problems of such ultra-cheap storage (and the Backblaze
software seems to have these capabilities), such a DIY box may be a good
solution for you. If you have to run normal applications without these
capabilities, a general-purpose system looks like a much better road to me.

Posted by Joerg Moellenkamp in English, Solaris, Sun, Technology, The IT
Business at 15:22




More information about the Beowulf mailing list