skylar.thompson at gmail.com
Sat Sep 14 17:47:57 PDT 2013
On 9/14/2013 3:52 PM, Andrew Holway wrote:
> Anyone using ZFS in production? Stories? challenges? Caveats?
> I've been spending a lot of time with zfs on freebsd and have found it
> thoroughly awesome.
We have a bunch of ZFS-based storage systems, all running Solaris, and
falling into two classes:
* Sun/Oracle hardware - We have a dozen or so X4540s. Most of these run
Solaris 10; one runs Solaris 11, installed before Oracle announced that
Solaris 11 was /not/ supported on the X4540. Older versions of Solaris 10
had some hardware integration problems and nasty ZFS and networking bugs,
but the latest patch cluster has solved all of those. Solaris 11 still
has hardware integration issues, though. Another issue we've had
is that Oracle is perpetually out of spare drives - the rumor is that
the Seagate drives Sun shipped in the X4540s have manufacturing defects
that shorten their service life considerably, and Oracle has struggled
to get other drives certified for the systems. We've easily lost 40
drives in our X4540s this year alone out of 500-600 total, all Seagate.
We've had to wait six weeks for 1TB SATA replacements, even on NBD
(next-business-day) contracts.
* Dell hardware - Before we rolled out our current consolidated storage, we
had a number of labs needing to buy bulk storage urgently. We ended up
buying Dell servers and drive trays, and running Solaris 11 with ZFS.
We've had some challenges, but for the price it definitely has worked
out. Until we updated to the latest Solaris 11 patch cluster, we had
some difficulty identifying failed drives. We've also had trouble with
networking drivers, and tracking down other hardware problems like
failing NICs causing system hangs. There definitely isn't as much
integration between Solaris and the hardware as with real Oracle
hardware. The good news is that the moment you say "Solaris" to Dell
support they just believe whatever you tell them, without having to run
additional diagnostics. This makes hardware repair much faster than on
Linux or Windows systems.
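
To put the X4540 drive losses mentioned above in perspective, here is a
rough back-of-the-envelope sketch. It uses only the figures quoted in this
message (about 40 failed drives out of 500-600 total, year to date); the
function and variable names are made up for illustration.

```python
def fraction_lost(failed: int, total: int) -> float:
    """Return the fraction of a drive fleet that has failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# Figures quoted above: roughly 40 failures out of 500-600 drives.
FAILED = 40
FLEET_LOW, FLEET_HIGH = 500, 600

# A larger fleet gives the lower bound on the loss fraction, and vice versa.
low = fraction_lost(FAILED, FLEET_HIGH)
high = fraction_lost(FAILED, FLEET_LOW)

print(f"Year-to-date drive loss: {low:.1%} to {high:.1%}")
```

Since the count only covers the year through mid-September, the full-year
figure would presumably be somewhat higher.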
We considered running FreeBSD on some of these systems, but the lack of
enterprise support made us somewhat leery (not that Oracle support is
all that great). If you're going with Solaris, definitely make sure to get
the latest patch cluster. In addition to the hardware-specific bugs, we also
ran into a ZFS bug that caused it to ignore media and transport errors
for drives even when the hardware and fmadm were reporting faults, and
another one that would cause scrubs to hang the system.
One thing I wish we had done was buy SSDs for at least some of these
systems, particularly the ones with lots of tiny files. ZFS metadata
overhead is pretty high, but separating out L2ARC/ZIL onto SSD would
have made performance much better. Live and learn, I guess...
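
For anyone wanting to try the SSD approach, here is a minimal sketch of
what adding a separate log (ZIL) device and a cache (L2ARC) device to an
existing pool looks like. It only builds and prints the "zpool add"
commands rather than running them. The pool name "tank" and the device
names are hypothetical placeholders, not anything from our systems.

```python
def zpool_add_commands(pool, log_devs, cache_devs):
    """Build 'zpool add' argument lists for log and cache devices.

    Nothing is executed here; the caller decides whether to run them.
    """
    cmds = []
    if log_devs:
        # A separate log device holds the ZIL (helps synchronous writes).
        cmds.append(["zpool", "add", pool, "log", *log_devs])
    if cache_devs:
        # A cache device acts as L2ARC (second-level read cache).
        cmds.append(["zpool", "add", pool, "cache", *cache_devs])
    return cmds


# Hypothetical pool and SSD device names, for illustration only.
for cmd in zpool_add_commands("tank", ["c1t0d0"], ["c1t1d0"]):
    print(" ".join(cmd))
```

Review the printed commands before running them by hand on a real system,
since adding devices changes the pool layout.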