[Beowulf] zfs tuning for HPC/cluster workloads?

Sun Jul 6 13:47:59 PDT 2008

Loic Tortay wrote:
> Joe Landman wrote:

[...]

> We have seen the same issue on (non Sun) high density storage servers 
> which performed correctly with RHEL5 & XFS but comparatively poorly with 
> Solaris 10 & ZFS.
> 
> ZFS seems to be extremely sensitive to the quality/behaviour of the 
> driver for the HBA or RAID/disk controller, especially with SATA disks 
> (for NCQ support).  Having a driver is not enough, a good one is required.
> 
> Another point is that ZFS requires a different configuration "mindset" than
> "ordinary" RAID.

Hmmmm....

Here is what I like.  Setting up a raid is painless.  Really painless.

Here is what I don't like.  I can't tune that raid.  Well, I can, by 
tearing it down and starting again.  I tried turning off checksum, 
compression, even zil.

The thing I wanted to do was to put the log onto another device, and 
following the man pages on this resulted in errors.  zpool would not 
have it.
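
For reference, the knobs mentioned above map onto command lines roughly like the ones this sketch prints.  It only builds strings and runs nothing.  The pool name "tank" and the device "c2t0d0" are hypothetical placeholders, not the configuration from this thread.  The ZIL is left out because, as far as I know, disabling it at the time was a kernel tunable rather than a dataset property.  Separate log devices also need a pool version that supports them, so an older release could reject the `zpool add` line; that is only a guess at what happened here.

```python
# Hypothetical names throughout: "tank" and "c2t0d0" are placeholders,
# not the pool or device discussed in the post.

def property_commands(dataset, properties):
    """Build one `zfs set` command line per property; nothing is executed."""
    return [
        f"zfs set {name}={value} {dataset}"
        for name, value in properties.items()
    ]


def add_log_command(pool, log_device):
    """Build the `zpool add` line that attaches a separate intent-log device."""
    return f"zpool add {pool} log {log_device}"


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for line in property_commands("tank", {"checksum": "off", "compression": "off"}):
        print(line)
    print(add_log_command("tank", "c2t0d0"))
```

Printing rather than executing keeps the sketch safe to run on any machine, with or without ZFS installed.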

> Have you noticed the "small vdev" advice on the Solaris Internals Wiki ?

Yeah, they mention 10 drives or less.  I tried it with two 8-drive 
vdevs, 1x 16-drive vdev, and a few other things.
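
To make the space side of the "small vdev" trade-off concrete, here is a minimal sketch comparing the two 16-drive layouts mentioned above.  It assumes single-parity raidz, which the thread does not actually state, and the pool and disk names are hypothetical.  It says nothing about speed; it only counts drives and builds the matching `zpool create` line without running it.

```python
def raidz_layout(total_drives, vdevs, parity=1):
    """Count data and parity drives when the drives are split evenly across raidz vdevs.

    parity=1 (single-parity raidz) is an assumption, not something from the thread.
    """
    if vdevs < 1 or total_drives % vdevs != 0:
        raise ValueError("drives must divide evenly across vdevs")
    width = total_drives // vdevs
    if width <= parity:
        raise ValueError("each vdev needs more drives than parity devices")
    data_drives = vdevs * (width - parity)
    return {
        "vdevs": vdevs,
        "width": width,
        "data_drives": data_drives,
        "parity_drives": total_drives - data_drives,
    }


def create_command(pool, disks, vdevs, kind="raidz"):
    """Build (not run) a `zpool create` line with the disks split into equal vdevs."""
    if vdevs < 1 or len(disks) % vdevs != 0:
        raise ValueError("disks must divide evenly across vdevs")
    width = len(disks) // vdevs
    parts = ["zpool", "create", pool]
    for start in range(0, len(disks), width):
        parts.append(kind)                      # each group starts a new vdev
        parts.extend(disks[start:start + width])
    return " ".join(parts)


if __name__ == "__main__":
    disks = [f"disk{i}" for i in range(16)]     # hypothetical device names
    for n in (1, 2):
        print(raidz_layout(16, n))
        print(create_command("tank", disks, n))
```

Under these assumptions, one 16-wide vdev leaves 15 drives' worth of data space and two 8-wide vdevs leave 14, so the smaller vdevs cost one extra drive of capacity.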

> This is probably the single most important hint for ZFS configuration.
> IOW, most of the time you can't just use the same underlying 
> configuration with ZFS as the one you (would) use with Linux.
> This means that you may need to trade usable space for performance,
> sometimes in more drastic ways than with ordinary RAID.

Tried a few methods.  Understand, we have a preference to show the 
fastest possible speed on our units.  So we want to figure out how to 
tune/tweak zfs for these systems.

> 
> Finally, like it or not, ZFS is often more happy/efficient when it does 
> the RAID itself (no "hardware" RAID controller or LVM involved).

The performance on pure zfs sw-only raid was significantly lower than 
that of the hardware RAID running solaris.  I tried several variations 
on this.  That and the crashing (driver related I believe) concern me.  
I would like to be able to get the performance that some imply I can 
get out of it.

I certainly would like to be able to tune it.

> Loïc.
> 
> PS: regarding your other message in this thread (and your blog), you 
> seem confused: the "open source" OS is OpenSolaris, not Solaris 10.

Hmmm .... we keep hearing that "Solaris is open source", with no 
distinction drawn between Sun Solaris and OpenSolaris.  Maybe 
it is marketing not being precise on this.  Ask your Sun sales rep if 
Solaris is open source, without specifying which one.  The answer will 
be "yes".  Ambiguity?  Yes.  On purpose?  I dunno.

> The benchmark publishing restriction only applies to Solaris 10 (see 
> <http://www.opensolaris.com/licensing/opensolaris_license/>).

Yup.  Will eventually try OpenSolaris on this gear.

> PPS: while I dislike Sun's policy, I specifically remember being told by 
> someone from a DOE lab (who did actually evaluate your product about 18 
> months ago) that you didn't want their unfavorable benchmark results to 
> be published.  You can't have it both ways.

Owie ... no one is having it "both ways" Loic.  Everything we are doing 
in testing is in the open, and we invite both comment and criticism ... 
like "Hey buddy, turn up read-ahead" or "luser, turn off compression."  
Our tests and results are open.  Others can run them, and report 
back results.  If they give me permission to publish them, I will.  If 
they publish them, I may critique them (we reserve the right to respond).

As a note also, you just dragged an external group into this discussion, 
and I am guessing that they really didn't want to be.  So I am going to 
tread carefully here.

We published a critique of the published "evaluation", pointing out its 
faults and analyzing them thoroughly.  We didn't deny them the right to 
publish their results.  What we got in return was a rather nasty 
email/blog post trail.  I still have it in my mail archives, and it is 
hidden in the blog archives.  I won't rehash it, other than to point out 
that some on this list would take issue with the results.

I removed my critique after they asked me to, with them promising in 
return to amend and address my criticisms.  As far as I can tell, they 
withdrew their report, and did not amend or address my criticisms.

More curious are the reports that the group responsible for this report 
has run away from their (formerly) preferred platform towards a BlueArc 
platform.  There was a nice quote to this effect (moving forward with 
BlueArc) from the principal author of the report last year in HPCWire, 
covering the role they had been considering the other unit (thumper) for.

This said, they were free to use the unit and publish benchmark results, 
which they did.  We criticized the benchmark they did for its flaws in 
analysis, in execution, and setup, as we were free to do.

Nobody is having it "both ways" Loic.  We reserve the right to respond, 
and we did.  We did not ask them to take down the report.  They did ask 
us to take our criticisms of their report down.

FWIW:  I will not divulge the group's name in public or private.  
I ask that anyone with knowledge of this group also keep their 
names/affiliation out of the discussion.  Loic dragged them in here, and 
I would like to accord them some measure of privacy, no matter whether I 
agree or disagree with them.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615