How bleeding edge are people with kernels (Was Re: [Beowulf] impressions of Super Micro IPMI management cards?)
stephen mulcahy
smulcahy at aplpi.com
Wed Nov 21 09:05:52 PST 2007
Brian Dobbins wrote:
> I had at one point a simple script that would allow me to select a
> kernel type at job submit time, it would load that up, reboot the nodes
> with that kernel, and then run my job. Sometimes this was incredibly
> useful, as I found a difference of roughly 20-25% performance on one
> particular code running on the same hardware, one with an /old/ 2.4
> series and libc, and another with a more modern kernel + libc. Even
> now, as we're looking at a larger system, I'll probably put (in a static
> fashion) one of the interactive nodes with a kernel supporting PAPI, and
> quite possibly will put most of the compute nodes on a kernel with some
> modifications for performance.
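A wrapper along those lines might look roughly like this on a PXE-booted
diskless setup. This is a sketch only, not Brian's actual script: the
paths, node names, and PXE naming scheme are my assumptions, and DRYRUN=1
(the default) just prints the commands so the flow can be followed without
real hardware:

```shell
#!/bin/sh
# Sketch of a kernel-switching job wrapper (assumptions throughout:
# pxelinux-booted diskless nodes, per-kernel PXE config files named
# default.<kernel>, passwordless ssh to the nodes).
set -u

# DRYRUN=1 only prints the commands it would run.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

KERNEL=${1:-2.6.18-stripped}     # kernel flavour chosen at submit time
TFTP=/srv/tftp/pxelinux.cfg      # assumed tftp root for diskless boot
NODES="node01 node02"

# 1. Point the PXE default at the requested kernel config.
run ln -sf "default.${KERNEL}" "${TFTP}/default"

# 2. Reboot the nodes; a real script would then poll until ssh
#    answers on every node before continuing.
for n in $NODES; do
    run ssh "$n" reboot
done

# 3. Finally, launch the job itself (placeholder command).
run mpirun -np 16 ./model
```

With DRYRUN=0 the same script would actually swap the symlink and reboot
the nodes, so the selection logic can be tested harmlessly first.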
Thanks for your response. We're running a diskless environment as well
(it's a pretty small cluster -- 20 nodes running a customised Debian).
Performance is certainly interesting to me -- but stability is starting
to become so also. We've squeezed a good bit out on the performance
front by tweaking various components in the system, including the MPI
libraries and so on. So much so that the scientists I'm running the
cluster for are largely happy with the performance (I suspect there
could be another 5-10% lurking in there, but getting it out would
probably involve a lot of my time and a lot of cluster downtime for
testing/profiling, so it feels like we're in the sweet spot at the
moment).
So we're happy with performance, and now we'd like to run our models for
weeks on end without any user intervention. As we start doing this,
we've seen some stability problems that have not been consistently
reproducible so far and have left no traces in the logs (I might send a
separate mail about these just to generally pick people's brains) -- the
key point here, though, is that I have no idea at the moment whether
these are kernel-level problems or hardware-level problems.
We're running Debian's stable 2.6.18-5-amd64 kernel (for the diskless
nodes, we're using the 2.6.18-5-amd64 kernel source, recompiled after
stripping out all unnecessary drivers). My concern about rolling to
2.6.22, or something in between, is that we might get some performance
benefits but we might also pick up more intermittent, weird stability
issues (the kind that may even be peculiar to our own hardware/software
environment). I was just wondering what other people's take is --
clearly a lot depends on your own level of risk aversion, how much time
you have for testing and supporting what you deploy, and so on. Thanks
to all who responded.
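(For anyone wondering what "stripping" means concretely: it's mostly a
matter of switching off whole classes of drivers in the kernel .config
before rebuilding. An illustrative fragment -- the option names here are
examples of the sort of thing a diskless compute node never needs, not
our exact config:)

```
CONFIG_LOCALVERSION="-hpc"
# Subsystems a diskless compute node will never use (examples only):
# CONFIG_SOUND is not set
# CONFIG_PCMCIA is not set
# CONFIG_ISDN is not set
```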
> In case anyone is interested, I'm planning on bugging the National
> Labs + Cray guys a bit more soon, and if they can't release or document
> what they change, I'll set up a wiki about kernel stripping / tuning for
> HPC workloads, and maybe the community can put together a decent
> 'how-to' until the big guys can chime in. If/when I find the time, I'll
> also try to get some information on how much this can impact performance
> on some modern code suites, but it might take a few weeks at least
> before I'm able to do so.
I'm not sure how much of the stuff that's relevant to tuning really big
clusters would percolate down to the likes of myself, but I would be
interested in taking a look at it anyway.
> Disclaimer to all of the above - I haven't done much system-level
> stuff in a long while now, so your mileage may vary considerably. :)
Oh, I understand that all suggestions on Beowulf include the standard
"but it depends" disclaimer :)
Thanks,
-stephen
--
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)