[Beowulf] Building new cluster - estimate

Bill Broadley bill at cse.ucdavis.edu
Tue Jul 29 23:42:19 PDT 2008

stephen mulcahy wrote:
> Bill Broadley wrote:
>> In general I'd say that the new kernels do much better on modern 
>> hardware than the ugly situation of downloading a random RPM, or 
>> waiting for official support.  Seems like quite a few companies (ati, 
>> 3ware, areca, intel, amd, and many others I'm sure) are trying hard to 
>> improve the mainline kernel drivers.
>> I understand why RHEL doesn't change the kernel (stability, testing, 
>> etc.), but not sure it's the best fit for HPC type applications, 
>> especially with the pace of hardware changes these days.
> Hi Bill,
> My take on recent (2.6.x) mainline kernels was that there isn't as clear 
> a distinction between production quality and developer quality kernels 

Yup, pretty much all the mainline kernel.org releases receive a fair bit of 
testing and, percentage-wise, change very little. Occasionally there's an 
exception, like what happened in, er, I think it was 2.6.10, when they changed 
either the MMU code or the scheduler.

> these days as there used to be in the previous even/odd 
> production/developer kernels. From scanning the kernel releases, it 
> looks like you'd want to stay a minor revision or two behind the 
> bleeding edge if you want some stability.

Sure, although I'm not sure you mean running 2.6.24 when 2.6.26 is out.  It 
seems pretty rare that any mainline kernel is outright unstable.  Even when 
one is, it's usually just a particular problem that affects a relatively small 
fraction of users... something I'd hope would be exposed by relatively simple 
testing.

With HPC-type use, if a kernel dies in production I'll revert.  Sure, I like 
to run reliable clusters, but I'm usually abandoning the CentOS kernel because 
of a major win, like more reliable RAID.

But sure, if you run a kernel.org kernel I'd recommend joining the kernel 
mailing list to see if people start screaming bloody murder.  I'd strongly 
recommend a mail reader that supports threads; it's basically impossible to 
read all of it otherwise.

> Has this been your experience or do you have extensive test facilities 
> before rolling out mainline kernels onto production systems?

Extensive test facilities... no, definitely not.  Enough to see that the 
CentOS kernels are completely broken on my hardware... often.  RAID 
corruption, dropped disks, horrible network performance, unsupported cards, 
poor memory performance, wrong defaults assumed for a CPU, missing PCI IDs, a 
driver disabled because someone somewhere on the planet made a broken 
motherboard, NUMA issues, CPU frequency issues, CPU temperature sensor issues, 
etc.

But across the 10 clusters I run, I usually make kernel decisions for the file 
servers and the compute nodes separately, and I have a workload that I use to 
decide whether a kernel is good enough to try in small production runs.  Not 
particularly comprehensive, but it definitely tests the stuff I use heavily.
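Something along these lines, to give the flavor (a minimal sketch, not my 
actual workload; the filenames and sizes here are token-small placeholders, 
scale them up for a real burn-in):

```shell
#!/bin/sh
# Hypothetical kernel smoke test: exercises disk I/O, memory streaming,
# and CPU detection -- the subsystems that have bitten me in the past.
set -e

TMPFILE=$(mktemp)
trap 'rm -f "$TMPFILE"' EXIT

# 1. Disk write/read round trip, with an fsync so data actually hits disk.
dd if=/dev/zero of="$TMPFILE" bs=1M count=16 conv=fsync 2>/dev/null
dd if="$TMPFILE" of=/dev/null bs=1M 2>/dev/null

# 2. Memory streaming through the kernel's zero page / null device.
dd if=/dev/zero of=/dev/null bs=1M count=256 2>/dev/null

# 3. Sanity-check CPU detection: every expected core should be online.
grep -c ^processor /proc/cpuinfo

echo "smoke test passed on $(uname -r)"
```

Run it a few thousand times across a rack before trusting the kernel with 
real jobs; a single clean pass proves very little.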

After all, I'm using something like less than 1% of the kernel: very few 
drivers, and my hardware is identical (at least within a cluster).
