[Beowulf] Building new cluster - estimate

Joe Landman landman at scalableinformatics.com
Wed Aug 6 19:01:17 PDT 2008

Eric Thibodeau wrote:

>> Advantage of modules is you can upgrade them without upgrading the 
>> kernel.  Go ahead, build in that e1000 driver.  I dare yah... :(
> Ok...I didn't put enough emphasis on "main" stuff....as in, _all you 
> need to get the system booted, which essentially means HDD chipset 
> drivers, the rest I do build as a module (NIC, video and such).
>> More to the point it does give some good flexibility for end users 
>> with a need to keep the core "separate" from the drivers for maintenance.
>> Initrd is subtle and quick to anger.  One must use burnt offerings to 
>> placate the spirits of initrd.
> LOL!

  ... now I don't mean hardware burnt offerings ... smoke rising from 
your motherboard may not placate the spirits of initrd, and it definitely 
may impede further operations ...

>> Well, it would be a heck of a lot nicer if the tools were a little 
>> more forgiving ... Oh you don't have this driver in your initrd ... ok 
>> ... PANIC (mwahahahaha)
> Pahahahahah... Case in point, I am building a CD-only cluster system 
> (based on Gentoo) and I am currently _NOT_ using initrd because all that 
> really needs to be built in is NFSroot support and all the NICs I care to 
> put in. Obviously this is a deprecated approach but it's proven to be the 
> most effective and easy to maintain in my case.

We build an integrated NFSroot and e1000 and a few other things for a 
customer.  Fixed hardware for their cluster.  From bare-metal-off to 
operational infiniband compute node in ~45-60 seconds (I say 45, but a 
few things took a little longer to start, like SGE).

>>> <rant>
>>> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but 
>>> that's work in progress...you could still build the cluster using 
>>> Gentoo...if you're performance savvy...and want things like OpenMP 
>>> capable compiler 
>> I have been hearing claims like this for a long time.  I have not seen 
>> any real tests that back these claims up.  Do you have any?
> I'm actually working on such benchmarks. Did you know that compiling 
> with the default ICC optimization will cause your bridge to crumble due 
> to floating point assumptions?...
> Ok, so my computations have diverged horribly, mostly because I am 
> computing 47 (vector size) * 5000 (K-Means clusters) * 6,787,955 (learning 
> dataset) * 5 (iterations to convergence) for a total of 7,975,847,125,000 
> FLOPs (or about 8 TeraFLOPs), and as part of an iterative learning 
> process, the error adds up. So performance is very sensitive to what your 
> intended goal is too ;)

Hmmm.... sounds like a fun computation.  Error definitely adds up. 
Renormalization is your friend (well, some times, assuming a linear system).
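
To make the "error adds up" point concrete, here is a minimal Python sketch. It is not Eric's K-Means code; the function names and the 0.1 test value are made up for illustration. It checks the operation count quoted above, then compares a plain running sum against math.fsum, which tracks the rounding error that a naive loop throws away. Note that this shows compensated summation, which is a different remedy from renormalization.

```python
import math

def operation_count(vector_size, clusters, samples, iterations):
    """Multiply out the four factors quoted in the thread."""
    return vector_size * clusters * samples * iterations

def naive_sum(values):
    """Plain left-to-right accumulation; each += rounds to a double."""
    total = 0.0
    for v in values:
        total += v
    return total

# Figures taken from the quoted message.
ops = operation_count(47, 5000, 6_787_955, 5)

# Hypothetical data: one million copies of 0.1, which has no exact
# binary representation, so every addition contributes rounding error.
values = [0.1] * 1_000_000
naive = naive_sum(values)
compensated = math.fsum(values)  # correctly rounded sum of the inputs

print("operations:", ops)
print("naive - compensated:", naive - compensated)
```

With only a million additions the drift is already visible in the low digits; a run with trillions of operations, feeding each iteration's output into the next, has far more room to wander.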

>>   Most of the arguments I have heard are "oh but its compiled with 
>> -O3" or whatever. Any decent HPC code person will tell you that that 
>> is most definitely not a guaranteed way to a faster system ...
> Hey...as I stated above, one would have to be quite silly to claim -O3 
> as the all well and all good optimization solution. At least you can 
> rest assured your solutions will add up correctly with GCC. To get a 

Well, sometimes.  You still need to be careful with it.

This said, I am not sure icc/pgi/... are uniformly better than gcc.  I 
did an admittedly tiny study of this http://scalability.org/?p=470 some 
time ago.  What I found was that gcc really held its own.  It did a very 
good job on a very simple test case.

Then again, the fortran version was simply faster than the C version, 
but that can be explained ... by ... er ... ah ... something.

> "faster" system, you really have to look at your app, use strace, ltrace 
> and gprof, then you can play with that. What I _am_ saying though is 
> that Gentoo _does_ empower the administrator by giving him the ability 
> to customize the OS if a bottleneck is to be identified.

Yup.  There is nothing like a profile of an app running the code, to see 
where it is spending its time to decide between code shifts and 
algorithmic shifts.
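
As a small illustration of "profile first, then decide", here is a sketch using Python's cProfile and pstats in place of the gprof/strace/ltrace tools named above, which target compiled binaries. The toy functions (slow_distance, assign, profile_call) are hypothetical stand-ins for a real application kernel, not anything from the codes discussed in this thread.

```python
import cProfile
import io
import pstats

def slow_distance(a, b):
    """Squared Euclidean distance, written the obvious way."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: slow_distance(p, centers[i]))
        for p in points
    ]

def profile_call(func, *args, top=10):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()

points = [(0.0, 0.0), (10.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
labels, report = profile_call(assign, points, centers)
print(labels)
print(report)
```

If the report shows most of the cumulative time inside the distance routine, tuning that code may pay off; if the time is spread over the sheer number of calls, that points more toward changing the algorithm.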

>>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish 
>> Er... We often use several different compilers in several different 
>> trees.  Several gccs, pgi, icc, eieio ... you name it.  All are 
>> integrated.
> Are you currently able to run GCC-4.3.x versions on your current setup, 

Currently running 4.2.3-2ubuntu7 on my laptop.  Another machine 
(development box) has something like 4 different gccs on it.  I haven't 
tried 4.3.x yet ... had planned to, but work gets in the way.

> I'm actually eager to know. I'm still living under the ASSumption of 
> binary distributions not coping too well with multi-library 
> environments. Case in point, one of my colleagues _really_ wanted 

No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems 
apart from the occasional broken hard wired /usr/lib with the wrong ABI 
in a configure/make file.  Usually easy to fix.

> firefox 3 on his ubuntu system. The installer trickled down to having to 
> uninstall glibc...and he forced it to YES (and this is just a browser, 
> not something that is used to _make_ code and would be tied to glibc)

Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for 32 
bit access (java and other things).  No glibc changes (apart from 
security patches).  He must have done something horribly wrong.  We have 
multiple mixed ABI ubuntu/centos/suse systems, and haven't had issues.

>>> afterthought of an RPM that pulls in a new glibc that breaks the install 
>> Er ... not the slightest clue as to what you are talking about.  I 
>> haven't seen gcc, icc, pgi, ... touch our glibc.
>> Maybe I am missing the fun.  Which ICC version is this?  Which gcc is 
>> this, which glibc is this?
> Sorry about that I might have been misleading, GCC is generally the one 
> most sensitive to glibc, not the other ones although the latest ICC 
> (10.1.x series) do claim compatibility with the GNU environment so it 
> might get a little more dependency there.

We have installed the 10.1.015 on customer machines from Centos 5.2 
through SuSE 10.x through Ubuntu with nary a problem.  Very different 
glibc's.  No issues with code generation.

Binary distributions aren't evil.  They do work, quite well in most cases.

> Cheers!
> Eric

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
