[Beowulf] Definition of HPC

Hearns, John john.hearns at mclaren.com
Wed Apr 17 01:55:09 PDT 2013


>> >
>> >"Owned compute" has some advantages over "rented compute."  In general, the
>> >control one has over one's owned resources enables applications to run with
>> >greater performance.  Some optimizations just demand root access!
>> >
> As someone who has been Scientific Computing/HPC System Admin, I can
> tell you this is a complete myth.

You need at least access to the person with root access. If (s)he's a colleague 
who needs to keep 10 more users in the department happy, you can sit down and 
find a solution if your program doesn't work. Try to convince Amazon that you 
need SHMMAX increased, and that their default processor affinity settings slow 
your code down.

   Herbert

Herbert, hello again!

You make a very good point there.
Very often I find that when diagnosing problems in HPC you have to have root access -
and you have to start with the simple things first - which are almost always the root cause.

For instance, when user jobs crash or are running slowly, I'm often asked "Is the InfiniBand down?"
I always, always take a step back and look at the nodes running the jobs - look in their system logs etc.
Often you will see things such as port ranges being exhausted for rsh (not on current systems, but a long time ago)
or OOM killer events, or just that processes from old jobs haven't died properly and are still running on the machine.
99% of the time it is just simple system admin which points the way.
And as you say, you need root access to (say) increase port ranges or futz with the OOM killer tunings.
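
As a rough illustration of the "look in the system logs first" step, here is a minimal Python sketch that pulls OOM killer events out of kernel log lines. It assumes the kernel writes a line containing "Killed process <pid> (<name>)", which is what the kernels I have seen do, but the exact wording varies between kernel versions. The function name find_oom_kills is made up for this example, and where the kernel log lives (/var/log/messages, the output of dmesg, or somewhere else) depends on the distribution.

```python
import re

# Assumed message format: "... Killed process 4242 (solver) ...".
# Kernel versions differ, so adjust the pattern to match your own logs.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for each OOM kill found."""
    events = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            events.append((int(match.group(1)), match.group(2)))
    return events
```

To use it on a node, open whichever file holds the kernel messages and pass the file object to find_oom_kills(). Reading that file usually needs root, which is rather the point.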
