[Beowulf] Problems scaling performance to more than one node, GbE

Fri Feb 13 09:23:12 PST 2009

Hi all,

I've been trying to get the best performance on a small cluster we have here
at the University of Aveiro, Portugal, but I have not been able to get most
software to scale to more than one node.

Our specs are as follows:

- HP c-7000 Blade enclosure,
- 8 Blades BL460c:

   - Dual Xeon Quad-core E5430, 2.66 GHz
   - 8GiB FB-DIMM DDR2-667
   - Dual Gigabit Ethernet
   - Dual 146GB 10K RPM SAS HDD in RAID1

The dual GbE ports are based on Broadcom's NetXtreme II, using the driver
from the 2.6.26-gentoo-sources kernel, and they are connected to the internal
switch, which seems to be a Nortel one rebranded by HP.

The problem with this setup is that even calculations that take more than 15
days don't scale to more than 8 cores, or one node. Usually performance is
lower with 12 or 16 cores than with just 8. From what I've been reading,
I should be able to scale fine to at least 16 cores, and to 32 for some
software.

I tried running Gromacs on two nodes using one processor each, to check
whether 8 cores were stressing the GbE too much, and performance dropped
considerably compared with running two CPUs on the same node. This was
unexpected to me, since the benchmarks I've seen on the Gromacs website state
that I should be able to get 100% scaling in this case, sometimes more.

2 cores, 1 node,  1500 steps ---> 361s (2.6.26-r4 and 2.6.24-r6, no IPv6, icc)
----------------------------------------------------------------------
2 cores, 2 nodes, 1500 steps ---> 499s (2.6.26-r4, no GROUP SCHEDULER, icc)

To be more precise, this particular benchmark is the only one that is
stressful enough to benefit from going from 8 to 16 cores.

8 cores,  1 node,  1500 steps ---> 101s (2.6.26-r4, no GROUP SCHEDULER, icc)
----------------------------------------------------------------------
16 cores, 2 nodes, 1500 steps ---> 65s, 65s (no IPv6, 2.6.26, 2.6.26)
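
To put the timings above in terms of speedup and parallel efficiency, here
is a minimal sketch of the arithmetic. The function names are only
illustrative, and it assumes the four runs are directly comparable even
though the kernel options differ slightly between them.

```python
# Timings copied from the Gromacs runs above (seconds for 1500 steps).
runs = {
    "2 cores, 1 node": 361.0,
    "2 cores, 2 nodes": 499.0,
    "8 cores, 1 node": 101.0,
    "16 cores, 2 nodes": 65.0,
}


def slowdown(t_base, t_new):
    """How many times longer the new run takes than the baseline."""
    return t_new / t_base


def speedup(t_base, t_new):
    """How many times faster the new run is than the baseline."""
    return t_base / t_new


def efficiency(t_base, cores_base, t_new, cores_new):
    """Measured speedup divided by the ideal speedup (ratio of core counts)."""
    return speedup(t_base, t_new) / (cores_new / cores_base)


# Same core count, but the two processes are split across two nodes.
two_core_penalty = slowdown(runs["2 cores, 1 node"], runs["2 cores, 2 nodes"])

# Doubling the core count by adding a second node.
s_8_to_16 = speedup(runs["8 cores, 1 node"], runs["16 cores, 2 nodes"])
e_8_to_16 = efficiency(runs["8 cores, 1 node"], 8, runs["16 cores, 2 nodes"], 16)

print(f"2 cores split over 2 nodes: {two_core_penalty:.2f}x slower")
print(f"8 -> 16 cores: speedup {s_8_to_16:.2f}x, efficiency {e_8_to_16:.0%}")
```

With these numbers, splitting two processes across the network costs about
38% in run time, while going from 8 to 16 cores gives roughly a 1.55x
speedup, or about 78% parallel efficiency.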

The rest of the typical calculations done here are less heavy and perform
worse on 16 cores than on 8. I also don't know whether this is just a case
of two "easy" calculations or really a hardware problem, which I would find
strange since calculations that take up to 15 days on 8 cores aren't able to
run faster on 16.

From what I could dig up, it seems that some switches have too much latency
and hamper any kind of proper performance over GbE. If that is my case, are
there any benchmarks I could use to test that theory?
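
One crude way to look at round-trip latency is a small TCP ping-pong. The
sketch below is only an illustration, not an MPI benchmark: it measures TCP
plus Python overhead, so it can show large differences between paths but
not the latency the MPI library itself sees. The function names and the
1-byte message size are assumptions chosen for the example, and the demo
runs over loopback; to test the switch, the echo server would have to run
on one blade and the client on another.

```python
import socket
import threading
import time

MSG_SIZE = 1  # bytes per ping; tiny messages expose latency, not bandwidth


def echo_server(listener):
    """Accept one connection and echo everything back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def measure_rtt(host, port, rounds=1000):
    """Return a list of round-trip times in seconds, one per ping."""
    rtts = []
    payload = b"x" * MSG_SIZE
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < MSG_SIZE:
                chunk = sock.recv(MSG_SIZE - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
            rtts.append(time.perf_counter() - start)
    return rtts


def run_loopback_demo(rounds=200):
    """Start the echo server in a thread and ping it over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()
    rtts = measure_rtt("127.0.0.1", port, rounds)
    server.join(timeout=5)
    listener.close()
    return rtts


if __name__ == "__main__":
    rtts = sorted(run_loopback_demo())
    print(f"median RTT: {rtts[len(rtts) // 2] * 1e6:.1f} us")
```

Comparing the median between two blades with the loopback figure gives a
rough idea of how much the network path adds per message.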

One particular piece of software, VASP, doesn't scale to more than 6 cores,
which seems to be a bandwidth problem due to the FSB used in Xeons, but the
other software behaves quite well.

As for software, I'm using Gentoo Linux, ICC/IFC/GotoBLAS (I tried ScaLAPACK
with no benefit), Open MPI and Torque, running in x86-64 mode.

Any help would be very welcome.

Best regards,

                         Tiago Marques