On Sat, Feb 14, 2009 at 6:43 PM, David Mathog <span dir="ltr"><<a href="mailto:mathog@caltech.edu">mathog@caltech.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="Ih2E3d">Tiago Marques <<a href="mailto:a28427@ua.pt">a28427@ua.pt</a>><br>

<br>

<br>

</div><div class="Ih2E3d">> I've been trying to get the best performance on a small cluster we have here<br>
> at University of Aveiro, Portugal, but I've not been able to get most<br>
> software to scale to more than one node.<br>

<br>

</div>&lt;SNIP&gt;<br>

<div class="Ih2E3d"><br>

> The problem with this setup is that even calculations that take more than 15<br>
> days don't scale to more than 8 cores, or one node. Usually performance is<br>
> lower with 16 cores, 12 cores, than with just 8. From what I've been reading,<br>
> I should be able to scale fine at least till 16 cores and 32 for some<br>
> software.<br>

<br>

</div>&lt;SNIP&gt;<br>

<div class="Ih2E3d">><br>

> I tried with Gromacs to have two nodes using one processor each, to check if<br>
> 8 cores were stressing the GbE too much, and the performance dropped too<br>
> much compared with running two CPUs on the same node.<br>

<br>

</div>Lots of possibilities here. Most of them are probably coming down to the<br>

code not being written to make good use of a cluster environment, and/or<br>

there not being any way to do that (single threaded code with a lot of<br>

unpredictable branching).<br>

<br>

For Gromacs I suggest you ask on that mailing list.  My recollection is<br>

that it was known to scale poorly, but that was a couple of years ago,<br>

and maybe they have improved it since then.  If it doesn't scale you can<br>

always get more throughput by running one independent job on each of<br>

your nodes, using local storage to avoid network contention to the file<br>

server.  It may take 15 days to finish a run, but at least you'll have N<br>

times more work completed.  Running N independent jobs will give you at<br>

least as much throughput as running 1 job on N cores.  Admittedly it is<br>

nice to have the results in 1/Nth the time.<br>

</blockquote><div></div><div>I already did that, but there weren't many helpful people on the Gromacs list. They just told me to wait for version 4.0, which I did. It scales better, though still not as well as I had hoped.</div><div></div><div>

We have already been running a single job per node for months, but it would be good to be able to finish individual jobs faster; sometimes that is needed.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Some of what you may be seeing with poorer performance on more cores on<br>

one node is probably related to the effect on memory access, especially<br>

through cache.  Code that can go in and out of cache runs much faster<br>

than anything which has to go to main memory, and as soon as you run two<br>

competing (which depends on architecture) processes you may find that<br>

the two programs are throwing each other's data out of any shared cache,<br>

which can result in dramatic slowdowns.<br>

<br>

Give gprof a shot too.  You want to see where your code is spending most<br>

of its time.  If it spends 95% of its time in routines with no network<br>

IO, then the network is likely not your issue.  And vice versa.<br>

<div class="Ih2E3d"></div></blockquote><div></div><div>I have thought of that, but I didn't manage to get it working with the more important codes. The code compiles, but it just doesn't write out the profiling output.</div><div></div>

<div>I have used "iftop" to measure network usage, and it is probably around 300-400 Mbit/s, so I suspect the problem is latency; throughput seems fine. When copying files with "scp", I can get 93 MB/s.</div>
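<div>To illustrate why latency rather than bandwidth could be the limit, here is a rough sketch using a simple "latency + size/bandwidth" cost model for one message. The 50 microsecond latency and 1 Gbit/s bandwidth are assumed round figures for GbE, not values measured on our cluster, and the function names are made up for this example.</div>

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bps):
    """Time to send one message: fixed latency plus size divided by bandwidth."""
    return latency_s + (msg_bytes * 8) / bandwidth_bps


def latency_share(msg_bytes, latency_s, bandwidth_bps):
    """Fraction of the transfer time that is pure latency."""
    return latency_s / transfer_time(msg_bytes, latency_s, bandwidth_bps)


# Assumed figures for illustration only (not measured on the cluster).
LATENCY_S = 50e-6      # hypothetical one-way latency over GbE
BANDWIDTH_BPS = 1e9    # nominal GbE line rate in bits per second

if __name__ == "__main__":
    for size in (64, 1024, 6250, 65536, 1048576):
        share = latency_share(size, LATENCY_S, BANDWIDTH_BPS)
        print(f"{size:>8d} bytes: {share:6.1%} of the transfer time is latency")
```

<div>Under these assumptions, latency and wire time are equal at 6250 bytes per message. A code that exchanges many messages smaller than that would spend most of its communication time waiting on latency, even though a bulk copy with "scp" shows good throughput.</div>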

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d">

> unexpected for me, since the benchmarks I've seen on Gromacs website state<br>

> that I should be able to have 100% scaling on this case, sometimes more.<br>

<br>

</div>Contact the person who said that, get the exact conditions, and see if<br>

you can replicate them.  You might have a network issue, but unless you<br>

are comparing apples to apples it may be hard to figure it out.</blockquote><div></div><div>True. Thanks for the help.</div><div></div><div>I must ask: doesn't anybody on this list get good scaling with 16 cores on two nodes, for a code and job that completes in about a week?</div>

<div>Or does most code that finishes in a week or two only scale with InfiniBand and the like, in something like 99% of cases?</div><div></div><div>Best regards,</div><div>                         Tiago Marques</div><div> </div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>

<br>

Regards,<br>

<font color="#888888"><br>

David Mathog<br>

<a href="mailto:mathog@caltech.edu">mathog@caltech.edu</a><br>

Manager, Sequence Analysis Facility, Biology Division, Caltech<br>

</font><div><div class="Wj3C7c">_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a><br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

</div></div></blockquote><br>