[Beowulf] MPI performance on clusters of SMP
Franz Marini franz.marini at mi.infn.it
Fri Aug 27 00:47:57 PDT 2004
On Thu, 2004-08-26 at 20:22, Kozin, I (Igor) wrote:
> Thus using both cpus on a node creates even higher load on the
> connection. Hypothetically, when the memory bandwidth and
> the switch are not a problem then
> using N x 2 configuration
> with 2 network cards per node should be always superior to
> using 2*N x 1 config with 1 network card per node.
> (same number of cards and cpus!).

Err. Not always ;)

As Philippe pointed out, it depends on the architecture of the dual cpu boxes, too.

With Opterons, N x 2 w/ 2 nics should effectively be faster than 2N x 1 w/ 1 nic (or at least, it shouldn't be slower). With Xeons, I'd say that most of the time the opposite is true. This is due to the different memory architecture of the two cpus: Opterons have integrated memory controllers, while Xeons share a single, off-chip memory controller.

Keep in mind that (unless your problem fits entirely in L1/L2 cache) using a 2 cpu box doesn't guarantee you 2x the speed (in fact, it never reaches 2x). Usually, you'll see a 60-90% improvement over a single processor. With Opterons (I'm assuming we're only talking about x86 architectures, and thus I'm not considering mips, pa-risc, and so on) these figures are usually higher than with Xeons, mainly because of the memory architecture used by the two.

To give you an example of the complexity of the problem: on our dual-Xeon, Infiniband cluster, using both Gromacs and CPMD, we are able to achieve much better scalability using 2*N x 1 cpus than N x 2. It looks like dual-Xeons take a big performance hit when the two cpus run the same process (and thus have the same memory access pattern).

Please note that if you run a copy of Gromacs on N x 1 and at the same time a copy of CPMD on N x 1 (thus using all 2*N cpus, but with a different process on each of the two cpus of each node, and thus different memory access patterns), you achieve almost perfect scalability with both programs.
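To put the 60-90% figure in terms of speedup and per-cpu efficiency, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the only inputs are the percentages quoted above; nothing here is measured on our cluster.

```python
def improvement_to_speedup(improvement_pct: float) -> float:
    """Convert 'X% faster than one cpu' into a speedup factor (hypothetical helper)."""
    return 1.0 + improvement_pct / 100.0


def parallel_efficiency(speedup: float, n_cpus: int = 2) -> float:
    """Fraction of the ideal n_cpus-fold speedup actually obtained."""
    return speedup / n_cpus


# The two ends of the 60-90% range quoted in the message.
for pct in (60, 90):
    s = improvement_to_speedup(pct)
    e = parallel_efficiency(s)
    print(f"{pct}% improvement -> speedup {s:.2f}x, efficiency {e:.0%}")
```

So a dual box in that range delivers roughly 80% to 95% of what two separate single-cpu nodes would, before any interconnect effects are taken into account.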
Thus, it looks like Xeons are heavily (negatively) impacted by the memory access pattern of the processes running on the cpus of an smp box. I've been told that with Opterons this kind of problem is much less pronounced.

Oh, btw, the same holds true if you launch two Gromacs instances, each on N x 1, at different times (or with different inputs), so it really looks like the problem is directly related to the memory access patterns of the processes running on the cpus.

In the end, I'd say that if you plan to run only a single copy of a single program across the whole cluster, you'd be better off with a 2*N x 1 solution. On the other hand, if you plan to run different programs at the same time, an N x 2 solution is much better: you get much lower costs with almost the same (global) performance. (The cost advantage assumes we're talking about high speed interconnects, because fast ethernet and even gigabit are quite cheap right now.)

Have a good day,

Franz

---------------------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : franz.marini at mi.infn.it
phone : +39 02 50317221
---------------------------------------------------------
