[Beowulf] MPI2007 out - strange pop2 results?

Fri Jul 20 11:27:23 PDT 2007

Hi Gilad,

  Thank you for the personal attack that came, apparently, without even
reading the email I sent.  Brian asked why the publicly available,
independently run MPI2007 results from HP were worse on a particular
benchmark than the Cambridge cluster MPI2007 results.  I talked about three
contributing factors to that.  If you have other reasons you want to put
forward, please do so based on data, rather than engaging in a blatant
ad hominem attack.

  If you want to engage in a marketing war, there are venues in which
to do that, but I think on the Beowulf mailing list data and coherent
thought are probably more appropriate.

-Kevin

On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:
> Dear Kevin,
> 
> You continue to set world records in providing misleading information.
> You had previously compared Mellanox-based products on dual single-core
> machines to the "InfiniPath" adapter on dual dual-core machines and
> claimed that with InfiniPath there are more Gflops.... This latest release
> follows the same lines...
> 
> Unlike QLogic InfiniPath adapters, Mellanox provides different InfiniBand
> HCA silicon and adapters. There are 4 different silicon chips, each with
> a different size, different power, different price and different
> performance. There is the PCI-X device (InfiniHost), the single-port
> device that was designed for best price/performance (InfiniHost III Lx),
> the dual-port device that was designed for best performance (InfiniHost
> III Ex) and the new ConnectX device that was designed to extend the
> performance capabilities of the dual-port device. Each device provides
> a different price and performance point (did I say different?).
> 
> The SPEC results that you are using for Mellanox are of the single-port
> device. And even that device (whose list price is probably half that of
> your InfiniPath) had better results with 8 server nodes than yours....
> Your comparison of InfiniPath to the Mellanox single-port device should
> have been on price/performance and not on performance. Now, if you want
> to really compare performance to performance, why don't you use the
> dual-port device, or even better, ConnectX? Well... I will do it for you.
> Every time I have compared my performance adapters to yours, your
> adapters did not even come close...
> 
> 
> Gilad. 
> 
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
> On Behalf Of Kevin Ball
> Sent: Thursday, July 19, 2007 11:52 AM
> To: Brian Dobbins
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
> 
> Hi Brian,
> 
>    The benchmark 121.pop2 is based on a code that was already important
> to QLogic customers before the SPEC MPI2007 suite was released (POP,
> Parallel Ocean Program), and we have done a fair amount of analysis
> trying to understand its performance characteristics.  There are three
> things that stand out in performance analysis on pop2.
> 
>   The first point is that it is a very demanding code on the compiler. 
> There has been a fair amount of work on pop2 by the PathScale compiler
> team, and the fact that the Cambridge submission used the PathScale
> compiler while the HP submission used the Intel compiler accounts for
> some (the serial portion) of the advantage at small core counts, though
> scalability should not be affected by this.
> 
>   The second point is that pop2 is fairly demanding of IO.  Another
> example to look at for this is in comparing the AMD Emerald Cluster
> results to the Cambridge results;  the Emerald cluster is using NFS over
> GigE from a single server/disk, while Cambridge has a much more
> optimized IO subsystem.  While on some results Emerald scales better,
> for pop2 it scales only from 3.71 to 15.0 (4.04X) while Cambridge scales
> from 4.29 to 21.0 (4.90X).  The HP system appears to be using NFS over
> DDR IB from a single server with a RAID;  thus it should fall somewhere
> between Emerald and Cambridge in this regard.
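
As a quick check of the scaling factors quoted above, the ratios can be recomputed from the scores given in this message. This is only a sketch: the helper name `scaling_factor` is made up for illustration, and since the core counts behind each pair of scores are not stated in the paragraph, only the ratios are computed.

```python
def scaling_factor(low_score: float, high_score: float) -> float:
    """Ratio of the larger-run score to the smaller-run score."""
    return high_score / low_score


# 121.pop2 scores as quoted in the message above: (smaller run, larger run).
pop2_scores = {
    "Emerald": (3.71, 15.0),
    "Cambridge": (4.29, 21.0),
}

# Round to two decimals, matching the precision used in the message.
factors = {
    name: round(scaling_factor(low, high), 2)
    for name, (low, high) in pop2_scores.items()
}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}X")
```

The ratios come out at 4.04X for Emerald and 4.90X for Cambridge, matching the figures in the text.
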
> 
>   The first two points account for some of the difference, but by no
> means all.  The final one is probably the most crucial.  The code pop2
> uses a communication pattern consisting of many small/medium sized
> (between 512 bytes and 4k) point-to-point messages punctuated by
> periodic tiny (8-byte) allreduces.  The QLogic InfiniPath architecture
> performs far better in this regime than the Mellanox InfiniHost
> architecture.
> 
>   This is consistent with what we have seen in other application
> benchmarking;  even SDR InfiniBand based on the QLogic InfiniPath
> architecture performs in general as well as DDR InfiniBand based on the
> Mellanox InfiniHost architecture, and in some cases better.
> 
> 
> Full disclosure:  I work for QLogic on the InfiniPath product line.
> 
> -Kevin
> 
> 
> On Wed, 2007-07-18 at 18:50, Brian Dobbins wrote:
> > Hi guys,
> > 
> >   Greg, thanks for the link!  It will no doubt take me a little while
> > to parse all the MPI2007 info (even though there are only a few
> > submitted results at the moment!), but one of the first things I
> > noticed was that performance of pop2 on the HP blade system was beyond
> > atrocious... any thoughts on why this is the case?  I can't see any
> > logical reason for the scaling they have, which (being the first thing
> > I noticed) makes me somewhat hesitant to put much stock into the
> > results at the moment.  Perhaps this system is just a statistical blip
> > on the radar which will fade into noise when additional results are
> > posted, but until that time, it'd be nice to know why the results are
> > the way they are.
> > 
> >   To spell it out a bit, the reference platform is at 1 (ok, 0.994) on
> > 16 cores, but then the HP blade system at 16 cores is at 1.94.  Not 
> > bad there.  However, moving up we have:
> >    32 cores - 2.36
> >    64 cores - 2.02
> >   128 cores - 2.14
> >   256 cores - 3.62
> > 
> >   So not only does it hover at 2.x for a while, but then going from
> > 128 -> 256 it gets a decent relative improvement.  Weird.
> >   On the other hand, the Cambridge system (with the same processors
> > and a roughly similar interconnect, it seems) has the following scaling
> > from 32->256 cores:
> > 
> >    32 cores - 4.29
> >    64 cores - 7.37
> >   128 cores - 11.5
> >   256 cores - 15.4
> > 
> >   ... So, I'm mildly confused as to the first results.  Granted, 
> > different compilers are being used, and presumably there are other 
> > differences, too, but I can't see how -any- of them could result in 
> > the scores the HP system got.  Any thoughts?  Anyone from HP (or
> > QLogic) care to comment?  I'm not terribly knowledgeable about the MPI
> > 2007 suite yet, unfortunately, so maybe I'm just overlooking 
> > something.
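
To make the comparison above concrete, here is a small sketch that turns the two score tables into speedup and parallel efficiency relative to the 32-core run. It assumes the score is proportional to speed (a higher score means a faster run); the function name `relative_scaling` and the dictionary names are invented for this illustration.

```python
# 121.pop2 scores as listed in the message above (cores -> score).
hp_blade = {32: 2.36, 64: 2.02, 128: 2.14, 256: 3.62}
cambridge = {32: 4.29, 64: 7.37, 128: 11.5, 256: 15.4}


def relative_scaling(scores):
    """Return {cores: (speedup, efficiency)} relative to the smallest run.

    Assumption: the score is proportional to speed, so the ratio of two
    scores is the speedup between the two runs.
    """
    base_cores = min(scores)
    base_score = scores[base_cores]
    result = {}
    for cores in sorted(scores):
        speedup = scores[cores] / base_score
        ideal = cores / base_cores  # perfect linear scaling
        result[cores] = (speedup, speedup / ideal)
    return result


for name, scores in (("HP blade", hp_blade), ("Cambridge", cambridge)):
    print(name)
    for cores, (speedup, eff) in relative_scaling(scores).items():
        print(f"  {cores:4d} cores: {speedup:5.2f}x vs 32 cores, "
              f"{eff:6.1%} of ideal")
```

With these inputs, the HP system reaches roughly 1.5x its 32-core score at 256 cores, while the Cambridge system reaches roughly 3.6x, against an ideal of 8x.
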
> > 
> >   Cheers,
> >   - Brian
> > 
> > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org To change your subscription 
> > (digest mode or unsubscribe) visit 
> > http://www.beowulf.org/mailman/listinfo/beowulf