[Beowulf] MPI2007 out - strange pop2 results?

Scott Atchley atchley at myri.com
Sat Jul 21 03:43:52 PDT 2007

Hi Gilad,

Presentation at ISC? I did not attend this year and, while I did last  
year, I did not give any presentations. I simply talked to customers  
in our booth and walked the floor. I even stopped by the Mellanox  
booth and chatted awhile. :-)


On Jul 20, 2007, at 9:31 PM, Gilad Shainer wrote:

> Hi Scott,
> I always try to state exactly what I am comparing against, and not to
> make it out to be something it is not. In most cases, I use the exact
> same platform and mention the details. This makes the information much
> more credible, don't you agree?
> By the way, in the presentation you gave at ISC, you did exactly the
> same thing as my dear friends from QLogic did... sorry, I could not
> resist...
> G
> -----Original Message-----
> From: Scott Atchley [mailto:atchley at myri.com]
> Sent: Friday, July 20, 2007 6:21 PM
> To: Gilad Shainer
> Cc: Kevin Ball; beowulf at beowulf.org
> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
> Gilad,
> And you would never compare your products against our deprecated
> drivers and five-year-old hardware. ;-)
> Sorry, couldn't resist. My colleagues are rolling their eyes...
> Scott
> On Jul 20, 2007, at 2:55 PM, Gilad Shainer wrote:
>> Hi Kevin,
>> I believe that your company has been using this list for pure
>> marketing wars for a long time, so don't be surprised when someone
>> responds back.
>> If you want to post technical or performance data and then draw
>> conclusions from it, be sure to compare apples to apples. It is easy
>> to take your competitor's lower-performance device results and then
>> attack his "architecture" or his entire product line. If this is not
>> a marketing war, then I would be interested to know what you call a
>> marketing war....
>> G
>> -----Original Message-----
>> From: Kevin Ball [mailto:kevin.ball at qlogic.com]
>> Sent: Friday, July 20, 2007 11:27 AM
>> To: Gilad Shainer
>> Cc: Brian Dobbins; beowulf at beowulf.org
>> Subject: RE: [Beowulf] MPI2007 out - strange pop2 results?
>> Hi Gilad,
>>   Thank you for the personal attack that came, apparently, without
>> even reading the email I sent.  Brian asked why the publicly
>> available, independently run MPI2007 results from HP were worse on a
>> particular benchmark than the Cambridge cluster MPI2007 results.  I
>> talked about three contributing factors.  If you have other reasons
>> you want to put forward, please do so based on data, rather than
>> engaging in a blatant ad hominem attack.
>>   If you want to engage in a marketing war, there are venues in which
>> to do it, but I think on the Beowulf mailing list data and coherent
>> thought are probably more appropriate.
>> -Kevin
>> On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:
>>> Dear Kevin,
>>> You continue to set world records in providing misleading
>>> information.
>>> You previously compared Mellanox-based products on dual single-core
>>> machines to the "InfiniPath" adapter on dual dual-core machines and
>>> claimed that with InfiniPath there are more Gflops.... This latest
>>> release follows the same lines...
>>> Unlike QLogic with its InfiniPath adapters, Mellanox provides
>>> different InfiniBand HCA silicon and adapters. There are 4 different
>>> silicon chips, each with a different size, different power,
>>> different price and different performance. There is the PCI-X device
>>> (InfiniHost), the single-port device that was designed for best
>>> price/performance (InfiniHost III Lx), the dual-port device that was
>>> designed for best performance (InfiniHost III Ex) and the new
>>> ConnectX device that was designed to extend the performance
>>> capabilities of the dual-port device. Each device provides different
>>> price and performance points (did I say different?).
>>> The SPEC results that you are using for Mellanox are from the
>>> single-port device. And even that device (whose list price is
>>> probably half that of your InfiniPath) had better results with 8
>>> server nodes than yours....
>>> Your comparison of InfiniPath to the Mellanox single-port device
>>> should have been on price/performance and not on performance. Now,
>>> if you want to really compare performance to performance, why don't
>>> you use the dual-port device, or even better, ConnectX? Well... I
>>> will do it for you.
>>> Every time I have compared my performance adapters to yours, your
>>> adapters did not even come close...
>>> Gilad.
>>> -----Original Message-----
>>> From: beowulf-bounces at beowulf.org
>>> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Kevin Ball
>>> Sent: Thursday, July 19, 2007 11:52 AM
>>> To: Brian Dobbins
>>> Cc: beowulf at beowulf.org
>>> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
>>> Hi Brian,
>>>   The benchmark 121.pop2 is based on a code that was already
>>> important to QLogic customers before the SPEC MPI2007 suite was
>>> released (POP, Parallel Ocean Program), and we have done a fair
>>> amount of analysis trying to understand its performance
>>> characteristics. There are three things that stand out in
>>> performance analysis on pop2.
>>>   The first point is that it is a very demanding code on the
>>> compiler. There has been a fair amount of work on pop2 by the
>>> PathScale compiler team, and the fact that the Cambridge submission
>>> used the PathScale compiler while the HP submission used the Intel
>>> compiler accounts for some (the serial portion) of the advantage at
>>> small core counts, though scalability should not be affected by
>>> this.
>>>   The second point is that pop2 is fairly demanding of IO.  Another
>>> example to look at for this is the comparison of the AMD Emerald
>>> cluster results to the Cambridge results;  the Emerald cluster is
>>> using NFS over GigE from a single server/disk, while Cambridge has a
>>> much more optimized IO subsystem.  While on some results Emerald
>>> scales better, for pop2 it scales only from 3.71 to 15.0 (4.04X)
>>> while Cambridge scales from 4.29 to 21.0 (4.90X).  The HP system
>>> appears to be using NFS over DDR IB from a single server with a
>>> RAID;  thus it should fall somewhere between Emerald and Cambridge
>>> in this regard.
>>>   The first two points account for some of the difference, but by no
>>> means all.  The final one is probably the most crucial.  The code
>>> pop2 uses a communication pattern consisting of many small/medium
>>> sized (between 512 bytes and 4k) point-to-point messages punctuated
>>> by periodic tiny (8b) allreduces.  The QLogic InfiniPath
>>> architecture performs far better in this regime than the Mellanox
>>> InfiniHost architecture.
>>>   This is consistent with what we have seen in other application
>>> benchmarking;  even SDR InfiniBand based on the QLogic InfiniPath
>>> architecture performs in general as well as DDR InfiniBand based on
>>> the Mellanox InfiniHost architecture, and in some cases better.
>>> Full disclosure:  I work for QLogic on the InfiniPath product line.
>>> -Kevin
>>> On Wed, 2007-07-18 at 18:50, Brian Dobbins wrote:
>>>> Hi guys,
>>>>   Greg, thanks for the link!  It will no doubt take me a little
>>>> while to parse all the MPI2007 info (even though there are only a
>>>> few submitted results at the moment!), but one of the first things
>>>> I noticed was that performance of pop2 on the HP blade system was
>>>> beyond atrocious... any thoughts on why this is the case?  I can't
>>>> see any logical reason for the scaling they have, which (being the
>>>> first thing I noticed) makes me somewhat hesitant to put much stock
>>>> into the results at the moment.  Perhaps this system is just a
>>>> statistical blip on the radar which will fade into noise when
>>>> additional results are posted, but until that time, it'd be nice to
>>>> know why the results are the way they are.
>>>>   To spell it out a bit, the reference platform is at 1 (ok, 0.994)
>>>> on 16 cores, but then the HP blade system at 16 cores is at 1.94.
>>>> Not bad there.  However, moving up we have:
>>>>    32 cores - 2.36
>>>>    64 cores - 2.02
>>>>   128 cores - 2.14
>>>>   256 cores - 3.62
>>>>   So not only does it hover at 2.x for a while, but then going from
>>>> 128 -> 256 it gets a decent relative improvement.  Weird.
>>>>   On the other hand, the Cambridge system (with the same processors
>>>> and a roughly similar interconnect, it seems) has the following
>>>> scaling from 32->256 cores:
>>>>    32 cores - 4.29
>>>>    64 cores - 7.37
>>>>   128 cores - 11.5
>>>>   256 cores - 15.4
>>>>   ... So, I'm mildly confused as to the first results.  Granted,
>>>> different compilers are being used, and presumably there are other
>>>> differences, too, but I can't see how -any- of them could result in
>>>> the scores the HP system got.  Any thoughts?  Anyone from HP (or
>>>> QLogic) care to comment?  I'm not terribly knowledgeable about the
>>>> MPI2007 suite yet, unfortunately, so maybe I'm just overlooking
>>>> something.
>>>>   Cheers,
>>>>   - Brian
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org To change your
>>>> subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
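
For anyone who wants to redo the arithmetic in this thread, here is a
minimal Python sketch that turns the pop2 scores Brian quoted into
speedup and parallel efficiency relative to the 32-core result. The
scores are copied from his message; the names scaling and report are
illustrative only and are not part of any SPEC tooling, and "efficiency"
here is simply speedup divided by the increase in core count.

```python
# pop2 scores as quoted by Brian Dobbins in this thread (cores -> score).
HP_BLADE = {16: 1.94, 32: 2.36, 64: 2.02, 128: 2.14, 256: 3.62}
CAMBRIDGE = {32: 4.29, 64: 7.37, 128: 11.5, 256: 15.4}


def scaling(scores, base_cores):
    """Return {cores: (speedup, efficiency)} relative to base_cores.

    speedup    = score / score at base_cores
    efficiency = speedup / (cores / base_cores); 1.0 would be ideal scaling
    Core counts below base_cores are skipped.
    """
    base = scores[base_cores]
    result = {}
    for cores in sorted(scores):
        if cores < base_cores:
            continue
        speedup = scores[cores] / base
        result[cores] = (speedup, speedup / (cores / base_cores))
    return result


def report(name, scores, base_cores=32):
    """Print one line per core count for a single system."""
    print(f"{name} (relative to {base_cores} cores)")
    for cores, (speedup, eff) in scaling(scores, base_cores).items():
        print(f"  {cores:4d} cores: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")


if __name__ == "__main__":
    report("HP blade", HP_BLADE)
    report("Cambridge", CAMBRIDGE)
```

On these figures, going from 32 to 256 cores gives Cambridge roughly a
3.6x gain and the HP blade system roughly 1.5x, which is the gap Brian
was asking about.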
