very high bandwidth, low latency manner?

Tue Apr 16 03:24:37 PDT 2002

Hi,

I am sorry to hear that you were unable to achieve the expected performance
on the mentioned SCI-based systems. You raise a couple of issues, which I
would like to address:

1) Performance.

Performance transparency is always a goal. Nevertheless, an implementation
will sometimes have a performance bug. The two organizations owning the
mentioned systems both have support agreements with Scali. I have checked
the support requests, but cannot find any request where your incidents were
reported. We find this strange if you truly were aiming at achieving good
performance. We are happy to look into your application and report findings
back to this news group.

2) Startup time.

You attribute the poor scalability to high startup time and mapping of
memory. This is an interesting hypothesis, and it can easily be verified by
using a switch when you start the program and measuring the difference
between the elapsed time of the application and the time it uses after
MPI_Init() has been called. However, the startup time measured on 64 nodes,
two processors per node, where all processes have set up mapping to all
other processes, is nn second. If this contributes to poor scalability, your
application has a very short runtime.
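
As an illustration only, the check described above can be sketched in a few
lines of Python. The function name and the example timings are hypothetical
and are not measurements from any of the systems discussed; the two inputs
are assumed to be the wall-clock time of the whole job and the time measured
from the return of MPI_Init() to the end of the run.

```python
def split_startup(total_elapsed_s, after_init_elapsed_s):
    """Return (startup seconds, startup share of the total run).

    total_elapsed_s:      wall-clock time of the whole job, launch to exit
    after_init_elapsed_s: time from the return of MPI_Init() to exit
    """
    if total_elapsed_s <= 0:
        raise ValueError("total elapsed time must be positive")
    if not 0 <= after_init_elapsed_s <= total_elapsed_s:
        raise ValueError("time after MPI_Init() must lie within the total")
    startup_s = total_elapsed_s - after_init_elapsed_s
    return startup_s, startup_s / total_elapsed_s


# Made-up example: a 600 s job that spends 590 s after MPI_Init().
startup_s, share = split_startup(600.0, 590.0)
print(f"startup: {startup_s:.1f} s ({share:.1%} of the run)")
```

If the startup share stays small as the node count grows, startup cost
cannot be what limits the speedup.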

3) SCI ring structure

You state that in a multi-user, multi-process environment, it is hard to
get deterministic performance numbers. Indeed, that is true. True sharing
of resources implies that. Whether the resource is a file server, a memory
controller, or a network component, you will probably always be subject to
performance differences. Also, lack of page coloring will contribute to
different execution times, even for a sequential program. You further
indicate that performance numbers reported, for example, by the Pallas PMB
benchmark can only be used for applying for more VC. I disagree for two
reasons. First, you imply that venture capitalists are naive (and to some
extent stupid). That is not my impression; rather the opposite. Secondly,
such numbers are a good way to verify or refute your hypothesis that the
SCI ring structure is sensitive to traffic generated by other applications.
PMB's *multi* option is designed to investigate exactly the problem you
mention: run, for example, MPI_Alltoall() on N/2 of the machine. Then
measure how performance is affected when the other N/2 of the machine is
also running MPI_Alltoall(). This is the reason we are interested in
comparative performance numbers to SCI-based systems. It is strange to me
that no Pallas PMB benchmark results have ever been published for a
reasonably sized system based on alternative interconnect technologies. To
quote Lord Kelvin: "If you haven't measured it, you don't know what you're
talking about".
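
To make the comparison concrete, here is a minimal sketch of how the two
sets of timings could be reduced to a slowdown factor per message size. It
does not parse real PMB output; the function name, the dictionary layout,
and the example timings are assumptions made for illustration only.

```python
def interference_slowdown(alone_us, shared_us):
    """Return {message size: shared time / alone time}.

    alone_us:  Alltoall time per message size with only N/2 of the
               machine active (microseconds)
    shared_us: the same measurement while the other N/2 of the machine
               is also running Alltoall
    Sizes missing from either run are skipped.
    """
    slowdown = {}
    for size, t_alone in alone_us.items():
        if size not in shared_us:
            continue
        if t_alone <= 0:
            raise ValueError("timings must be positive")
        slowdown[size] = shared_us[size] / t_alone
    return slowdown


# Made-up timings, keyed by message size in bytes.
alone = {1024: 100.0, 4096: 200.0}
shared = {1024: 150.0, 4096: 200.0}
for size, factor in sorted(interference_slowdown(alone, shared).items()):
    print(f"{size:>6} bytes: x{factor:.2f}")
```

A factor close to 1 at all message sizes would argue against the
hypothesis; a factor well above 1 would support it.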

As a bottom line, I would prefer that initiatives to compare cluster
interconnect performance be appreciated, rather than scrutinized and
described as "only usable to apply for more VC".

H
At 11:40 AM 4/15/02 +0200, Markus Fischer wrote:
>Steffen Persvold wrote:
> >
> > Now we have price comparisons for the interconnects (SCI,Myrinet and
> > Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> > ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> > node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII
> > ServerWorks HE-SL based cluster).
>
>yes, please.
>
>I would like to get/see some numbers.
>I have run tests with SCI for a non linear diffusion algorithm on a 96 node
>cluster with 32/33 interface. I thought that the poor
>scalability was due to the older interface, so I switched to
>a SCI system with 32 nodes and 64/66 interface.
>
>Still, the speedup values were behaving like a dog with more than 8 nodes.
>
>Especially, the startup time will reach minutes which is probably due to
>the exporting and mapping of memory.
>
>Yes, the MPI library used was Scampi. Thus, I think the
>(marketing) numbers you provide
>below are not relevant except for applying for more VC.
>
>Even worse, we noticed, that the SCI ring structure has an impact on the
>communication pattern/performance of other applications.
>This means we only got the same execution time if other nodes were
>idle or did not have communication intensive applications.
>How will you determine the performance of the algorithm you just invented
>in such a case ?
>
>We then used a 512 node cluster with Myrinet2000. The algorithm scaled
>very fine up to 512 nodes.
>
>Markus
>
> >
> > Regards,
> > --
> >   Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
> >  mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
> > Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
> > Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Håkon Bugge; VP Product Development; Scali AS;
mailto:hob at scali.no; http://www.scali.com; fax: +47 22 62 89 51;
Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
Mail Addr:  Scali AS, Postboks 150, Oppsal, N-0619  Oslo, Norway;