[Beowulf] Re: dual core (latency)

Vincent Diepeveen diep at xs4all.nl
Mon Jul 18 19:50:09 PDT 2005


Hello Stuart,

Thanks for your answer regarding numactl tools.

Your answer doesn't necessarily explain why the dual core latency (with or
without numactl) is far worse, yes 30%+ worse, than that of single cpu
opterons of the same speed, when benchmarking just 1 core (with the others
sitting idle).

Any thoughts on that?

Thanks,
Vincent
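
As a rough check on the "30%+" figure, the single-process latencies quoted further down in this thread can be compared directly. The sketch below is illustrative only: `percent_slower` and `print_latency_comparison` are hypothetical helpers, not something from the thread, and the quoted machines differ in clock speed (1.8GHz dual core versus 2.2GHz single core), so this is not a strict like-for-like comparison.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: how much slower, in percent, measured_ns is
   compared with baseline_ns. */
double percent_slower(double measured_ns, double baseline_ns) {
    assert(baseline_ns > 0.0);
    return (measured_ns - baseline_ns) / baseline_ns * 100.0;
}

/* Prints comparisons using only the numbers quoted in the thread. */
void print_latency_comparison(void) {
    /* One process: dual core box (144-147 ns) vs single core box. */
    printf("vs 115 ns (ECC on):  %.1f%% to %.1f%% slower\n",
           percent_slower(144.0, 115.0), percent_slower(147.0, 115.0));
    printf("vs 113 ns (ECC off): %.1f%% to %.1f%% slower\n",
           percent_slower(144.0, 113.0), percent_slower(147.0, 113.0));

    /* Dual core box only: 2, 4 and 8 processes vs 1 process (144 ns). */
    const double loaded_ns[] = { 174.0, 206.0, 234.0 };
    const int procs[] = { 2, 4, 8 };
    for (int i = 0; i < 3; i++)
        printf("%d processes: %.1f%% slower than 1 process\n",
               procs[i], percent_slower(loaded_ns[i], 144.0));
}
```

Calling `print_latency_comparison()` from `main` prints the ranges; with the quoted numbers the single-process gap comes out at roughly 25 to 30 percent.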

At 08:17 AM 7/19/2005 +0800, Stuart Midgley wrote:
>The numactl tools won't generally help latency.  Latency isn't the  
>issue with Opteron based systems (or any system with multiply  
>connected distributed memory controllers).
>
>The real issue is page locality (which is the case with most numa  
>based systems).
>
>If you run 2 processes on a dual cpu (single core) system and they
>both happen to allocate their pages on the same memory controller,
>they will each only see 1/2 the memory bandwidth and 1 controller
>sits idle.  That's the real issue (and the extreme pathological case).
>
>Linux 2.6 generally does a good job of putting the pages on the memory
>controller attached to the cpu that the process is running on.  However,
>it can't get it perfect.  There is always more than 1 process/cpu on
>a system, so there is always a little noise... so there is always the
>chance that some pages can be spread around.  Also, the system buffer
>cache will get spread around, affecting everyone.
>
>Add into the mix the possibility of suspending processes and you can
>end up with a process's pages all over the place.  Since Linux
>doesn't yet have page migration, once a page is allocated it won't be
>moved to a different memory controller unless it is swapped out.
>
>With numactl tools you will force the pages to be allocated on the
>right memory/cpu.  The process's buffer cache will also be locked
>down (which is another VERY important issue)...
>
>I have used numa tools to double the performance of some codes (or
>perhaps it's more correct to say to get back to the correct performance).
>
>Stu.
>
>
>On 18/07/2005, at 22:38, Vincent Diepeveen wrote:
>
>> I've been toying some with the numactl at dual core and it doesn't
>> really seem to help much. It helps 0.00
>>
>> System: Ubuntu on a quad opteron dual core 1.8Ghz, 2.6.10-5 smp kernel.
>>
>> Latencies as measured by my own program (TLB thrashing read of 8 bytes,
>> each cpu 250MB buffer):
>>
>> #cpu latency
>> 1   144-147 ns
>> 2   174 ns
>> 4   206 ns
>> 8   234 ns
>>
>> That single cpu figure is pretty ugly bad if i may say so.
>>
>> All kinds of numa calls just didn't help a thing. I've tried for example:
>>
>>   if(numa_available() < 0 ) {
>>     setitnuma = 0;
>>   }
>>   else {
>>     int i,back;
>>     nodemask_t nt,rnm;
>>     maxnodes = numa_max_node()+1; // () returns 3 when 4 controllers
>>     printf("numa=%i maxnodes=%i\n",setitnuma,maxnodes);
>>
>>     // a nodemask_t is a bitmask with one bit per node, so test and clear
>>     // bits with the nodemask_*() helpers instead of indexing nt.n[] by node
>>     nt = numa_get_interleave_mask();
>>     for( i = 0 ; i < maxnodes ; i++ ) {
>>       printf("node = %i mask = %i\n",i,nodemask_isset(&nt,i));
>>       nodemask_clr(&nt,i);
>>     }
>>     numa_set_interleave_mask(&nt); // empty mask: interleaving off
>>     nt = numa_get_interleave_mask();
>>     for( i = 0 ; i < maxnodes ; i++ )
>>       printf("checking memory interleave node = %i mask = %i\n",i,nodemask_isset(&nt,i));
>>
>>     rnm = numa_get_run_node_mask();
>>     // print the first word of the mask; a nodemask_t cannot be passed to %i
>>     printf("numa get run node mask = %lx\n",rnm.n[0]);
>>     back = numa_run_on_node(0);
>>     if( !back )
>>       printf("set to run on node 0\n");
>>     else
>>       printf("failed to set run on node 0\n");
>>
>>   }
>>
>> Whatever i try, single cpu latency stays at 144-147 ns.
>>
>> A dual opteron dual core with 2.2Ghz dual core controllers shows similar
>> latencies. 200 ns for example when running 4 processes with the same
>> testprogram.
>>
>> This single cpu latency behaviour of dual core opteron is ugly bad
>> compared to other dual opterons which are not dual core.
>>
>> Nearly identical Tyan mainboard with dual opteron 2.2Ghz gives single cpu
>> with SAME kernel, with SAME program 115 ns latency. When turning off ECC at
>> that dual opteron it gets down to 113 ns even.
>>
>> The frustrating thing is, the dual opteron 2.2Ghz has pc2700,
>> whereas the quad opteron dual core has all banks filled
>> with pc3200 registered ram, a-brand.
>>
>> Vincent
>
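
The measurement program behind the latency figures quoted above is not shown in the thread. What follows is only a minimal sketch of one common way to take such a measurement, a dependent random pointer chase through a large buffer, and not the author's actual code. All names (`build_chain`, `chase`, `measure_latency_ns`) are hypothetical, the 8-byte read assumes a 64-bit `size_t`, and `clock()` is used purely for portability.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift generator so the sketch does not depend on RAND_MAX. */
static uint64_t rng_state = 88172645463325252ULL;
static uint64_t next_random(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Fill chain[0..n-1] so that following chain[i] visits every slot exactly
   once before returning to the start (Sattolo's algorithm: a single cycle). */
void build_chain(size_t *chain, size_t n) {
    assert(n > 1);
    for (size_t i = 0; i < n; i++)
        chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random() % i);   /* 0 <= j < i */
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }
}

/* Each load depends on the previous one, so the loads cannot overlap. */
size_t chase(const size_t *chain, size_t start, size_t steps) {
    size_t p = start;
    while (steps--)
        p = chain[p];
    return p;
}

/* Average nanoseconds per dependent load over a buffer of `bytes` bytes.
   Returns a negative value if the buffer is too small or allocation fails. */
double measure_latency_ns(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    if (n < 2 || steps == 0)
        return -1.0;
    size_t *chain = malloc(n * sizeof *chain);
    if (chain == NULL)
        return -1.0;
    build_chain(chain, n);
    clock_t t0 = clock();
    volatile size_t sink = chase(chain, 0, steps); /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;
    free(chain);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

For example, `measure_latency_ns((size_t)250 << 20, 10000000)` would approximate the 250MB buffer mentioned in the thread. Absolute numbers depend on the machine, page size and compiler flags, so results from this sketch are not directly comparable to the figures quoted here.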
>
>--
>Dr Stuart Midgley
>sdm900 at gmail.com
>
>