Performance Variations using MPI/Myrico

James O'Connor oconnor at
Fri Apr 27 10:04:35 PDT 2001

Steffen Persvold wrote:
>Patrick Geoffray wrote:
>> Steffen Persvold wrote:
>> > Hmm, the NAS application runs in userspace, and since this inner loop
>> > (FFT code) runs without any communication with other nodes, why would an
>> > SSE-patched kernel improve its memcpy performance? I would expect that
>> > the memcpy calls in the FFT code were either inlined by the compiler or
>> > made as calls to libc's memcpy. That shouldn't involve any
>> > system (kernel) time at all, right?
>> Hi Steffen,
>> Yes, the NAS FT code does not call "memcpy()". The copy
>> step of the FFT is explicit (a loop of assignments), and the PGI compiler
>> is smart enough to use SSE prefetching to optimize this part of the code
>> if SSE is available. But without a specific patch, the Linux kernel does
>> not enable SSE support (basically, the kernel has to save the FP and
>> the SSE registers during context switching), so the SSE optimization for
>> PIII from PGI is useless. Now I am wondering whether compiling with
>> -Mvect=sse or -Mvect=prefetch with pgf90 WITHOUT SSE support enabled
>> in the kernel is the source of this instability.
>Actually, running SSE code (involving any SSE "mov" instructions) on a kernel
>which doesn't save the SSE registers between context switches would result in a
>segmentation fault.
>I have learned this the hard way:
>The original RH6.2 kernel (2.2.14-5.0) had PIII support and therefore saved the
>SSE registers, but when RH released a kernel update because they experienced
>data loss during context switches (RHBA-2000:013-01), I upgraded to
>2.2.14-6.0.1. This kernel, however, did not have SSE support enabled, and my
>hand-coded SSE routines suddenly caused a segmentation fault.

There are two parts to having SSE support in the kernel. The first is
setting the OSFXSR bit in CR4, which tells the processor that the OS
supports an SSE context save. The second is actually doing the SSE
context save. If the bit in CR4 isn't set, the SSE instructions
(excluding sfence, etc.) are treated as unsupported: the processor
raises an invalid-opcode exception, which may well be the source of
your segmentation fault. If the flag is set but the OS doesn't really
save the context correctly, you may still be able to run successfully
as long as your program stays on the same processor and no other
program is using the SSE registers.

>There are, however, some SSE instructions that don't require the registers to
>be saved on a context switch (e.g. "sfence" and "prefetchnta").
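
To make that concrete, here is a small sketch of an explicit copy loop
with a non-temporal prefetch, roughly the kind of loop being discussed.
It is not the NAS FT code and not what pgf90 actually emits; the name
copy_with_prefetch() and the prefetch distance are invented for
illustration. The prefetch is only a hint and touches no XMM register,
so there is no register state for the OS to save.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))     /* no-op when not built with SSE */
#endif

#define PREFETCH_AHEAD 16       /* elements ahead; illustrative, not tuned */

/* Explicit copy (a loop of assignments) with a prefetchnta hint
 * issued a fixed distance ahead of the element being copied. */
void copy_with_prefetch(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            PREFETCH_NTA(&src[i + PREFETCH_AHEAD]);
        dst[i] = src[i];
    }
}
```

Whether a hint like this helps at all depends on the access pattern and
the cache sizes, so it would need to be measured on the machine in
question.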
>> Anyway, a 50% variation for a purely computational piece of code seems too
>> large to be explained by SSE support. SSE on the PIII is single
>> precision only, so it does not help to get more Flops. Maybe there is
>> something else in the patch that they applied; I will look at it.
>I agree.
> Steffen Persvold                        Systems Engineer
> Email  : mailto:sp at            Scali AS (
> Norway : Tel  : (+47) 2262 8950         Olaf Helsets vei 6
>          Fax  : (+47) 2262 8951         N-0621 Oslo, Norway
> USA    : Tel  : (+1) 713 706 0544       10500 Richmond Avenue, Suite 190
>                                         Houston, Texas 77042, USA

Jim O'Connor
SRC Computers, Inc.
oconnor at
