[Beowulf] Re: Strange hardware? problems

David Mathog mathog at caltech.edu
Fri Apr 27 12:33:24 PDT 2007


Orion Poplawski <orion at cora.nwra.com> wrote:
> 
> We've got two pairs of identical machines:
> 
> - 2 Tyan S2882 dual processor Opteron 244 stepping 10
> - 2 Tyan S2882-D dual processor dual core Opteron 275 stepping 2
> 
> We have two (relatively complicated) numerical models (RAMS and a 
> homegrown one) that will blow up in random locations on the 244 machines 
> but run fine on the 275 machines.

Since the same code behaves differently on two different Opteron
models, the cause is probably either a memory access bug or a compiler
flag that enables a feature present on one model but not on the other,
for instance SSE3 vs. SSE2, although I don't know enough about these
models to tell you which flag is the most likely culprit.  (The fact
that it runs OK on the newer one and blows up on the older one is
consistent with this type of error.)
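
One quick way to compare the two machine types, sketched here under the
assumption that they run Linux: the kernel lists each CPU's feature
flags in /proc/cpuinfo, where SSE3 appears under the name "pni".  The
has_flag helper below is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# has_flag FLAG [FILE]: succeed if FLAG appears as a whole word on a
# "flags" line of FILE (default: /proc/cpuinfo).
# Hypothetical helper, not part of any standard tool.
has_flag() {
    grep '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw -- "$1"
}

# Report SSE2 and SSE3 (listed as "pni") on the machine running this.
for f in sse2 pni; do
    if has_flag "$f"; then
        echo "$f: present"
    else
        echo "$f: absent"
    fi
done
```

Running it on one 244 node and one 275 node and comparing the output
shows whether a build that targets SSE3 could be the difference.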

Assuming gcc, recompile with:

  -O0 -g -std=c99 -Wall

and fix the warnings until you get a clean build.  Repeat with -O2
and -O3, since gcc does more data-flow analysis when optimizing and
so sometimes reports problems (uninitialized variables, for example)
that it does not report at -O0.  Then run the resulting binary under
valgrind and fix any memory access errors it finds.  Valgrind can
also alert you to the use of unsupported operations.
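
A minimal sketch of the sequence above, assuming a single hypothetical
source file named model.c (a real model will have many sources and its
own makefile).  The build_cmd helper is an illustrative name; it only
prints the gcc command lines so they can be inspected before running.

```shell
#!/bin/sh
# build_cmd OPT SRC OUT: print the gcc command line for one
# optimization level.  Hypothetical helper for illustration only.
build_cmd() {
    printf 'gcc %s -g -std=c99 -Wall -o %s %s\n' "$1" "$3" "$2"
}

# One build per optimization level, each with its own output name.
# model.c is a placeholder; substitute the real source file.
for opt in -O0 -O2 -O3; do
    build_cmd "$opt" model.c "model$opt"
done
```

Pipe the output to sh to run the builds, then run each binary under
valgrind, for example "valgrind ./model-O0".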

If this code is linked to shared libraries those libraries might
also be the source of this problem.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


