[Beowulf] Multisocket mainboard hardware problems

Bruno Coutinho coutinho at dcc.ufmg.br
Thu Jan 15 15:28:22 PST 2009


Some common CPU tests:

- linpack
- mprime: http://www.mersenne.org/freesoft/
- compile a kernel

Linpack and mprime are great for CPU burn-in tests.
Mprime has an option to verify results, so you can detect arithmetic errors,
and there's an option for testing your machine without joining the grid.
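
To illustrate the idea of verifying results, here is a minimal Python sketch.
It is not how mprime or linpack work internally, just the same principle in
miniature: run an identical deterministic computation pinned to each CPU in
turn and flag any CPU whose answer differs from the majority. The names
workload, check_cpus and disagreeing are made up for this example, and the
pinning assumes Linux (os.sched_setaffinity); elsewhere it falls back to a
single unpinned run.

```python
import hashlib
import os
from collections import Counter


def workload(seed, rounds=20000):
    """Deterministic integer and floating-point work; returns a hex digest."""
    h = hashlib.sha256()
    x = seed
    f = 1.0 + seed * 1e-3
    for _ in range(rounds):
        # 64-bit linear congruential step (integer arithmetic)
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        # simple floating-point recurrence, kept bounded by the modulo
        f = (f * 1.0000001 + 0.5) % 1000.0
        h.update(x.to_bytes(8, "little"))
        h.update(repr(f).encode())
    return h.hexdigest()


def check_cpus(seed=42, rounds=20000):
    """Run the same workload once per allowed CPU; return {cpu: digest}."""
    if not hasattr(os, "sched_setaffinity"):
        # Assumption: no affinity support on this platform, so no pinning.
        return {None: workload(seed, rounds)}
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin this process to one CPU
            results[cpu] = workload(seed, rounds)
    finally:
        os.sched_setaffinity(0, original)  # restore the original CPU set
    return results


def disagreeing(results):
    """Return the CPUs whose digest differs from the most common digest."""
    if not results:
        return []
    majority = Counter(results.values()).most_common(1)[0][0]
    return [cpu for cpu, digest in results.items() if digest != majority]


if __name__ == "__main__":
    bad = disagreeing(check_cpus())
    print("all CPUs agree" if not bad else "mismatch on CPUs: %s" % bad)
```

A single short pass proves very little, and pure Python does not load the
CPU anywhere near as hard as linpack or mprime, so treat this only as a way
to see which core disagrees once a real burn-in test has reported an error.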



2009/1/15 Jon Aquilina <eagles051387 at gmail.com>

> Try running memtest86+. It is a CD that you boot from and that tests the
> memory. Leave it running for a few hours to find out whether it is the RAM
> or the sockets. I am not sure how to test the CPU.
>
> On Tue, Jan 13, 2009 at 10:26 AM, Francesco Pietra <
> francesco.pietra at accademialucchese.it> wrote:
>
>> Hi:
>>
>> I am posting here following a suggestion on the Debian amd64 site. My
>> original posting to the mainboard factory/vendor in Europe only
>> resulted in unhelpful suggestions, and then they stopped answering.
>>
>> My question is directed to users familiar with multisocket UMA-type
>> mainboards based on dual-core AMD Opteron 875 CPUs. My own is a
>> Supermicro H8QC8 with chipset nVidia CK804 and AMD 8132, driven
>> by Debian Linux amd64 lenny.
>>
>> One of the CPUs has suddenly lost visibility of its 4-slot memory bank
>> (I shut down the machine in an orderly way, and the problem arose on the
>> next Linux boot). Still, the CPU cores are OK, the HyperTransport links
>> are fully working, and parallelization for both Amber 10 and NWChem 5.1
>> is fully available, but one of the CPUs must be slower, having to borrow
>> memory from the other banks. The hardware status, after a period of
>> complete darkness, is described in the attached
>> lshw_deb64_7Jan2009.txt.
>>
>> As each bank of Kingston DDR1 is filled 2+2+1+1 GB, I identified the
>> faulty bank, removed all modules from there, and replaced the 1+1 GB
>> modules at another bank with the 2+2 GB from the faulty bank, so that now
>> the computer is at 20GB. The situation is described in the attached
>> lshw_deb64_lessCPU2_scrambling1G_2G_CPU4_7Jan2009.txt. Actually, the
>> identification of the CPU (CPU2) related to the faulty memory bank is
>> uncertain: I just took the CPU nearest to the faulty bank. The
>> manual is not helpful in this regard.
>>
>> I understand that, in order to rule out non-mainboard causes, I should
>> be certain that a CPU has not lost memory control. But replacing (I
>> have one spare second-hand CPU) or swapping the CPUs is quite
>> troublesome, and risky, in my context (there is very little space
>> around the mainboard in the rack that I engineered to accept the
>> mainboard). Ventilation is excellent, however.
>>
>> Therefore, is there any software way to check whether the CPUs are fully
>> in order, including the memory controller? lshw and other software
>> provided only partial help in my hands.
>>
>> Also any other suggestion would be greatly appreciated.
>>
>> Thanks for your kind attention
>>
>> francesco pietra
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
>
> --
> Jonathan Aquilina
>