[Beowulf] lm_sensors and clusters and wrong intel cpu readings

Andrew Holway andrew.holway at gmail.com
Thu Aug 9 00:34:14 PDT 2012


On AMD sensors at least the reading is a 'relative value' with 70C
indicating an overheat.

The processor is fine unless it is actually clocking itself down to
a lower ACPI power state. It is seemingly impossible to overheat
modern CPUs. When it's clocking down you should see messages in
/var/log/messages and also in the IPMI SEL.
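
A minimal sketch of checking for those messages, in Python. The pattern
is an assumption: it matches the wording the Linux kernel's Intel
thermal-throttle code has used ("temperature above threshold, cpu clock
throttled"). AMD systems or other kernels may log something different, so
adjust it to whatever actually shows up in your logs. The function names
and the default path are only for illustration.

```python
import re

# Assumed wording of the kernel message; check your own logs and adjust.
THROTTLE_RE = re.compile(
    r"temperature above threshold|clock throttled", re.IGNORECASE
)

def find_throttle_events(lines, pattern=THROTTLE_RE):
    """Return the log lines that look like thermal throttling events."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

def scan_log(path="/var/log/messages"):
    """Scan a syslog-style file and return the matching lines."""
    # errors="replace" so stray bytes in the log do not abort the scan
    with open(path, errors="replace") as fh:
        return find_throttle_events(fh)
```

Calling scan_log() as root on a node and getting an empty list back would
suggest the CPU has not been throttling, whatever the sensor numbers say.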

Don't trust IPMI. Ever. lm_sensors actually reads the raw value from
the CPU and requires a specifically written kernel module to do so.
Who knows what kind of junk math the IPMI does.
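
For reference, what the kernel drivers behind lm_sensors expose can be
read straight from sysfs. A rough sketch, assuming the standard hwmon
layout (/sys/class/hwmon/hwmon*/temp*_input, in millidegrees Celsius).
Which files exist depends on the driver, and the helper names here are
made up for illustration. It also reports the margin to the driver's
critical limit where one is exported, which is arguably more useful than
the absolute number.

```python
import glob
import os

def read_millideg(path):
    """Read a hwmon temperature file (millidegrees C); return degrees C or None."""
    try:
        with open(path) as fh:
            return int(fh.read().strip()) / 1000.0
    except (OSError, ValueError):
        return None

def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return a list of (chip, input file, temp C, margin to crit C or None)."""
    readings = []
    for inp in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        temp = read_millideg(inp)
        if temp is None:
            continue  # unreadable or non-numeric sensor file
        chip_dir = os.path.dirname(inp)
        try:
            with open(os.path.join(chip_dir, "name")) as fh:
                chip = fh.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)
        # tempN_crit is optional; not every driver exports it
        crit = read_millideg(inp[: -len("_input")] + "_crit")
        margin = None if crit is None else crit - temp
        readings.append((chip, os.path.basename(inp), temp, margin))
    return readings

for chip, name, temp, margin in read_hwmon_temps():
    print(chip, name, temp, margin)
```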

Supermicro IPMI implementations are particularly bad at reporting
temperature properly (like really awful).

Most of my experience here is with AMD. YMMV :)



2012/8/9 John Hearns <hearnsj at googlemail.com>:
> Well, I don't use lm_sensors for a start!
> Use the ipmitool utility to probe the readings from BMC cards (ILO,
> DRAC, they're the same thing).
> I don't trust the absolute calibration of the sensors - generally
> you're looking at setting a limit on which to alarm or shutdown so
> just take a reading under no load on the CPU and call that the
> 'normal' reading.
> I may be wrong. YMMV.
>
> On 08/08/2012, Vincent Diepeveen <diep at xs4all.nl> wrote:
>> hi,
>>
>> How do you guys monitor the CPU core temperatures?
>>
>> If I run lm_sensors, it reads 30C higher on every node than on the few
>> nodes I compared under Windows.
>> Also, under full load it reports temperatures in the high 60s, and I've
>> seen up to 78C reported.
>> I'm guessing it should be 30-40+ at most.
>>
>> Cool air blows over and around the CPUs. Nothing is even 'warm'.
>>
>> Nodes here: Supermicro X7DWE with Xeon L5420s inside. They are not
>> overclocked.
>>
>> I also downloaded some definitions for similar motherboards - it seems
>> they were uploaded for motherboards with dual-core Xeons
>> and such, not for the quad-cores. None of those definitions 'corrects'
>> the temperature of the quad-core Xeons; they basically kick out
>> readings that are not getting used.
>>
>> Now I bet several clusters/supercomputers had these CPUs. How did
>> you solve this problem with the Intel L5420s?
>>
>> Maybe someone still has the lm_sensors script lying around somewhere
>> that fixes it for the Intel Xeons?
>>
>> Thanks in advance,
>> Vincent
>>
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>


