[Beowulf] How to debug slow compute node?
hearnsj at googlemail.com
Thu Aug 10 08:17:20 PDT 2017
Another thing to perhaps look at. Are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?
I don't think that would produce a 30% drop in performance. But I have caught
compute nodes with pieces of packaging sucked onto the front,
following careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you)
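
If the logs have already rotated, the kernel's per-CPU throttle counters are another place to look. A minimal Python sketch, assuming an Intel node whose kernel exposes a thermal_throttle directory under /sys/devices/system/cpu (not every kernel or CPU does). The function names are made up for illustration.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_file: value}} for CPUs exposing thermal_throttle.

    Assumes the usual sysfs layout, e.g.
    cpu0/thermal_throttle/core_throttle_count. Returns {} if it is absent.
    """
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/*_throttle_count"
    for counter in sorted(Path(cpu_root).glob(pattern)):
        cpu = counter.parent.parent.name
        try:
            value = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: skip it
        counts.setdefault(cpu, {})[counter.name] = value
    return counts


def throttled_cpus(counts):
    """Keep only the CPUs with at least one non-zero throttle counter."""
    return {cpu: c for cpu, c in counts.items() if any(v > 0 for v in c.values())}
```

The counters are cumulative since boot, so a non-zero value on its own proves little. Calling throttled_cpus(read_throttle_counts()) before and after a benchmark run, and seeing the numbers climb on the slow node only, would point at cooling.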
On 10 August 2017 at 17:00, John Hearns <hearnsj at googlemail.com> wrote:
> P.S. Also run: watch cat /proc/interrupts
> That should give you a qualitative idea of whether there is a huge rate
> of interrupts.
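
To put rough numbers on that instead of eyeballing it, two snapshots of /proc/interrupts can be diffed. A small Python sketch, assuming the usual layout (a header row of CPU names, then one row per interrupt source with one count per CPU). The helper names are hypothetical.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: total count summed over all CPUs}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break  # reached the description columns
            counts.append(int(token))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every source present in `after`."""
    return {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}


def busiest(rates, top=5):
    """The `top` sources with the highest rate, highest first."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:top]


def sample_rates(seconds=5.0, path="/proc/interrupts"):
    """Take two snapshots `seconds` apart and return the per-source rates."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_rates(before, after, seconds)
```

Running busiest(sample_rates()) on the slow node and on a healthy one under the same load should show whether some source (machine check rows, for example) is firing far more often on the slow node.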
> On 10 August 2017 at 16:59, John Hearns <hearnsj at googlemail.com> wrote:
>> I think you might have to buy me a virtual coffee. Or a beer!
>> Please look at the hardware health of that machine. Specifically the
>> DIMMs. I have seen this before!
>> If you have some DIMMs which are faulty and are generating ECC errors,
>> then if the mcelog service is enabled
>> an interrupt is generated for every ECC event. So the system is spending
>> time servicing these interrupts.
>> So: look in your /var/log/mcelog for hardware errors
>> Look in your /var/log/messages for hardware errors also
>> Look in the IPMI event logs for ECC errors: ipmitool sel elist
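
The IPMI check can be scripted so it is easy to run across all nodes. A rough Python sketch, assuming ipmitool is installed, the script runs with enough privilege to reach the BMC, and the event log prints "|"-separated fields with the sensor name in the fourth column. That column layout is an assumption and may differ by BMC vendor. The function names are hypothetical.

```python
import subprocess
from collections import Counter


def read_sel():
    """Run `ipmitool sel elist` and return its output as text."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ecc_events(sel_text):
    """Return the event-log lines that mention ECC (case-insensitive)."""
    return [line.strip() for line in sel_text.splitlines() if "ecc" in line.lower()]


def events_per_sensor(lines):
    """Count events per sensor, assuming the sensor is the 4th '|' field."""
    counter = Counter()
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        sensor = fields[3] if len(fields) > 3 else "unknown"
        counter[sensor] += 1
    return counter
```

events_per_sensor(ecc_events(read_sel())) gives a quick per-sensor tally. The same ecc_events filter can be pointed at the text of /var/log/messages or /var/log/mcelog, although the wording of those messages varies, so an empty result there is not proof of healthy memory.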
>> I would also bring that node down and boot it with memtester.
>> If there is a DIMM which is that badly faulty then memtester will
>> discover it within minutes.
>> Or it could be something else - in which case I get no coffee.
>> Also, Intel Cluster Checker is intended to deal with exactly this kind
>> of problem.
>> What is your cluster manager, and is Intel Cluster Checker available to
>> you?
>> I would seriously look at getting this installed.
>> On 10 August 2017 at 16:39, Faraz Hussain <info at feacluster.com> wrote:
>>> One of our compute nodes runs ~30% slower than others. It has the exact
>>> same image so I am baffled why it is running slow. I have tested OMP and
>>> MPI benchmarks. Everything runs slower. The cpu usage goes to 2000%, so all
>>> looks normal there.
>>> I thought it may have to do with cpu scaling, i.e. when the kernel
>>> changes the cpu speed depending on the workload. But we do not have that
>>> enabled on these machines.
>>> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
>>> our other nodes. Any suggestions on what else to check? I have tried
>>> rebooting it.
>>> processor : 19
>>> vendor_id : GenuineIntel
>>> cpu family : 6
>>> model : 62
>>> model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>>> stepping : 4
>>> cpu MHz : 2500.098
>>> cache size : 25600 KB
>>> physical id : 1
>>> siblings : 10
>>> core id : 12
>>> cpu cores : 10
>>> apicid : 56
>>> initial apicid : 56
>>> fpu : yes
>>> fpu_exception : yes
>>> cpuid level : 13
>>> wp : yes
>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>>> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
>>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
>>> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
>>> bogomips : 5004.97
>>> clflush size : 64
>>> cache_alignment : 64
>>> address sizes : 46 bits physical, 48 bits virtual
>>> power management:
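
Since a single processor entry cannot show whether some other core is clocked down, it may help to pull the "cpu MHz" field for every core and compare it against the nominal speed. A minimal Python sketch, assuming the /proc/cpuinfo field names shown above. The function names and the 5% tolerance are arbitrary choices for illustration.

```python
def cpu_mhz(cpuinfo_text):
    """Map processor number -> the 'cpu MHz' value reported for it."""
    speeds = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            speeds[current] = float(value)
    return speeds


def slow_cores(speeds, nominal_mhz, tolerance=0.05):
    """Return the cores reporting less than nominal_mhz minus the tolerance."""
    floor = nominal_mhz * (1 - tolerance)
    return {cpu: mhz for cpu, mhz in speeds.items() if mhz < floor}
```

For these nodes, slow_cores(cpu_mhz(open("/proc/cpuinfo").read()), 2500.0) while a benchmark is running would list any core reporting well under 2.5 GHz. The field is only a snapshot, and on some kernels it shows the nominal value rather than the live clock, so an empty result does not rule out throttling.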