[Beowulf] How to debug slow compute node?

John Hearns hearnsj at googlemail.com
Thu Aug 10 08:00:11 PDT 2017


PS. Also look at:   watch cat /proc/interrupts
That should give you a qualitative idea of whether the interrupt rate is huge.
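A minimal sketch of turning that into a number, assuming the usual /proc/interrupts layout ("IRQ: count count ... description"); the helper takes an optional file argument purely so it can be checked against a sample file:

```shell
# Sum every per-CPU counter in a /proc/interrupts-style file.
# (File argument is optional; defaults to the live /proc/interrupts.)
total_irqs() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) sum += $i }
         END { print sum + 0 }' "${1:-/proc/interrupts}"
}

# Rough interrupts-per-second figure: diff two samples one second apart.
if [ -r /proc/interrupts ]; then
    a=$(total_irqs); sleep 1; b=$(total_irqs)
    echo "interrupts/sec: $((b - a))"
fi
```

A healthy idle node should show a modest rate; a node burning time on ECC machine-check interrupts will stand out.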


On 10 August 2017 at 16:59, John Hearns <hearnsj at googlemail.com> wrote:

> Faraz,
>    I think you might have to buy me a virtual coffee. Or a beer!
> Please look at the hardware health of that machine, specifically the
> DIMMs. I have seen this before!
> If some DIMMs are faulty and are generating ECC errors, and the mcelog
> service is enabled, then an interrupt is generated for every ECC event.
> So the system is spending time servicing these interrupts.
>
> So:   look in your /var/log/mcelog for hardware errors
> Look in your /var/log/messages for hardware errors also
> Look in the IPMI event logs for ECC errors:    ipmitool sel elist
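A sketch of those three checks as one helper; log paths vary by distro (mcelog may log to the journal instead of /var/log/mcelog), and the grep pattern is an illustrative guess at the messages you would see, not an exhaustive one:

```shell
# scan_ecc: filter logs (file arguments or stdin) for ECC / machine-check
# lines, case-insensitively. The pattern is illustrative, not exhaustive.
scan_ecc() {
    grep -i -E 'hardware error|machine check|ecc|correctable' "$@"
}

# On the suspect node (some distros log MCEs to the journal instead):
#   scan_ecc /var/log/mcelog /var/log/messages
#   ipmitool sel elist | scan_ecc
```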
>
> I would also bring that node down and boot it with memtester.
> If there is a DIMM which is that badly faulty then memtester will discover
> it within minutes.
>
> Or it could be something else - in which case I get no coffee.
>
> Also, Intel Cluster Checker is intended to deal with exactly these
> situations.
> What is your cluster manager, and is Intel Cluster Checker available to
> you?
> I would seriously look at getting it installed.
>
>
> On 10 August 2017 at 16:39, Faraz Hussain <info at feacluster.com> wrote:
>
>> One of our compute nodes runs ~30% slower than the others. It has the
>> exact same image, so I am baffled as to why it is running slow. I have
>> tested OMP and MPI benchmarks. Everything runs slower. The CPU usage
>> goes to 2000%, so all looks normal there.
>>
>> I thought it might have to do with CPU frequency scaling, i.e. when the
>> kernel changes the CPU speed depending on the workload. But we do not
>> have that enabled on these machines.
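That is quick to verify directly. A hedged sketch, assuming the standard cpufreq sysfs layout (intel_pstate systems expose things slightly differently); the helper takes an optional file argument so it can be checked against a sample:

```shell
# mhz_spread: min and max "cpu MHz" values in a cpuinfo-format file.
# A large spread can mean some cores are stuck at a low frequency.
mhz_spread() {
    awk -F': *' '/^cpu MHz/ {
        v = $2 + 0
        if (minv == "" || v < minv) { minv = v; min = $2 }
        if (v > maxv)               { maxv = v; max = $2 }
    } END { print min, max }' "${1:-/proc/cpuinfo}"
}

# On the node itself:
#   mhz_spread
#   sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

If the governors are all "performance" and the MHz spread is tight, scaling is unlikely to be the culprit.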
>>
>> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
>> our other nodes. Any suggestions on what else to check? I have tried
>> rebooting it.
>>
>> processor       : 19
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 62
>> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>> stepping        : 4
>> cpu MHz         : 2500.098
>> cache size      : 25600 KB
>> physical id     : 1
>> siblings        : 10
>> core id         : 12
>> cpu cores       : 10
>> apicid          : 56
>> initial apicid  : 56
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 13
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
>> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
>> bogomips        : 5004.97
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>

