[Beowulf] How to debug slow compute node?
Faraz Hussain
info at feacluster.com
Thu Aug 10 11:29:46 PDT 2017
Thanks for the tips! Unfortunately, I am not seeing anything of
interest in /var/log. The mcelog service is not enabled. I do not see
anything in /proc/interrupts either.
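For reference, this is roughly how I checked it - the TRM, THR, MCE and
MCP rows at the bottom of /proc/interrupts count thermal and
machine-check events:

    grep -E 'TRM|THR|MCE|MCP|ERR' /proc/interrupts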
I will look into a full power down, memtester and a firmware update. It
is a blade. We do not have Intel Cluster Checker, but we have DRAC (Dell
Remote Access Controller). I just logged in there and everything
checks out, i.e. memory, power, etc.
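In case it helps anyone else: much of what DRAC shows is also reachable
from the OS over IPMI, assuming ipmitool is installed and the BMC
interface is up:

    ipmitool sel elist              # system event log
    ipmitool sdr type Temperature   # current thermal sensor readings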
Quoting John Hearns via Beowulf <beowulf at beowulf.org>:
> Another thing to perhaps look at: are you seeing messages about thermal
> throttling events in the system logs?
> Could that node have a piece of debris caught in its air intake?
>
> I don't think that will produce a 30% drop in performance. But I have
> caught compute nodes with pieces of packaging sucked onto the front,
> following careless people unpacking kit in machine rooms.
> (Firm rule - no packaging in the machine room. This means you)
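>
> A quick check for throttling, as a sketch (the sysfs counters exist on
> most Intel boxes with a reasonably recent kernel):
>
>     dmesg | grep -iE 'thrott|thermal'
>     grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*throttle_count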
>
> On 10 August 2017 at 17:00, John Hearns <hearnsj at googlemail.com> wrote:
>
>> ps. Also have a look at: watch cat /proc/interrupts
>> You might get a qualitative idea of a huge rate of interrupts.
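>> Something like this - the -d flag makes watch highlight counters that
>> changed between refreshes, so a runaway interrupt source stands out:
>>
>>     watch -n1 -d 'cat /proc/interrupts'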
>>
>>
>> On 10 August 2017 at 16:59, John Hearns <hearnsj at googlemail.com> wrote:
>>
>>> Faraz,
>>> I think you might have to buy me a virtual coffee. Or a beer!
>>> Please look at the hardware health of that machine, specifically the
>>> DIMMs. I have seen this before!
>>> If you have some DIMMs which are faulty and are generating ECC errors,
>>> and the mcelog service is enabled, then an interrupt is generated for
>>> every ECC event. So the system spends time servicing these interrupts.
>>>
>>> So: look in your /var/log/mcelog for hardware errors
>>> Look in your /var/log/messages for hardware errors also
>>> Look in the IPMI event logs for ECC errors: ipmitool sel elist
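>>>
>>> Concretely, something along these lines (the grep patterns are just a
>>> starting point; adjust for your distro's log layout):
>>>
>>>     grep -i 'hardware error' /var/log/mcelog
>>>     grep -iE 'edac|mce|ecc' /var/log/messages
>>>     ipmitool sel elist | grep -iE 'ecc|memory'
>>>
>>> And if the EDAC driver is loaded, the kernel keeps running
>>> corrected-error counts per memory controller in sysfs:
>>>
>>>     grep . /sys/devices/system/edac/mc/mc*/ce_count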
>>>
>>> I would also bring that node down and boot it with memtester.
>>> If there is a DIMM which is that badly faulty then memtester will
>>> discover it within minutes.
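>>>
>>> If you cannot take it down straight away, memtester can also run from
>>> userspace on a drained node - e.g. lock and test 8 GB for three
>>> passes (it only tests memory it can allocate, so the offline boot-time
>>> test is more thorough):
>>>
>>>     memtester 8G 3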
>>>
>>> Or it could be something else - in which case I get no coffee.
>>>
>>> Also, Intel Cluster Checker is intended to deal with exactly these
>>> situations.
>>> What is your cluster manager, and is Intel Cluster Checker available
>>> to you? I would seriously look at getting this installed.
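>>>
>>> From memory the basic invocation is roughly the below, but the CLI has
>>> changed across releases, so treat this as a sketch and check the docs
>>> for your version:
>>>
>>>     clck -f nodefile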
>>>
>>>
>>> On 10 August 2017 at 16:39, Faraz Hussain <info at feacluster.com> wrote:
>>>
>>>> One of our compute nodes runs ~30% slower than the others. It has the
>>>> exact same image, so I am baffled why it is running slow. I have
>>>> tested OpenMP and MPI benchmarks, and everything runs slower. The CPU
>>>> usage goes to 2000%, so all looks normal there.
>>>>
>>>> I thought it may have to do with CPU frequency scaling, i.e. when the
>>>> kernel changes the CPU speed depending on the workload. But we do not
>>>> have that enabled on these machines.
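>>>>
>>>> As a quick sanity check that scaling really is off (the cpufreq
>>>> directory may be absent entirely when no scaling driver is loaded):
>>>>
>>>>     cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>>>>     grep MHz /proc/cpuinfo | sort | uniq -c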
>>>>
>>>> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
>>>> our other nodes. Any suggestions on what else to check? I have tried
>>>> rebooting it.
>>>>
>>>> processor : 19
>>>> vendor_id : GenuineIntel
>>>> cpu family : 6
>>>> model : 62
>>>> model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>>>> stepping : 4
>>>> cpu MHz : 2500.098
>>>> cache size : 25600 KB
>>>> physical id : 1
>>>> siblings : 10
>>>> core id : 12
>>>> cpu cores : 10
>>>> apicid : 56
>>>> initial apicid : 56
>>>> fpu : yes
>>>> fpu_exception : yes
>>>> cpuid level : 13
>>>> wp : yes
>>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>>>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>>>> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
>>>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
>>>> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
>>>> bogomips : 5004.97
>>>> clflush size : 64
>>>> cache_alignment : 64
>>>> address sizes : 46 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>
>>>
>>>
>>