[Beowulf] How to debug slow compute node?

Robert Horton robh at dongle.org.uk
Thu Aug 10 08:16:51 PDT 2017


As John says, I'd start by checking the health of things like memory,
power supplies etc.
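
For the memory part, one quick check on Linux is the EDAC error counters,
since a DIMM throwing lots of corrected ECC errors can slow a node down
without crashing it. Below is a minimal sketch, assuming the EDAC driver is
loaded and exposes the usual sysfs files (mc*/ce_count and mc*/ue_count
under /sys/devices/system/edac/mc). The function name edac_counts is mine,
purely for illustration.

```python
import glob
import os


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the standard EDAC layout: one mcN directory per memory
    controller, each holding ce_count and ue_count files.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        values = []
        for name in ("ce_count", "ue_count"):
            with open(os.path.join(mc_dir, name)) as f:
                values.append(int(f.read().strip()))
        counts[os.path.basename(mc_dir)] = tuple(values)
    return counts
```

An empty result only means no EDAC controllers were found (for example the
driver is not loaded). It does not mean the memory is healthy. Comparing the
numbers against a known-good node is more telling than the absolute values.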

I've seen problems like this go away after a firmware update, so
I'd suggest updating the BIOS etc. if you can.

Have you tried completely removing the power for a few minutes then
booting up again?

Any idea when the problem started? I presume from the CPU it's not a
new system. What physical form is it (1U server, blade, etc.)?
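
On the CPU speed question: the cpuinfo snippet quoted below shows a single
core (processor 19) at one moment, so it can't rule out cores being clocked
down while a job is running. Below is a minimal sketch, assuming the
"cpu MHz" field reflects the current clock on your kernel (on some kernels
and drivers it just reports the nominal value). The function names and the
5% tolerance are my own choices, not anything standard.

```python
def core_mhz(cpuinfo_text):
    """Return {processor_id: MHz} parsed from /proc/cpuinfo-style text."""
    speeds = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            speeds[current] = float(value)
    return speeds


def throttled_cores(speeds, nominal_mhz, tolerance=0.05):
    """Return sorted ids of cores more than `tolerance` below nominal."""
    floor = nominal_mhz * (1.0 - tolerance)
    return sorted(cpu for cpu, mhz in speeds.items() if mhz < floor)
```

To use it, read /proc/cpuinfo while a benchmark is running, pass the text
to core_mhz(), then call throttled_cores() with nominal_mhz=2500.0 for this
CPU. Doing the same on a good node gives you something to compare against.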

Rob

On Thu, 2017-08-10 at 08:39 -0600, Faraz Hussain wrote:
> One of our compute nodes runs ~30% slower than the others. It has the
> exact same image, so I am baffled as to why it is running slowly. I have
> tested OMP and MPI benchmarks; everything runs slower. The CPU usage
> goes to 2000%, so all looks normal there.
> 
> I thought it might have to do with CPU scaling, i.e. when the kernel
> changes the CPU speed depending on the workload. But we do not have
> that enabled on these machines.
> 
> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
> our other nodes. Any suggestions on what else to check? I have tried
> rebooting it.
> 
> processor       : 19
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 62
> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> stepping        : 4
> cpu MHz         : 2500.098
> cache size      : 25600 KB
> physical id     : 1
> siblings        : 10
> core id         : 12
> cpu cores       : 10
> apicid          : 56
> initial apicid  : 56
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
> bogomips        : 5004.97
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
> 

