[Beowulf] How to debug slow compute node?

Andrew Latham lathama at gmail.com
Thu Aug 10 11:28:30 PDT 2017


In general, if you have a snowflake (one node that behaves unlike its
identical peers), you need to take some steps.
1. Unrack it and remove it from the population
2. Image and document the system
3. In a lab: sniff test, visual inspection, and a power-on test to confirm
the fans spin
4. Understand that it is OK for one system out of X (where X could be 1000)
to fail
5. Return the system to the rack if a drive/image replacement resolves the
issue
6. Return the system to the supplier if the above fails
7. Keep moving. Don't spend troubleshooting hours that add up to the cost of
the node, unless the capital budget is super tricky
8. Keep a regular dialog with the supplier, telling them when everything is
awesome, so they take an interest when the status changes
9. Don't troubleshoot in production, ever.
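
Step 7 is simple arithmetic, and a minimal sketch may make the cutoff
concrete. The function names and the dollar figures below are hypothetical,
chosen only for illustration; plug in your own node price and loaded labor
rate.

```python
def break_even_hours(node_cost: float, hourly_rate: float) -> float:
    """Hours of troubleshooting whose labor cost equals the node's price."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return node_cost / hourly_rate


def worth_continuing(hours_spent: float, node_cost: float,
                     hourly_rate: float) -> bool:
    """True while the labor spent so far is still below the node's price."""
    return hours_spent < break_even_hours(node_cost, hourly_rate)


# Hypothetical figures: a 4000-unit node and an 80-per-hour engineer.
print(break_even_hours(4000.0, 80.0))      # 50.0
print(worth_continuing(60, 4000.0, 80.0))  # False
```

Past the break-even point, returning the node to the supplier (step 6) is
likely the cheaper option, unless the capital budget makes a replacement hard
to obtain.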

On Thu, Aug 10, 2017 at 9:39 AM, Faraz Hussain <info at feacluster.com> wrote:

> One of our compute nodes runs ~30% slower than the others. It has the exact
> same image, so I am baffled as to why it is running slowly. I have tested OMP
> and MPI benchmarks, and everything runs slower. The CPU usage goes to 2000%,
> so all looks normal there.
>
> I thought it might have to do with CPU frequency scaling, i.e., the kernel
> changing the CPU speed depending on the workload. But we do not have that
> enabled on these machines.
>
> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to our
> other nodes. Any suggestions on what else to check? I have tried rebooting
> it.
>
> processor       : 19
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 62
> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> stepping        : 4
> cpu MHz         : 2500.098
> cache size      : 25600 KB
> physical id     : 1
> siblings        : 10
> core id         : 12
> cpu cores       : 10
> apicid          : 56
> initial apicid  : 56
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
> bogomips        : 5004.97
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
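
The frequency question in the quoted message can be checked per core rather
than from a single stanza. The following is only a rough sketch: the helper
names and the 5% tolerance are assumptions made for illustration, and the
"cpu MHz" field is a snapshot that may not reflect the clock under load on
every kernel, so it would make sense to sample it while a benchmark is
running.

```python
import re
from typing import Dict, List, Optional

# Two fields copied from the quoted snippet, used here as sample input.
SAMPLE = """\
processor       : 19
model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz         : 2500.098
"""


def core_mhz(cpuinfo_text: str) -> Dict[int, float]:
    """Map each logical processor number to its reported 'cpu MHz'."""
    speeds: Dict[int, float] = {}
    current: Optional[int] = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor stanzas
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            speeds[current] = float(value)
    return speeds


def nominal_mhz(model_name: str) -> Optional[float]:
    """Extract the rated clock from a model name such as '... @ 2.50GHz'."""
    match = re.search(r"@\s*([\d.]+)\s*GHz", model_name)
    return float(match.group(1)) * 1000 if match else None


def slow_cores(speeds: Dict[int, float], nominal: float,
               tolerance: float = 0.05) -> List[int]:
    """Cores reporting more than `tolerance` below the nominal clock."""
    floor = nominal * (1 - tolerance)
    return sorted(cpu for cpu, mhz in speeds.items() if mhz < floor)


speeds = core_mhz(SAMPLE)
print(slow_cores(speeds, 2500.0))  # [] for the stanza shown above
```

On a live node, the text would come from reading /proc/cpuinfo instead of
SAMPLE, and the same comparison could be run on a known-good node for
reference.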



-- 
- Andrew "lathama" Latham lathama at gmail.com http://lathama.com
<http://lathama.org> -