<div dir="ltr">I put €10 on the nose for a faulty power supply.<br><div class="gmail_extra"><br><div class="gmail_quote">On 10 August 2017 at 19:45, Gus Correa <span dir="ltr"><<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">+ Leftover processes from previous jobs hogging resources.<br>
That's relatively common, and it can trigger swapping, the ultimate
performance killer.
"top" or "htop" on the node should show something; there is a quick
check sketched below.
(It will go away with a reboot, of course.)

Less likely, but possible:
+ A different BIOS configuration compared with the other nodes.

+ Poorly seated memory, IB card, etc., or loose cable connections.

+ The IPMI may need a hard reset:
power down, remove the power cable, wait several minutes,
put the cable back in, and power on.
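A quick way to check the "leftover processes / swapping" point above - a
rough sketch, nothing node-specific assumed:

    # any stray processes from earlier jobs still burning CPU or memory?
    ps -eo user,pid,pcpu,pmem,etime,comm --sort=-pcpu | head -20

    # is the node dipping into swap? non-zero si/so columns in vmstat
    # mean it is actively swapping right now
    free -m
    vmstat 1 5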
Gus Correa

On 08/10/2017 11:17 AM, John Hearns via Beowulf wrote:
Another thing to perhaps look at: are you seeing messages about thermal throttling events in the system logs? (A quick grep is sketched below.)
Could that node have a piece of debris caught in its air intake?
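Something like this finds the throttling messages - a sketch that assumes a
stock syslog layout, so adjust the log path for your distro:

    dmesg | grep -i -E 'throttl|thermal'
    grep -i throttl /var/log/messages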
I don't think that alone would produce a 30% drop in performance, but I have caught compute nodes with pieces of packaging sucked onto the front,
left there by careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you.)
On 10 August 2017 at 17:00, John Hearns <hearnsj@googlemail.com> wrote:

PS: Look at "watch cat /proc/interrupts" also; see the example below.
You might get a qualitative idea of a huge rate of interrupts.
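For example (watch's -d flag highlights the counters that changed between
refreshes, which makes a runaway interrupt source easy to spot):

    watch -n1 -d cat /proc/interrupts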
On 10 August 2017 at 16:59, John Hearns <hearnsj@googlemail.com> wrote:

Faraz,
I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine, specifically
the DIMMs. I have seen this before!
If you have some DIMMs which are faulty and are generating ECC
errors, and the mcelog service is enabled, then an interrupt is
generated for every ECC event, so the system is spending time
servicing these interrupts.

So: look in your /var/log/mcelog for hardware errors.
Look in your /var/log/messages for hardware errors too.
Look in the IPMI event log for ECC errors: ipmitool sel elist
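Roughly, those checks boil down to something like this - a sketch, since
log locations and the EDAC sysfs layout vary by distro and kernel:

    grep -i -E 'hardware error|mce' /var/log/mcelog /var/log/messages
    ipmitool sel elist | grep -i -E 'ecc|memory'

    # if the EDAC driver is loaded, per-controller corrected-error
    # counts are also exposed in sysfs
    grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null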
I would also bring that node down and boot it with memtester.
If there is a DIMM which is that badly faulty then memtester
will discover it within minutes.
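If you cannot take the node offline straight away, memtester also runs from
userspace - a minimal sketch, where the 8G size and single pass are just
illustrative (run it as root so it can lock the memory it tests):

    memtester 8G 1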
Or it could be something else - in which case I get no coffee.

Also, Intel Cluster Checker is intended to deal with exactly these
situations.
What is your cluster manager, and is Intel Cluster Checker
available to you?
I would seriously look at getting it installed.
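If Cluster Checker is available, the invocation is roughly along these
lines - from memory, so treat the exact flags as an assumption and check
the tool's own help on your install:

    # collect data from the nodes listed in ./nodefile and analyze it
    clck -f nodefile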
On 10 August 2017 at 16:39, Faraz Hussain <info@feacluster.com> wrote:
One of our compute nodes runs ~30% slower than the others. It
has the exact same image, so I am baffled why it is running
slow. I have tested OMP and MPI benchmarks; everything runs
slower. The CPU usage goes to 2000%, so all looks normal there.

I thought it may have to do with CPU frequency scaling, i.e. when the
kernel changes the CPU speed depending on the workload. But
we do not have that enabled on these machines.
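For what it's worth, a quick way to confirm scaling really is out of the
picture, and to see the clocks the cores actually run at - a sketch,
assuming cpupower and turbostat are installed (they usually ship in the
distro's kernel tools packages):

    # if this file is missing, no cpufreq scaling driver is loaded at all
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    cpupower frequency-info    # driver, governor, current limits
    turbostat sleep 10         # actual per-core MHz (needs root / msr module)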
Here is a snippet from "cat /proc/cpuinfo". Everything is
identical to our other nodes. Any suggestions on what else
to check? I have tried rebooting it.

processor : 19
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
cpu MHz : 2500.098
cache size : 25600 KB
physical id : 1
siblings : 10
core id : 12
cpu cores : 10
apicid : 56
initial apicid : 56
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 5004.97
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
</span></blockquote><div class="HOEnZb"><div class="h5">
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman<wbr>/listinfo/beowulf</a><br>
</div></div></blockquote></div><br></div></div>