<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style></head><body lang=en-NL link=blue vlink="#954F72"><div class=WordSection1><p class=MsoNormal><span lang=EN-GB>Ten euros for me on a faulty DIMM<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Sent from <a href="https://go.microsoft.com/fwlink/?LinkId=550986">Mail</a> for Windows 10</p><p class=MsoNormal><o:p> </o:p></p><div style='mso-element:para-border-div;border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal style='border:none;padding:0cm'><b>From: </b><a href="mailto:andrew.holway@gmail.com">Andrew Holway</a><br><b>Sent: </b>Thursday, 10 August 2017 20:05<br><b>To: </b><a href="mailto:gus@ldeo.columbia.edu">Gus Correa</a><br><b>Cc: </b><a href="mailto:beowulf@beowulf.org">Beowulf Mailing List</a><br><b>Subject: </b>Re: [Beowulf] How to debug slow compute node?</p></div><p class=MsoNormal><o:p> </o:p></p><div><p class=MsoNormal>I put €10 on the nose for a faulty power supply.</p><div><p class=MsoNormal><o:p> </o:p></p><div><p class=MsoNormal>On 10 August 2017 at 19:45, Gus Correa <<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>> wrote:</p><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm'><p class=MsoNormal>+ Leftover processes from previous jobs hogging resources.<br>That's relatively common.<br>That can trigger swapping, the ultimate performance killer.<br>"top" or "htop" on the node should show something.<br>(Will go away with a reboot, of course.)<br><br>Less likely, but possible:<br><br>+ Different BIOS configuration w.r.t. 
the other nodes.<br><br>+ Poorly seated memory, IB card, etc., or cable connections.<br><br>+ IPMI may need a hard reset.<br>Power down, remove the power cable, wait several minutes,<br>put the cable back, power on.<br><br>Gus Correa<br><br>On 08/10/2017 11:17 AM, John Hearns via Beowulf wrote:</p><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm'><p class=MsoNormal>Another thing to perhaps look at. Are you seeing messages about thermal throttling events in the system logs?<br>Could that node have a piece of debris caught in its air intake?<br><br>I don't think that will produce a 30% drop in performance. But I have caught compute nodes with pieces of packaging sucked onto the front,<br>following careless people unpacking kit in machine rooms.<br>(Firm rule - no packaging in the machine room. This means you)<br><br><br><br><br>On 10 August 2017 at 17:00, John Hearns <<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a> <mailto:<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>>> wrote:<br><br> ps. Look at watch cat /proc/interrupts also<br> You might get a qualitative idea of a huge rate of interrupts.<br><br><br> On 10 August 2017 at 16:59, John Hearns <<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a><br> <mailto:<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>>> wrote:<br><br> Faraz,<br> I think you might have to buy me a virtual coffee. Or a beer!<br> Please look at the hardware health of that machine. Specifically<br> the DIMMs. I have seen this before!<br> If you have some DIMMs which are faulty and are generating ECC<br> errors, then if the mcelog service is enabled<br> an interrupt is generated for every ECC event. 
So the system is<br> spending time servicing these interrupts.<br><br> So: look in your /var/log/mcelog for hardware errors<br> Look in your /var/log/messages for hardware errors also<br> Look in the IPMI event logs for ECC errors: ipmitool sel elist<br><br> I would also bring that node down and boot it with memtester.<br> If there is a DIMM which is that badly faulty then memtester<br> will discover it within minutes.<br><br> Or it could be something else - in which case I get no coffee.<br><br> Also Intel Cluster Checker is intended to deal with exactly these<br> situations.<br> What is your cluster manager, and is Intel Cluster Checker<br> available to you?<br> I would seriously look at getting this installed.<br><br><br><br><br><br><br><br> On 10 August 2017 at 16:39, Faraz Hussain <<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a></p><div><div><p class=MsoNormal> <mailto:<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a>>> wrote:<br><br> One of our compute nodes runs ~30% slower than others. It<br> has the exact same image so I am baffled why it is running<br> slow. I have tested OMP and MPI benchmarks. Everything runs<br> slower. The cpu usage goes to 2000%, so all looks normal there.<br><br> I thought it may have to do with cpu scaling, i.e. when the<br> kernel changes the cpu speed depending on the workload. But<br> we do not have that enabled on these machines.<br><br> Here is a snippet from "cat /proc/cpuinfo". Everything is<br> identical to our other nodes. Any suggestions on what else<br> to check? 
I have tried rebooting it.<br><br> processor : 19<br> vendor_id : GenuineIntel<br> cpu family : 6<br> model : 62<br> model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz<br> stepping : 4<br> cpu MHz : 2500.098<br> cache size : 25600 KB<br> physical id : 1<br> siblings : 10<br> core id : 12<br> cpu cores : 10<br> apicid : 56<br> initial apicid : 56<br> fpu : yes<br> fpu_exception : yes<br> cpuid level : 13<br> wp : yes<br> flags : fpu vme de pse tsc msr pae mce cx8 apic<br> sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr<br> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm<br> constant_tsc arch_perfmon pebs bts rep_good xtopology<br> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl<br> vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2<br> x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand<br> lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi<br> flexpriority ept vpid fsgsbase smep erms<br> bogomips : 5004.97<br> clflush size : 64<br> cache_alignment : 64<br> address sizes : 46 bits physical, 48 bits virtual<br> power management:<br><br><br><br> _______________________________________________<br> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a></p></div></div><p class=MsoNormal style='margin-bottom:12.0pt'> <mailto:<a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a>> sponsored by Penguin Computing<br> To change your subscription (digest mode or unsubscribe)<br> visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br> <<a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a>></p></blockquote></blockquote></div></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><o:p> </o:p></p></div></body></html>