<div dir="ltr">I put €10 on the nose for a faulty power supply.<br><div class="gmail_extra"><br><div class="gmail_quote">On 10 August 2017 at 19:45, Gus Correa <span dir="ltr"><<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">+ Leftover processes from previous jobs hogging resources.<br>
That's relatively common, and it can trigger swapping, the ultimate
performance killer.
"top" or "htop" on the node should show something; there is a quick
check sketched below.
(It will go away with a reboot, of course.)

Less likely, but possible:
+ A different BIOS configuration compared with the other nodes.

+ Poorly seated memory, IB card, etc., or loose cable connections.

+ The IPMI may need a hard reset:
power down, remove the power cable, wait several minutes,
put the cable back in, and power on.
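A quick way to check the "leftover processes / swapping" point above - a
rough sketch, nothing node-specific assumed:

    # any stray processes from earlier jobs still burning CPU or memory?
    ps -eo user,pid,pcpu,pmem,etime,comm --sort=-pcpu | head -20

    # is the node dipping into swap? non-zero si/so columns in vmstat
    # mean it is actively swapping right now
    free -m
    vmstat 1 5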
Gus Correa

On 08/10/2017 11:17 AM, John Hearns via Beowulf wrote:
Another thing to perhaps look at: are you seeing messages about thermal throttling events in the system logs? (A quick grep is sketched below.)
Could that node have a piece of debris caught in its air intake?
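Something like this finds the throttling messages - a sketch that assumes a
stock syslog layout, so adjust the log path for your distro:

    dmesg | grep -i -E 'throttl|thermal'
    grep -i throttl /var/log/messages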
I don't think that alone would produce a 30% drop in performance, but I have caught compute nodes with pieces of packaging sucked onto the front,
left there by careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you.)
On 10 August 2017 at 17:00, John Hearns <hearnsj@googlemail.com> wrote:

PS: Look at "watch cat /proc/interrupts" also; see the example below.
You might get a qualitative idea of a huge rate of interrupts.
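For example (watch's -d flag highlights the counters that changed between
refreshes, which makes a runaway interrupt source easy to spot):

    watch -n1 -d cat /proc/interrupts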
On 10 August 2017 at 16:59, John Hearns <hearnsj@googlemail.com> wrote:

Faraz,
I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine, specifically
the DIMMs. I have seen this before!
If you have some DIMMs which are faulty and are generating ECC
errors, and the mcelog service is enabled, then an interrupt is
generated for every ECC event, so the system is spending time
servicing these interrupts.

So: look in your /var/log/mcelog for hardware errors.
Look in your /var/log/messages for hardware errors too.
Look in the IPMI event log for ECC errors: ipmitool sel elist
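Roughly, those checks boil down to something like this - a sketch, since
log locations and the EDAC sysfs layout vary by distro and kernel:

    grep -i -E 'hardware error|mce' /var/log/mcelog /var/log/messages
    ipmitool sel elist | grep -i -E 'ecc|memory'

    # if the EDAC driver is loaded, per-controller corrected-error
    # counts are also exposed in sysfs
    grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null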
I would also bring that node down and boot it with memtester.
If there is a DIMM which is that badly faulty then memtester
will discover it within minutes.
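If you cannot take the node offline straight away, memtester also runs from
userspace - a minimal sketch, where the 8G size and single pass are just
illustrative (run it as root so it can lock the memory it tests):

    memtester 8G 1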
Or it could be something else - in which case I get no coffee.

Also, Intel Cluster Checker is intended to deal with exactly these
situations.
What is your cluster manager, and is Intel Cluster Checker
available to you?
I would seriously look at getting it installed.
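If Cluster Checker is available, the invocation is roughly along these
lines - from memory, so treat the exact flags as an assumption and check
the tool's own help on your install:

    # collect data from the nodes listed in ./nodefile and analyze it
    clck -f nodefile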
On 10 August 2017 at 16:39, Faraz Hussain <info@feacluster.com> wrote:
One of our compute nodes runs ~30% slower than the others. It
has the exact same image, so I am baffled why it is running
slow. I have tested OMP and MPI benchmarks; everything runs
slower. The CPU usage goes to 2000%, so all looks normal there.

I thought it may have to do with CPU frequency scaling, i.e. when the
kernel changes the CPU speed depending on the workload. But
we do not have that enabled on these machines.
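For what it's worth, a quick way to confirm scaling really is out of the
picture, and to see the clocks the cores actually run at - a sketch,
assuming cpupower and turbostat are installed (they usually ship in the
distro's kernel tools packages):

    # if this file is missing, no cpufreq scaling driver is loaded at all
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    cpupower frequency-info    # driver, governor, current limits
    turbostat sleep 10         # actual per-core MHz (needs root / msr module)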
Here is a snippet from "cat /proc/cpuinfo". Everything is
identical to our other nodes. Any suggestions on what else
to check? I have tried rebooting it.

processor : 19
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping : 4
cpu MHz : 2500.098
cache size : 25600 KB
physical id : 1
siblings : 10
core id : 12
cpu cores : 10
apicid : 56
initial apicid : 56
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 5004.97
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
</span></blockquote><div class="HOEnZb"><div class="h5">
<br>
______________________________<wbr>_________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman<wbr>/listinfo/beowulf</a><br>
</div></div></blockquote></div><br></div></div>