[Beowulf] How to debug slow compute node?

mathog mathog at caltech.edu
Fri Aug 11 14:42:05 PDT 2017


> Rushat Rai wrote

> I don't know if this has been mentioned, but ECC could be slowing down
> that specific node if it has a faulty stick.

To find the bad stick one often must disable ECC; at least that was the 
case many years ago, the last time I ran into this.  With ECC enabled, 
a somewhat defective stick may still pass memtest86+, because the 
errors are corrected before the test ever sees them.  That utility 
shows whether ECC is enabled, and the ECC disable, if there is one, is 
set in the motherboard BIOS.
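
As a complement to the memtest86+ approach above (a different technique, 
not a replacement for it), the corrected-error counters that Linux 
exposes can be read while ECC stays enabled.  This is a minimal sketch, 
assuming the node runs Linux with an EDAC driver loaded for its memory 
controller; the function name read_edac_counts is made up for 
illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    Assumes a Linux kernel with an EDAC driver loaded; if the directory is
    missing, the result is simply an empty dict.
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter file absent or unreadable: skip this controller.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (corrected, uncorrected) in read_edac_counts().items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

A corrected count that is far higher on the slow node than on its 
healthy neighbors would point toward a marginal stick.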

I'm late to this thread: does this node have a local disk?  A failing 
disk can really slow things down if the device has to read the same 
block many times before it succeeds.  That usually shows up in smartctl.
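
The smartctl check can be scripted so that it runs the same way on every 
node.  This is a minimal sketch, assuming an ATA/SATA disk whose 
"smartctl -A" output uses the usual ten-column attribute table (NVMe 
devices print a different layout).  The list of watched attributes and 
the function names are my own choices for illustration.

```python
import subprocess

# Attributes whose non-zero raw value commonly accompanies read retries.
# This selection is an assumption, not an exhaustive list.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Extract raw values of the watched attributes from `smartctl -A` text."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                pass  # Raw value not a plain integer: ignore it.
    return raw


def suspicious(raw):
    """Return the names of watched attributes with a non-zero raw value."""
    return sorted(name for name, value in raw.items() if value > 0)


def read_smart_attributes(device):
    """Run `smartctl -A` on a device (usually needs root) and parse it."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

For example, suspicious(read_smart_attributes("/dev/sda")) lists the 
attributes worth a closer look, where /dev/sda is only a placeholder 
for the node's local disk.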

What sort of network connection does it have?  Try swapping the 
cables.  Also run the network throughput test of your choice.  If the 
problem is there, those tests will reveal it.
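
Alongside a throughput test, the per-interface error counters are worth 
a look, since a bad cable or port often shows up there as receive or CRC 
errors.  This is a minimal sketch, assuming a Linux node that exposes 
/sys/class/net/<interface>/statistics; the choice of counters and the 
interface name "eth0" are assumptions.

```python
from pathlib import Path

# Counters that tend to climb on a flaky link (an illustrative selection).
COUNTERS = ("rx_errors", "tx_errors", "rx_crc_errors", "rx_dropped", "tx_dropped")


def read_link_counters(iface, root="/sys/class/net"):
    """Return {counter: value} for one interface from the sysfs statistics."""
    stats = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        try:
            result[name] = int((stats / name).read_text().strip())
        except (OSError, ValueError):
            # Counter missing or unreadable on this system: skip it.
            continue
    return result


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the node's actual interface.
    for name, value in read_link_counters("eth0").items():
        print(f"{name}: {value}")
```

Reading the counters before and after the throughput test, and again 
after swapping cables, shows whether the errors follow the cable.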

"sensors" should show roughly the same values as on the other nodes; 
if not, figure out why.  As others have suggested, that could be 
blocked ventilation, but more often in my experience it is a fan on 
the way out.
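
Comparing one reading across nodes is easy to automate once the values 
have been collected.  This is a minimal sketch, assuming the same sensor 
(a CPU temperature or a fan speed) has already been gathered from each 
node into a dictionary; the node names, the readings, and the tolerance 
are made up for illustration.

```python
from statistics import median


def find_outliers(readings, tolerance=10.0):
    """Return nodes whose reading differs from the median by more than tolerance.

    readings maps node name -> value for one sensor, all in the same unit.
    """
    mid = median(readings.values())
    return {
        node: value
        for node, value in readings.items()
        if abs(value - mid) > tolerance
    }


if __name__ == "__main__":
    # Hypothetical CPU temperatures in degrees C, one per node.
    cpu_temps = {"n01": 52.0, "n02": 54.0, "n03": 53.0, "n04": 78.0}
    print(find_outliers(cpu_temps))
```

The median is used instead of the mean so that the one misbehaving node 
does not drag the reference value toward itself.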

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

