[Beowulf] Performance issue - CPU Intel 00/02

Bill Wichser bill at Princeton.EDU
Wed Jul 20 06:48:08 PDT 2005


System: 128 node Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel  2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)

Problem: On a number of nodes, performance drops to one third 60 minutes 
after a reload/reboot, as measured by an xhpl run

---

On Super Bowl Sunday, the lights went out on this cluster.  Ever since 
that time performance has suffered.  Initially, when running xhpl, there 
was a 3x performance difference between a "good" node and a "bad" node. 
A reboot solved the problem, or so I thought.

This summer, having more time to investigate the problem, I found that 
some nodes exhibit this degradation after a power cycle while others 
do not.  I've used strace and ptrace, watched memory usage statistics, 
etc., but the only thing that ever changed was that all of these calls 
suffered a 3x performance hit on a bad node.

At first I thought it might be cooling, knowing that these Intel 
processors throttle down when they reach a set temperature.  But watching 
the temperatures revealed that all nodes were running at effectively the 
same temperature.  And once performance dropped on a node, it never 
returned to normal.

By accident I discovered that 50 of these 128 nodes show a strange 
value for model name in /proc/cpuinfo.  A good node reports itself as 
"Intel(R) Pentium(R) 4 CPU 2.40GHz" while a "bad" node reports itself as 
"00/02".  Yet when I check the BIOS on those nodes, the CPU is identified 
correctly as Intel(R) Pentium(R) 4 CPU 2.40GHz.  I believe all the nodes 
have the same BIOS configuration, although I neglected to gather the 
BIOS level this last go round.
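
For anyone wanting to run the same check across a cluster, here is a 
minimal sketch in Python of the /proc/cpuinfo comparison described above. 
The function names and the EXPECTED_MODEL constant are illustrative 
assumptions, not part of any existing tool, and the expected string is 
simply the one a good node reports here.

```python
# Sketch only: flag a node whose /proc/cpuinfo "model name" is not the
# expected string. Names below are hypothetical, chosen for illustration.
EXPECTED_MODEL = "Intel(R) Pentium(R) 4 CPU 2.40GHz"


def model_names(cpuinfo_text):
    """Return the 'model name' value of every processor entry."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            # Collapse runs of whitespace so padding does not matter.
            names.append(" ".join(value.split()))
    return names


def is_suspect(cpuinfo_text, expected=EXPECTED_MODEL):
    """True if no model name is found or any entry differs from expected."""
    names = model_names(cpuinfo_text)
    return not names or any(name != expected for name in names)
```

On a node, the text could be obtained with open("/proc/cpuinfo").read() 
and passed to is_suspect(); how the result is gathered from all 128 nodes 
(rsh, ssh, a batch job) is left open.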

Now I'm stuck.  I don't know how to proceed.  I see the symptom but 
find it hard to believe that 40% of the CPUs have suddenly become 
defective.  Yet the software is all the same, and reloads on a good node 
or a bad node produce no changes whatsoever.  Only a reboot of a bad 
node seems to cure the performance problem, and then only for a short 
time.

My next step will be to swap two CPUs, moving one from a known good node 
into a known bad node, and see if anything changes.  But before I go that 
route I wanted to ask the advice of this group, hoping that someone has 
seen this before and can offer a solution.

Thanks,

Bill


