[Beowulf] Performance issue - CPU Intel 00/02
Bill Wichser
bill at Princeton.EDU
Wed Jul 20 06:48:08 PDT 2005
System: 128 node Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
Problem: Performance drops to one third about 60 minutes after a
reload/reboot on a number of nodes, as measured by an xhpl run
---
On SuperBowl Sunday, the lights went out on this cluster. Ever since
that time performance has suffered. Initially, when running xhpl, there
was a 3x performance difference between a "good" node and a "bad" node.
A reboot solved the problem, or so I thought.
This summer, having more time to investigate the problem, I found that
some nodes exhibit this degradation after a power cycle while others
don't. I've used strace and ptrace, watched memory usage statistics,
etc., but the only thing that ever changed was that all of these calls
suffered a 3x performance hit on a bad node.
At first I thought it might be cooling, knowing that these Intel
processors throttle down when they reach a set temperature. But
watching the temperatures revealed that all nodes were effectively
running the same way. And once performance dropped on a node, it never
returned to normal.
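
A minimal sketch of how the drop might be caught as it happens, rather
than by a full xhpl run: time a fixed CPU-bound loop right after boot
and again at intervals, then compare. The function names, the loop body
and the 2.0 threshold are all illustrative assumptions, and the code
assumes a current Python 3 interpreter, not the one shipped with RedHat
8.0.

```python
import time


def time_fixed_work(iterations=200_000):
    """Time a fixed CPU-bound loop and return the elapsed seconds.

    The workload is arbitrary (hypothetical); it only needs to do the
    same amount of work on every call.
    """
    start = time.perf_counter()
    acc = 0.0
    for i in range(1, iterations + 1):
        acc += 1.0 / (i * i)
    return time.perf_counter() - start


def slowdown(baseline_s, current_s):
    """Ratio of the current timing to the baseline timing."""
    return current_s / baseline_s


def is_degraded(baseline_s, current_s, factor=2.0):
    """True if the current run took at least `factor` times the baseline.

    factor=2.0 is an assumed threshold, chosen to sit below the 3x
    difference described above while leaving room for timing noise.
    """
    return slowdown(baseline_s, current_s) >= factor
```

Taking the baseline shortly after a reboot and repeating the probe
every few minutes would give a rough timestamp for when a node goes
bad, which could then be lined up against temperature or kernel logs.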
By accident I discovered that 50 of these 128 nodes show a strange
value for model name in /proc/cpuinfo. A good node reports itself as
"Intel(R) Pentium(R) 4 CPU 2.40GHz" while a "bad" node reports "00/02".
Yet the BIOS on those nodes identifies the CPU correctly as Intel(R)
Pentium(R) 4 CPU 2.40GHz. I believe all the nodes have the same BIOS
configuration, although I neglected to record the BIOS level this last
go-round.
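
The good/bad split by model name could be checked across the cluster
with something like the sketch below. It only parses text in the
/proc/cpuinfo format; the function names are hypothetical, and the
expected string is the one quoted above with runs of whitespace
collapsed, since the kernel may pad the name with extra spaces.

```python
EXPECTED_MODEL = "Intel(R) Pentium(R) 4 CPU 2.40GHz"


def model_names(cpuinfo_text):
    """Return every 'model name' value found in /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            # Collapse internal whitespace so padding does not matter.
            names.append(" ".join(value.split()))
    return names


def classify(cpuinfo_text, expected=EXPECTED_MODEL):
    """Label a node 'good', 'bad', or 'unknown' from its cpuinfo text."""
    names = model_names(cpuinfo_text)
    if not names:
        return "unknown"
    return "good" if all(n == expected for n in names) else "bad"


def classify_local(path="/proc/cpuinfo"):
    """Classify the node this script runs on (Linux only)."""
    with open(path) as f:
        return classify(f.read())
```

Running classify_local() on each node, by whatever remote-execution
tool the cluster already uses, would produce a list of bad nodes to
compare against the xhpl results.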
Now I'm stuck. I don't know how to proceed. I see the symptom but
find it hard to believe that 40% of the CPUs have somehow become
defective. Yet the software is all the same, and reloads on a good node
or a bad node produce no changes whatsoever. Only a reboot of a bad
node seems to cure the performance problem, and then only for a short
time.
My next step will be to swap two CPUs, one from a known good into a
known bad and see if anything changes. But before I go that route I
just wanted to ask the advice of this group, hoping that someone might
have seen this before and offer a solution.
Thanks,
Bill