[Beowulf] recommendations for cluster upgrades

Mark Hahn hahn at mcmaster.ca
Wed May 13 22:28:03 PDT 2009

>> AMD Barcelona was the first 4 flops per cycle processor from AMD, and it hit
>> the street with some problems right when the list was coming out in end of
>> 2007.
> That's interesting. What kind of "problems"?

Barcelona had a bug in its L3 TLB logic.  you can read all about it 
through google; as mentioned, it was mostly in 2H07 (the second half of 2007).
there were workarounds for this, but they cost a bit of performance.  I think
I read that amd ultimately called it a timing issue.

bugs of this sort are pretty common, though perhaps usually smaller.
both amd and intel provide pretty decent errata documents.  typically the
bug descriptions are not all that illuminating, but they do specify which
steppings are affected, and even say whether a fix is planned...

> Do CPU designers mess up and leave bugs on too?

it would be fascinating to hear how the bug escaped pre-release testing.

unquestionably, it affected the amd/intel balance of power...
I think that if amd had managed to bring out a bug-free barcelona
in mid-to-late 07, it would have put a serious crimp in intel's core2 sales.
especially if they had managed, early, to get a firm grip on PC2-6400.
not to mention 45nm.

> I heard of an old Intel floating point error
> but nothing else. Do later versions of CPUs get these bugfixes?

sure.  minor revisions are called "steppings", and they can include 
fairly significant, if incremental, improvements.
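
to check which stepping a given box actually has, you can pull the family,
model and stepping out of /proc/cpuinfo and compare them with the vendor's
errata document.  a minimal sketch, assuming x86 Linux and the usual
"key : value" layout of that file; cpu_signature is a hypothetical helper
name, not part of any library:

```python
def cpu_signature(cpuinfo_text):
    """Return (vendor, family, model, stepping) for the first CPU listed.

    Assumes the x86 Linux /proc/cpuinfo layout ("key : value" lines, one
    blank-line-separated block per logical CPU).  Any field that is
    missing comes back as None.
    """
    wanted = ("vendor_id", "cpu family", "model", "stepping")
    found = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if found:
                break  # end of the first CPU's block
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        # "model name" is a different key from "model" and is skipped here
        if sep and key in wanted and key not in found:
            found[key] = value.strip()

    def as_int(name):
        # numeric fields are decimal in /proc/cpuinfo
        return int(found[name]) if name in found else None

    return (found.get("vendor_id"), as_int("cpu family"),
            as_int("model"), as_int("stepping"))
```

on a live system you would call it as cpu_signature(open("/proc/cpuinfo").read())
and then look the resulting numbers up in the revision guide for that family.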

> It might change my perspective on the risks of going for a "brand new" CPU.

they're hardly ever really brand new.  core2 was a huge change for intel,
but you can see that it was clearly derived from the PIII->PM->core1 family
with some new features and lessons from the P4/netburst.  similarly, 
nehalem cores are pretty similar to current core2 cores (but not dual-die,
with smaller L3).  the uncore is the big change (mem controller, QPI).

it's pretty hard to second-guess the chip vendors in trying to figure out
whether a chip is worth the risk.  for instance, Intel's been demoing
versions of nehalem since fall 07, so there's been lots of testing.
vendors are still too closed-kimono for my taste, but they take it seriously.
there have been cases where the vendor replaced chips on its own
dime.  for the barcelona thing, it was pretty easy for amd to point
at low-overhead kernel workarounds instead...
