Opinion/experience with Intel 845E nodes?

Robert G. Brown rgb at phy.duke.edu
Fri Aug 2 09:04:57 PDT 2002


OK, I have time to do a bit more detail on the 845G/E question.

Before addressing the below, I should note in response to the original
845E question that tomshardware has a review of both 845G and 845E,
gives thumbs up to G and thumbs down to E.  For what it's worth.  There
are also two G versions, and from the look of things the GL is "G light"
and probably to be avoided, although there are some lovely micro-ATX
motherboards that might work for some people.

On Fri, 2 Aug 2002, Ferdinand Geier wrote:

> I've also a 845G board in my Fujitsu-Siemens box, but it does not correctly 
> detect the IDE controller:
> 
> <6>Uniform Multi-Platform E-IDE driver Revision: 6.31
> <4>ide: Assuming 33MHz system bus speed for PIO modes; override with 
> idebus=xx
> <4>PCI_IDE: unknown IDE controller on PCI bus 00 device f9, VID=8086, 
> DID=24cb
> <3>PCI: Device 00:1f.1 not available because of resource collisions
> <4>PCI_IDE: chipset revision 1
> <4>PCI_IDE: not 100% native mode: will probe irqs later
> <4>    ide0: BM-DMA at 0x2800-0x2807, BIOS settings: hda:DMA, hdb:DMA
> <4>    ide1: BM-DMA at 0x2808-0x280f, BIOS settings: hdc:DMA, hdd:DMA
> <4>hda: MAXTOR 6L080L4, ATA DISK drive
> <4>hdb: MAXTOR 6L080L4, ATA DISK drive
> <4>hdc: IDE-CD R/RW 24x12A, ATAPI CD/DVD-ROM drive
> <4>hdd: LITEON DVD-ROM LTD163, ATAPI CD/DVD-ROM drive

Mine does the same thing (with 2.4.18-5) but, as with yours, the
IDE-ATAPI driver still works and AFAICT the system is stable.  There are
certainly other messages of interest at boot.  The PCI bridge isn't
correctly identified early on, but the assumption of transparency seems
to "work" (and this happens on a lot of boards, e.g. Tyan 2466).  I'd
guess that these details (of identification and minor function tweaks)
will be straightened out by the next major stable kernel release, as
this is going to
be a popular motherboard -- cheap, fast, and with damn near the whole
computer (sound, video, network) on the motherboard.  If I weren't
buried up to my nether region in scaly reptilians with sharp teeth I'd
turn the kernel list back on and check to see if it is already fixed in
bleeding edge snapshots and if not help out.  Alas (ouch), I cannot
manage that (let go, dammit!).

> With a stock 2.2.18 kernel DMA could be enabled, but the SuSE kernel 
> refused to do so. Maybe the chip is too new...

There is also a small chance that it is a bios issue, although I didn't
spend much time messing with it to find out.  Reading the manual, for
example (shudder:-) I see that it is quite possible that this
motherboard comes with APIC disabled by default and I see no APIC
interaction at boot time.  If I actually had a monitor plugged into mine
I'd even reboot to find out.

It has an onboard i82562ET LAN chip (hanging off Intel's new ICH4
south bridge) that works with the eepro100 driver.  It does WOL and
ACPI but alas, nothing I can find mentions PXE.  It would really suck
to have to add a redundant NIC just to get PXE on a node, especially
when (lacking 64/66 PCI, at least on the implementation I have) it
isn't going to be suitable for a gigE or Myrinet node in most cases --
embarrassingly parallel (EP) to coarse-grained parallel only.

Note the following benchmarks:

r00 is a Tyan 2466 with 1900+MP Athlons.

rgb at r00|T:105>cpu_rate -t 1 -s 1000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4194304
# Time(sec): 3.13821554e-09 +/- 2.32249621e-12
#
========================================================================
# Timing test 1
# Time(sec): 1.47502246e-05 +/- 3.40095362e-08
# Samples = 100  Loop iterations per sample = 1024
#========================================================================
# Vector Double Precision Float averaged over four operations:
#    d[i] = (ad + d[i])*(bd - d[i])/d[i]
#    with d[i] = ad = bd =     3.141593
#    and vector size = 1000 (8000 bytes)
# Average Time:   3.69 nanoseconds
# BogomegaRate: 271.24 megafloats per second
rgb at r00|T:106>cpu_rate -t 1 -s 10000000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4194304
# Time(sec): 3.15272570e-09 +/- 3.90137720e-12
#
========================================================================
# Timing test 1
# Time(sec): 2.47412485e-01 +/- 3.04261604e-05
# Samples = 100  Loop iterations per sample = 2
#========================================================================
# Vector Double Precision Float averaged over four operations:
#    d[i] = (ad + d[i])*(bd - d[i])/d[i]
#    with d[i] = ad = bd =     3.141593
#    and vector size = 10000000 (80000000 bytes)
# Average Time:   6.19 nanoseconds
# BogomegaRate: 161.67 megafloats per second
50.670user 0.090sys 91.9%, 0ib 0ob 0tx 0da 0to 0swp 0:55.23

Note the strong performance differential between running in cache (-s
1000) and out of cache in main memory (-s 10^7, i.e. an 8x10^7-byte
vector).

rgb at lucifer2|T:123>cpu_rate -t 1 -s 1000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4194304
# Time(sec): 3.32885504e-09 +/- 1.54754165e-14
#
========================================================================
# Timing test 1
# Time(sec): 2.38647266e-05 +/- 6.61387967e-10
# Samples = 100  Loop iterations per sample = 512
#========================================================================
# Vector Double Precision Float averaged over four operations:
#    d[i] = (ad + d[i])*(bd - d[i])/d[i]
#    with d[i] = ad = bd =     3.141593
#    and vector size = 1000 (8000 bytes)
# Size:         1000  Vector Length (bytes):         8000
# Average Time:   5.97 nanoseconds
# BogomegaRate: 167.63 megafloats per second
rgb at lucifer2|T:123>cpu_rate -t 1 -s 10000000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 4194304
# Time(sec): 3.32885265e-09 +/- 1.04886505e-14
#
========================================================================
# Timing test 1
# Time(sec): 2.40166545e-01 +/- 4.82193280e-05
# Samples = 100  Loop iterations per sample = 2
#========================================================================
# Vector Double Precision Float averaged over four operations:
#    d[i] = (ad + d[i])*(bd - d[i])/d[i]
#    with d[i] = ad = bd =     3.141593
#    and vector size = 10000000 (80000000 bytes)
# Size:     10000000  Vector Length (bytes):     80000000
# Average Time:   6.00 nanoseconds
# BogomegaRate: 166.55 megafloats per second
49.480user 0.180sys 91.7%, 0ib 0ob 0tx 0da 0to 0swp 0:54.13

Note the nearly flat performance in and out of cache.  Odd, no?

On the other hand, stream:

r00              
# Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         605.4189       0.0265       0.0264       0.0266
Scale:        673.5707       0.0238       0.0238       0.0238
Add:          780.8441       0.0309       0.0307       0.0323
Triad:        640.4618       0.0375       0.0375       0.0376


lucifer2
# Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         993.2930       0.0162       0.0161       0.0164
Scale:       1009.9076       0.0159       0.0158       0.0159
Add:         1130.6354       0.0212       0.0212       0.0213
Triad:       1126.3845       0.0213       0.0213       0.0214

...an impressive difference.  One last benchmark I like to run is my
Monte Carlo code, at a fixed size (the only benchmark that "matters",
really:-).  I've been running this for many years and thus have an
excellent historical record of its performance on things from a
Sparcstation 1 on.  It tends to be CPU bound, not memory bound, and
generally scales well with CPU clock within a processor family.

Here I see a real anomaly:

#============================================================
# Benchmark run of On_spin3d on host ganesh (Mark III)
# CPU = 933 MHz PIII, Total RAM = 128 MB
# L = 16
# Time = 22.64user 0.00system 0:22.74elapsed
#============================================================
# Benchmark run of On_spin3d on host eve
# CPU = 800 MHz Athlon Tbird, Total RAM = 64
# L = 16
# Time = 25.870user 0.000system 0:25.895elapsed
#============================================================
# Benchmark run of On_spin3d on host lucifer (lucifer, Mark III)
# CPU = 1800 MHz P4, Total RAM = 512MB
# L = 16
# Time = 17.760user 0.020system 0:18.53elapsed
#============================================================
# Benchmark run of On_spin3d on host r00
# CPU = 1600.084 MHz Athlon (1900+MP), Total RAM = 1024MB
# L = 16
# Time = 13.160user 0.030sys 0:13.19elapsed

The Athlon scales (as expected) nearly perfectly with clock -- an 800
MHz Tbird takes twice as long as a 1600 MHz 1900+MP (pause for a Grrr at
their silly numbering scheme).  The 1800 MHz P4, on the other hand, is
only about 25% faster than the 933 MHz P3 (22.64 vs 17.76 seconds, a
ratio of 1.27 where clock alone would predict 1800/933 = 1.93)!  This
is so unbelievable that I recompiled, checked that I was using the
same sources, and ran it on a couple of P4's (one of which I didn't
configure).  It seems consistent, and regardless of what user/system
might say, wall clock does not lie.

So go figure.  The obvious moral of THIS story is assume makes an Ass
out of U and Me (as my wife the doctor likes to say).  Every P6 CPU from
the PPro through the P3, including the Celeron, scaled on this
application with clock:

# CPU = dual 200 MHz PentiumPro, Total RAM = 128 MB
# L = 16
# Time = 97.35user 0.06system 1:38.13elapsed

(933/200 = 4.665, 97.35/22.64 = 4.300, close enough for government
work:-).  The P4 does not, and it is obviously not MEMORY bound, as
the memory on lucifer2 screams (and besides, the other P4 I tested had
a different memory and motherboard altogether).

So be sure to TEST YOUR APPLICATION on ANY new CPU and do not assume
that non-application benchmarks mean a damn thing.  I don't know what
feature of the P4 is killing my code (although I'm tempted to compile
with profiling and find out), but SOMETHING it is doing scales
terribly indeed with clock relative to the earlier P6-core designs.

Nevertheless, I'd say that the 845G system works more than well enough
to be a cheap, fast compute node (especially for memory-bound problems
that can maximally benefit from its very fast memory), is "working"
well enough (despite the unknown IDE controller problem and the
as-yet-unsupported onboard video -- beyond VGA mode -- and sound) to
be a functional desktop, and is bound to be fully supported (clean
boot, functioning sound and XFree86 on the motherboard) by September
at the outside.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






More information about the Beowulf mailing list