Beowulf digest, Vol 1 #967 - 7 msgs

Ravi Soundararajan ravi at angstrom.com
Wed Jul 31 12:56:37 PDT 2002


We have seen a large number of issues with the Gigabyte dual Athlon boards
(GA-7DPXDW).

Here is our configuration:
1. Dual AMD Athlon MP2000+
2. 2 40GB Seagate ST340016A drives (IDE)
3. 1 Gigabit ethernet NIC
4. 3.5GB memory (Virtium modules: three 1GB modules and one 512MB module,
all 64x4 256Mbit chip stacks)
5. 400W power supply

Some of the problems we have seen:
1. Sporadic booting. Some machines booted perfectly every time, while others 
booted only 50% of the time. There did not seem to be any real correlation with 
the power supply or memory modules: changing either didn't fix the problem. 
Sometimes moving a machine from the rack to a desk fixed the problem, and then 
moving it back to the rack caused the problem to resurface. Sometimes changing 
the order of the memory modules helped, and other times it did not.

2. ECC errors. We saw a large number of ECC errors on these machines. Changing 
the memory modules often helped, but after a few days of running the problems 
would reappear. Sometimes the problem was persistent: even when software was 
written to reset the ECC bits appropriately, the bits didn't appear to 
change, and the error reappeared.
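Burn-in suites typically hunt for this kind of memory fault with simple pattern tests. Below is a toy userspace sketch of the idea (it only exercises whatever physical pages happen to back an ordinary buffer; actually reading or clearing the ECC status bits requires chipset-specific registers, as described above):

```python
def memory_pattern_test(size, patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Fill a buffer with known bit patterns and read each byte back.

    Returns a list of (offset, expected, actual) mismatches; on healthy
    hardware the list is empty.  This is illustrative only -- it cannot
    see or reset the northbridge's ECC error bits.
    """
    buf = bytearray(size)
    errors = []
    for pattern in patterns:
        # Write the pattern across the whole buffer...
        for i in range(size):
            buf[i] = pattern
        # ...then verify every byte reads back as written.
        for i in range(size):
            if buf[i] != pattern:
                errors.append((i, pattern, buf[i]))
    return errors
```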

3. DMA IRQ errors. After many hours of burn-in, there would be some disk issue 
related to dma_irq. We saw many errors and crashes after two days or so, but 
most of these were linked to a bug in the 2.4.18-3smp kernel and were fixed by 
upgrading to 2.4.18-5.

4. Console redirection issues. We noticed that a number of machines would stop 
responding on the console after a few days. They were otherwise burning in fine 
(you could ssh to them, for example), but the console seemed to have stopped 
responding. You might take a machine off the rack and the console would respond 
again; put it back on the rack and it would be fine, then a few days later it 
would stop responding again.

We seemed to have better success with the 2.4.7 or 2.4.9 kernels than with 
2.4.18, although a number of the above errors would crop up no matter which 
kernel we used.

-Ravi

Quoting beowulf-request at beowulf.org:

> Send Beowulf mailing list submissions to
> 	beowulf at beowulf.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://www.beowulf.org/mailman/listinfo/beowulf
> or, via email, send a message with subject or body 'help' to
> 	beowulf-request at beowulf.org
> 
> You can reach the person managing the list at
> 	beowulf-admin at beowulf.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Beowulf digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: Is there any work management tools like that. (Donald Becker)
>    2. Re: Gentoo and Beowulf-ish clusters (Dean Johnson)
>    3. Problems with dual Athlons (Manel Soria)
>    4. Re: Problems with dual Athlons (Ray Schwamberger)
>    5. Re: Problems with dual Athlons (Alberto Ramos)
>    6. Re: Problems with dual Athlons (Mark Hahn)
>    7. Re: Problems with dual Athlons (Robert G. Brown)
> 
> --__--__--
> 
> Message: 1
> Date: Tue, 30 Jul 2002 12:01:54 -0400 (EDT)
> From: Donald Becker <becker at scyld.com>
> To: William Thies <samsarazeal at yahoo.com>
> cc: beowulf at beowulf.org
> Subject: Re: Is there any work management tools like that.
> 
> On Tue, 30 Jul 2002, William Thies wrote:
> 
> > We need such kind of work management tools working on
> > a 32-node cluster.
> ..
> > 1. We will always run a very large master-slave
> > program on this cluster.
> ..
> > 2. Sometimes, we need to use this cluster to do other
> > works. 
> 
> Most any scheduling system can handle this kind of job allocation, at
> least for new jobs.
> 
> The devil is in the details.  For the large job workload, is that job a
> number of short-lived independent processes, or a single job with many
> long-lived communicating processes?
> 
> > (1) We want to power off 8 nodes first,
> 
> Why power off?  You can use WOL or IPMI, but that power-cycle will take
> on the order of minutes -- far longer than scheduling, and significantly
> longer than other approaches to clearing the machine state.  The Scyld
> system can clear the machine state in just a few seconds.
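For reference, the WOL mechanism mentioned here is just a "magic packet": six 0xFF bytes followed by the target MAC address repeated sixteen times, typically sent as a UDP broadcast. A minimal sketch (the MAC address, broadcast address, and port are placeholders):

```python
import socket

def make_magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    mac_bytes = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(mac_bytes) != 6:
        raise ValueError("MAC must have six colon-separated octets")
    return b"\xff" * 6 + mac_bytes * 16

def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet on the local segment (UDP port 9)."""
    packet = make_magic_packet(mac)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.sendto(packet, (broadcast, port))
    sock.close()
```

The target NIC and BIOS must have WOL enabled for the packet to do anything.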
> 
> > And at that time we don't want the GA program to use those 8 nodes
> 
> Every scheduling system can prevent job #1 from allocating new
> processes on the reserved nodes.  The question is, what happens to
> the processes of job #1?
>     Are they short-lived enough that they will terminate naturally in a
>       few seconds?
>     Can the slave processes just be suspended?
>     Do you expect the system to check-point and restart them later?
>      (If so, what about the non-check-pointed processes they are
>       communicating with?)
>     Do you expect the system to migrate them to another node?
>      (Again, what are your communication expectations?)
>     Can the processes be signalled to check-point or migrate themselves?
>       (Scyld Beowulf provides tools to make this very easy, but it's not
>        a common feature in other scheduling systems.)
> 
> > 3. This should be a multi-user management tool.
> > Would you like to recommend some tools like that?
> > Thanks very much!
> 
> We provide a queuing, scheduling and node allocation system(*) that can
> accomplish this within a cluster.  If you need site-wide scheduling
> (multiple OSes, a mix of cluster and independent nodes, crossing
> firewall boundaries, etc.) you should look at PBSPro, LSF, and SGE.
> 
> 
> -- 
> Donald Becker				becker at scyld.com
> Scyld Computing Corporation		http://www.scyld.com
> 410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
> Annapolis MD 21403			410-990-9993
> 
> 
> --__--__--
> 
> Message: 2
> Subject: Re: Gentoo and Beowulf-ish clusters
> From: Dean Johnson <dtj at uberh4x0r.org>
> To: Andrew Fant <fant at pobox.com>
> Cc: beowulf at beowulf.org
> Date: 30 Jul 2002 21:10:21 -0500
> 
> On Mon, 2002-07-29 at 17:45, Andrew Fant wrote:
> > Evening all,
> >   Has anyone got any experience using Gentoo as the base distro for a
> > Linux cluster?  For various reasons (both technical and political),
> > RedHat is not a particularly viable option on this project, and since
> > I have been so happy with the results of using Gentoo on a couple of
> > smaller systems, it seems like an option to consider.  Thanks to all
> > in advance for any information.
> > 
> 
> Most of the typical cluster software that you would use (mpich, etc.)
> shouldn't present any sort of problem, apart from perhaps just having
> to build it all, which I suspect most people do anyway. From what I
> know of Gentoo, it isn't THAT different, so it shouldn't cause a
> problem. Most of the important cluster software has been ported to
> many OSes, so it is pretty asymptotic to nerdvana wrt porting.
> 
> Related to that, I just built ganglia under Solaris and it built and
> worked totally as advertised after, of course, I went through all the
> hassle of getting gcc and such installed on the Solaris box.
> 
> 	-Dean
> 
> 
> --__--__--
> 
> Message: 3
> Date: Wed, 31 Jul 2002 10:17:16 +0200
> From: Manel Soria <manel at labtie.mmt.upc.es>
> Organization: UPC - Laboratori de Termotecnia i Energetica
> To: beowulf at beowulf.org
> Subject: Problems with dual Athlons
> 
> Hi,
> 
> Please let me report a problem that we have in our cluster of dual
> Athlons, in case somebody can help us.
> 
> We have 4 dual Athlon systems running kernel 2.4.18 (gcc 2.95.2).
> Two of them crash frequently and the other two run fine.
> We have tried replacing different hardware components and
> deactivating the SMP option, but the problem persists.
> 
> The main difference between them is that the systems that crash
> (the servers) have two network interfaces, while the systems that run
> fine (normal nodes) have only one network interface.
> Can this be the cause of the problem?  Would it be a good idea to use
> another version of gcc?
> 
> The motherboard is an ASUS A7M266-D. One of the systems that
> crashes is running Debian 2.1 and the other Debian 2.2. The systems
> that don't crash run Debian 2.1.
> 
> "Crash" here means that the VGA display is blank and the system has to
> be reset. There is no other relevant message.
> 
> Thanks
> 
> 
> --
> ===============================================
> Dr. Manel Soria
> ETSEIT - Centre Tecnologic de Transferencia de Calor
> C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
> Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
> E-Mail: manel at labtie.mmt.upc.es
> 
> 
> 
> 
> --__--__--
> 
> Message: 4
> Date: Wed, 31 Jul 2002 08:33:56 -0500
> From: Ray Schwamberger <ray at advancedclustering.com>
> Reply-To: ray at advancedclustering.com
> To: beowulf at beowulf.org
> Subject: Re: Problems with dual Athlons
> 
> I've seen similar issues with some dual Athlon systems; here are the
> issues as best I can sort them out from playing around with options...
> 
> 1)  OS/kernel version does seem to make a difference.  The same systems
> that run Red Hat 7.2 with perfect stability would crash within minutes
> with complaints about interrupt handlers and modprobe errors for the
> binfmt-0000 module, usually trashing their filesystems in the process.
> Perhaps 2.4.18 is not a good choice for running dual Athlons; I've had
> very limited time to play with this idea, but that is the largest
> coincidence I've managed to see so far.
> 
> 2) While doing channel bonding, we were getting very uneven transfer
> across the bonded interfaces.  Once again this was dual Athlons,
> using the newer, supposedly SMP-safe bonding driver.  The same machines
> running on a uni-processor kernel showed no issues at all, therefore it
> had to be an SMP issue.  Using the 'noapic' kernel option at boot time
> smoothed this one out, but again it points at something in the newest
> kernels not agreeing readily with dual Athlons, or perhaps the 762/768
> chipset combination.
> 
> You might try the noapic option. I'm thinking there may be some kind of
> issue with APIC, AMD and 2.4.18.
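For reference, a 2.4-era channel bonding setup of the sort described above looked roughly like this (interface names, address, and module options are illustrative):

```shell
# /etc/modules.conf (2.4-era): load the bonding driver; mode=0 is
# round-robin, miimon polls link state every 100 ms
alias bond0 bonding
options bonding mode=0 miimon=100

# Bring up the bond and enslave both physical interfaces
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1

# The noapic workaround is a kernel boot parameter, e.g. in lilo.conf:
#   append="noapic"
```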
> 
> 
> Manel Soria wrote:
> > Hi,
> > 
> > Please let me report a problem that we have in our cluster of dual
> > Athlons, in case somebody can help us.
> > [...]
> 
> 
> 
> --__--__--
> 
> Message: 5
> Date: Wed, 31 Jul 2002 16:33:43 +0200
> To: Manel Soria <manel at labtie.mmt.upc.es>
> Cc: Lista de correo sobre Beowulf <beowulf at beowulf.org>
> Subject: Re: Problems with dual Athlons
> From: Alberto Ramos <alberto at delta.ft.uam.es>
> 
> 
>   Hi,
>   
>   I don't know if this will help you, but we have a little Beowulf of
> dual Athlons, with Tyan S2466N-4M motherboards running Debian 3.0
> (Woody) and gcc 2.95.4, without any problems.
> 
>   Do you have "good" memory? With our motherboard we had lots of
> problems with the memory...
> 
>   Another question is whether the systems crash when they are highly
> loaded, or whether this has nothing to do with load.
> 
>   Alberto.
> 
> 
> On Wed, Jul 31, 2002 at 10:17:16AM +0200, Manel Soria wrote:
> > Hi,
> > 
> > Please let me report a problem that we have in our cluster of dual
> > Athlons, in case somebody can help us.
> > [...]
> 
> --__--__--
> 
> Message: 6
> Date: Wed, 31 Jul 2002 11:15:12 -0400 (EDT)
> From: Mark Hahn <hahn at physics.mcmaster.ca>
> To: Manel Soria <manel at labtie.mmt.upc.es>
> cc: <beowulf at beowulf.org>
> Subject: Re: Problems with dual Athlons
> 
> > We have 4 dual athlon systems running kernel 2.4.18 (gcc 2.95.2).
> 
> it's definitely worth your while to try a more recent kernel
> (i.e., 2.4.19-rc3, possibly the latest -ac or -aa version.)
> 
> > Two of them crash frequently and the other two run fine.
> > We have tried replacing different hardware components and
> > deactivating the SMP option, but the problem persists.
> 
> how seriously have you addressed the hardware explanation?
> for instance, have you verified that the CPU fans are mounted
> properly?  is there any temperature correlation to when the crashes
> happen?  do you have a reason to believe the dimms are good?
> how about bios settings (esp wrt memory timings) and/or bios versions?
> how about power supplies?  it's useful to have a "monster" 450W
> PS from a name-brand like Enermax around that you know is good,
> but that you really only use for testing.
> 
> > The main difference between them is that the systems that crash
> > (the servers) have two network interfaces while the systems that run
> > fine (normal nodes) have only one network interface.
> 
> bonding?  incidentally, are you using MPS 1.4 and kernel apic support?
> 
> > Can this be the cause of the problem?  Would it be a good idea to use
> > another version of gcc?
> 
> 2.95.2 is still recommended for 2.4 I believe.  I recall AC saying that
> it had some trouble with 2.5 though.
> 
> > The motherboard is an ASUS AM7M266-D. One of the systems that
> > crashes is running Debian 2.1  and the other Debian 2.2. The systems
> > that don't crash run Debian 2.1.
> 
> I don't see why userspace would matter.
> 
> > "Crash" here means that the VGA display is blank and the system has
> to
> > be reseted. There is no other relevant message.
> 
> consider first turning off the blanking console screensaver,
> and possibly running a serial console for logging purposes.
> I assume you also mean that magic-sysrq doesn't work either.
> I find this normally implicates system-level HW problems
> (heat, power, etc)
> 
> regards, mark hahn.
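Those suggestions translate into something like the following (baud rate and device names are illustrative):

```shell
# Turn off the console-blanking screensaver so a blank VGA is meaningful
setterm -blank 0

# Serial console for logging: kernel boot parameters (e.g. a lilo.conf
# append line); keeping tty0 last-but-one lets the local VGA console
# still receive messages
#   append="console=ttyS0,9600 console=tty0"

# Enable magic-sysrq so a wedged machine can still be probed
echo 1 > /proc/sys/kernel/sysrq
```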
> 
> 
> --__--__--
> 
> Message: 7
> Date: Wed, 31 Jul 2002 11:49:06 -0400 (EDT)
> From: "Robert G. Brown" <rgb at phy.duke.edu>
> To: Ray Schwamberger <ray at advancedclustering.com>
> Cc: beowulf at beowulf.org
> Subject: Re: Problems with dual Athlons
> 
> On Wed, 31 Jul 2002, Ray Schwamberger wrote:
> 
> > You might try the noapic option. I'm thinking there may be some kind of
> > issues with APIC, AMD and 2.4.18.
> 
> We don't have ASUS systems but instead a mix of Tyan 2460 and 2466
> systems, and we see very similar things, including the bizarreness of
> the blind crash problems appearing on one system (consistently and
> repeatedly) but not on another IDENTICAL system sitting right next to it.
> 
> We have found that power supplies (both the power line itself and the
> switching power supply in the chassis) can make a difference on the
> 2466's -- a marginal power supply is an invitation to problems for sure
> on these beasties.  This is reflected in the completely outrageous
> observation that I have some nodes that will boot and run stably when
> plugged into certain receptacles on the power pole, but not other
> receptacles.  If I put a polarity/circuit tester on the receptacles,
> they pass.  If I check the line voltages, they are nominal (120+ VAC).
> If I plug any 2466 into them (I tried 3), it fails to POST.  If I move
> the plug two receptacles up on the same pole and same circuit, it POSTs,
> installs, and works fine.  I haven't put an oscilloscope on the line
> when plugging it in, but I'm sure it would be fascinating to do so.
> 
> We're also in the process of investigating kernel snapshot dependencies
> and the SMP issues aforementioned as we continue to try to stabilize our
> 2460's, which seem even more sensitive than the 2466's (which so far
> seem to run stably and give decent performance overall).
> Unfortunately, our crashes occur with a mean time of days to a week or
> two under load in between (consistent with a rare interrupt conflict or
> SMP issue), so it takes a long time to test a potential fix.  We did
> avoid a crash for about 9 days on a 2460 running 2.4.18-5 (Red Hat's
> build id) after experiencing crashes on the node every 5-10 days, but
> are only just now accumulating better statistics on a group of nodes
> instead of just the one.
> 
> So overall, I concur -- try different SMP kernel releases and snapshots,
> try rearranging the cards (order often seems to matter) and BIOS
> settings, try noapic (which we should probably also do -- we haven't
> so far), and yes, try rearranging the way the nodes are plugged in.
> Notice that this is evil and insidious -- you can pull a node from a
> rack and bench it and it will run fine forever, but if you plug it back
> in to the same receptacle when you put it back, it has problems.
> Maddening.
> 
>    rgb
> 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 
> 
> 
> --__--__--
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 
> End of Beowulf Digest
> 


