Beowulf digest, Vol 1 #967 - 7 msgs
Louis J. Romero
louisr at aspsys.com
Thu Aug 1 07:41:49 PDT 2002
Hi Ravi,

You seem to be describing the classic symptoms of a dirty power source
(moving a machine to the desktop corrected the problem, and re-arranging
components sometimes cleared the problem while other times it did not).
Have you had the power going into the rack checked recently?
Louis
On Wednesday 31 July 2002 01:56 pm, you wrote:
> We have seen a large number of issues with the Gigabyte dual Athlon boards
> (GA-7DPXDW).
>
> Here is our configuration:
> 1. Dual AMD Athlon MP2000+
> 2. 2 40GB Seagate ST340016A drives (IDE)
> 3. 1 Gigabit ethernet NIC
> 4. 3.5GB memory (Virtium modules: three 1GB modules and one 512MB module,
> all 64x4 256Mbit chip stacks)
> 5. 400W power supply
>
> Some of the problems we have seen:
> 1. Sporadic booting. Some machines booted perfectly every time, and others
> booted only 50% of the time. There did not seem to be any real correlation
> with power supply or memory modules: changing either didn't fix the
> problem. Sometimes, moving a machine from the rack to a desk fixed the
> problem, and then moving it back to the rack caused the problem to
> resurface. Sometimes, changing around the order of the memory modules
> helped, and other times it did not.
>
> 2. ECC errors. We saw a large number of ECC errors in these machines.
> Changing the memory modules often helped, but after a few days of running,
> the problems would reappear. Sometimes, the problem was persistent: even if
> software was written to reset the ECC bits appropriately, the bits didn't
> appear to be changed, and the error reappeared.
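>
> (As an aside, a minimal sketch of what such ECC-polling software can look
> like from userspace: it reads a status byte out of the host bridge's PCI
> configuration space through /proc/bus/pci. The ECC_STATUS_OFFSET below is
> a placeholder, not the real AMD-762 register; the actual offsets are in
> the chipset datasheet.)
>
>     #include <stdio.h>
>     #include <fcntl.h>
>     #include <unistd.h>
>
>     /* Placeholder offset -- consult the AMD-762 datasheet. */
>     #define ECC_STATUS_OFFSET 0x48
>
>     int main(void)
>     {
>         unsigned char status;
>         /* Bus 0, device 0, function 0 is normally the host bridge. */
>         int fd = open("/proc/bus/pci/00/00.0", O_RDONLY);
>         if (fd < 0) { perror("open"); return 1; }
>         if (lseek(fd, ECC_STATUS_OFFSET, SEEK_SET) < 0) {
>             perror("lseek");
>             return 1;
>         }
>         if (read(fd, &status, 1) != 1) {
>             perror("read");
>             return 1;
>         }
>         printf("ECC status byte: 0x%02x\n", status);
>         close(fd);
>         return 0;
>     }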
>
> 3. DMA IRQ errors. After many hours of burning, there would be some disk
> issue related to dma_irq. We had seen many errors and crashes after 2 days
> or so, but most of these were linked to a bug in the 2.4.18-3smp kernel,
> and were fixed by upgrading to 2.4.18-5.
>
> 4. Console redirection issues. We noticed that a number of machines would
> stop responding to the console after a few days. They were otherwise
> burning fine (you could ssh to them, for example), but the console seemed
> to have stopped responding. You might take it off the rack and the console
> would respond again, then put it on the rack and it would be fine. Then a
> few days later, it would stop responding again.
>
> We seemed to have better success with the 2.4.7 or 2.4.9 kernels than with
> 2.4.18, although a number of the above errors would crop up no matter which
> kernel we used.
>
> -Ravi
>
> Quoting beowulf-request at beowulf.org:
> > Send Beowulf mailing list submissions to
> > beowulf at beowulf.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> > or, via email, send a message with subject or body 'help' to
> > beowulf-request at beowulf.org
> >
> > You can reach the person managing the list at
> > beowulf-admin at beowulf.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of Beowulf digest..."
> >
> >
> > Today's Topics:
> >
> > 1. Re: Is there any work management tools like that. (Donald Becker)
> > 2. Re: Gentoo and Beowulf-ish clusters (Dean Johnson)
> > 3. Problems with dual Athlons (Manel Soria)
> > 4. Re: Problems with dual Athlons (Ray Schwamberger)
> > 5. Re: Problems with dual Athlons (Alberto Ramos)
> > 6. Re: Problems with dual Athlons (Mark Hahn)
> > 7. Re: Problems with dual Athlons (Robert G. Brown)
> >
> > --__--__--
> >
> > Message: 1
> > Date: Tue, 30 Jul 2002 12:01:54 -0400 (EDT)
> > From: Donald Becker <becker at scyld.com>
> > To: William Thies <samsarazeal at yahoo.com>
> > cc: beowulf at beowulf.org
> > Subject: Re: Is there any work management tools like that.
> >
> > On Tue, 30 Jul 2002, William Thies wrote:
> > > We need this kind of work management tool working on
> > > a 32-node cluster.
> >
> > ..
> >
> > > 1. We will always run a very large master-slave
> > > program on this cluster.
> >
> > ..
> >
> > > 2. Sometimes, we need to use this cluster to do other
> > > work.
> >
> > Most any scheduling system can handle this kind of job allocation, at
> > least for new jobs.
> >
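> > (For illustration -- these are standard PBS directives, not a
> > description of any particular site's setup: a job script for the large
> > master-slave program could reserve 24 of the 32 nodes up front, leaving
> > 8 free for other work. "ga_master" is a placeholder executable name.)
> >
> >     #!/bin/sh
> >     #PBS -N ga-run
> >     #PBS -l nodes=24
> >     #PBS -l walltime=48:00:00
> >     # The remaining 8 nodes stay free for other queued jobs.
> >     cd $PBS_O_WORKDIR
> >     ./ga_master
> >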
> > The devil is in the details. For the large job workload, is that job
> > a number of short-lived independent processes, or a single job with
> > many long-lived communicating processes?
> >
> > > (1) We want to power off 8 nodes first,
> >
> > Why power off? You can use WOL or IPMI, but that power-cycle will take
> > on the order of minutes -- far longer than scheduling, and significantly
> > longer than other approaches to clearing the machine state. The Scyld
> > system can clear the machine state in just a few seconds.
> >
> > > And at that time we don't want the GA program to use those 8 nodes
> >
> > Every scheduling system can prevent job #1 from allocating new
> > processes on the reserved nodes. The question is, what happens to
> > the processes of job #1?
> > Are they short-lived enough that they will terminate naturally in
> > a few seconds?
> > Can the slave processes just be suspended?
> > Do you expect the system to check-point and restart them later?
> > (If so, what about the non-check-pointed processes they are
> > communicating with?)
> > Do you expect the system to migrate them to another node?
> > (Again, what are your communication expectations?)
> > Can the processes be signalled to check-point or migrate themselves?
> > (Scyld Beowulf provides tools to make this very easy, but it's not
> > a common feature on other scheduling systems.)
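> >
> > (To make the suspension option concrete: on Linux, a scheduler can stop
> > and later resume a whole process group with signals. A minimal sketch,
> > assuming the slaves were launched into one process group whose id the
> > scheduler recorded -- "pgctl" is a made-up name:)
> >
> >     #include <sys/types.h>
> >     #include <signal.h>
> >     #include <stdio.h>
> >     #include <stdlib.h>
> >
> >     /* Usage: pgctl <pgid> stop|cont */
> >     int main(int argc, char **argv)
> >     {
> >         if (argc != 3) {
> >             fprintf(stderr, "usage: %s <pgid> stop|cont\n", argv[0]);
> >             return 1;
> >         }
> >         pid_t pgid = (pid_t)atoi(argv[1]);
> >         int sig = (argv[2][0] == 's') ? SIGSTOP : SIGCONT;
> >         /* A negative pid signals every process in the group. */
> >         if (kill(-pgid, sig) < 0) {
> >             perror("kill");
> >             return 1;
> >         }
> >         return 0;
> >     }
> >
> > (Note the trade-off raised above still applies: stopped slaves keep
> > their memory resident, and any peers communicating with them will
> > block.)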
> >
> > > 3. This should be a multi-user management tool.
> > > Could you recommend some tools like that?
> > > Thanks very much!
> >
> > We provide a queuing, scheduling and node allocation system(*) that
> > can accomplish this within a cluster. If you need site-wide scheduling
> > (multiple OSes, a mix of cluster and independent nodes, crossing
> > firewall boundaries, etc.) you should look at PBSPro, LSF, and SGE.
> >
> >
> > --
> > Donald Becker becker at scyld.com
> > Scyld Computing Corporation http://www.scyld.com
> > 410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
> > Annapolis MD 21403 410-990-9993
> >
> >
> > --__--__--
> >
> > Message: 2
> > Subject: Re: Gentoo and Beowulf-ish clusters
> > From: Dean Johnson <dtj at uberh4x0r.org>
> > To: Andrew Fant <fant at pobox.com>
> > Cc: beowulf at beowulf.org
> > Date: 30 Jul 2002 21:10:21 -0500
> >
> > On Mon, 2002-07-29 at 17:45, Andrew Fant wrote:
> > > Evening all,
> > >    Has anyone got any experience using Gentoo as the base distro for a
> > > Linux cluster? For various reasons (both technical and political), RedHat
> > > is not a particularly viable option on this project, and since I have been
> > > so happy with the results of using Gentoo on a couple of smaller systems,
> > > it seems like an option to consider. Thanks to all in advance for any
> > > information.
> >
> > Most of the typical cluster software that you would use (mpich, etc.)
> > shouldn't present any sort of problem, apart from perhaps just having
> > to build it all, which I suspect most people do anyway. From what I
> > know of Gentoo, it isn't THAT different, so it shouldn't cause a
> > problem. Most of the important cluster software has been ported to
> > many OSes, so it is pretty asymptotic to nerdvana wrt porting.
> >
> > Relating to that, I just built ganglia under Solaris and it built and
> > worked totally as advertised after, of course, I went through all the
> > hassle of getting gcc and such installed on the Solaris box.
> >
> > -Dean
> >
> >
> > --__--__--
> >
> > Message: 3
> > Date: Wed, 31 Jul 2002 10:17:16 +0200
> > From: Manel Soria <manel at labtie.mmt.upc.es>
> > Organization: UPC - Laboratori de Termotecnia i Energetica
> > To: beowulf at beowulf.org
> > Subject: Problems with dual Athlons
> >
> > Hi,
> >
> > Please let me report a problem that we have in our cluster with dual
> > Athlons, in case somebody can help us.
> >
> > We have 4 dual Athlon systems running kernel 2.4.18 (gcc 2.95.2).
> > Two of them crash frequently and the other two run fine.
> > We have tried replacing different hardware components and
> > deactivating the SMP option, but the problem persists.
> >
> > The main difference between them is that the systems that crash
> > (the servers) have two network interfaces, while the systems that run
> > fine (normal nodes) have only one network interface.
> > Can this be the cause of the problem? Would it be a good idea to use
> > another version of gcc?
> >
> > The motherboard is an ASUS A7M266-D. One of the systems that
> > crashes is running Debian 2.1 and the other Debian 2.2. The systems
> > that don't crash run Debian 2.1.
> >
> > "Crash" here means that the VGA display is blank and the system has to
> > be reset. There is no other relevant message.
> >
> > Thanks
> >
> >
> > --
> > ===============================================
> > Dr. Manel Soria
> > ETSEIT - Centre Tecnologic de Transferencia de Calor
> > C/ Colom 11 08222 Terrassa (Barcelona) SPAIN
> > Tf: +34 93 739 8287 ; Fax: +34 93 739 8101
> > E-Mail: manel at labtie.mmt.upc.es
> >
> >
> >
> >
> > --__--__--
> >
> > Message: 4
> > Date: Wed, 31 Jul 2002 08:33:56 -0500
> > From: Ray Schwamberger <ray at advancedclustering.com>
> > Reply-To: ray at advancedclustering.com
> > To: beowulf at beowulf.org
> > Subject: Re: Problems with dual Athlons
> >
> > I've seen similar issues with some dual Athlon systems; the issues as
> > best I can sort them from playing around with options...
> >
> > 1) OS/kernel version does seem to make a difference. The same systems
> > that run Red Hat 7.2 with perfect stability would crash within minutes
> > with complaints about interrupt handlers and modprobe errors for the
> > binfmt-0000 module, usually trashing their filesystems in the process.
> > Perhaps 2.4.18 is not a good choice for running dual Athlons; I've had
> > very limited time to play with this idea, but that is the largest
> > coincidence I've managed to see so far.
> >
> > 2) While doing channel bonding, we were getting very uneven transfer
> > across the bonded interfaces. Once again this was dual Athlons,
> > using the newer, supposedly SMP-safe bonding driver. The same machines
> > running on a uni-processor kernel showed no issues at all, therefore it
> > had to be an SMP issue. Using the 'noapic' kernel option at boot time
> > smoothed this one out, but again it points at something in the newest
> > kernels not agreeing readily with dual Athlons, or perhaps the 762/768
> > chipset combination.
> >
> > You might try the noapic option. I'm thinking there may be some kind
> > of issue with APIC, AMD and 2.4.18.
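> >
> > (If you try it: noapic is passed on the kernel command line at boot.
> > With LILO, for example, an entry along these lines in lilo.conf,
> > followed by re-running lilo -- the image, label and root values here
> > are placeholders to adjust for your setup:)
> >
> >     image=/boot/vmlinuz-2.4.18
> >         label=linux-noapic
> >         read-only
> >         root=/dev/hda1
> >         append="noapic"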
> >
> > Manel Soria wrote:
> > > Hi,
> > >
> > > Please let me report a problem that we have in our cluster with dual
> > > Athlons, in case somebody can help us.
> > >
> > > We have 4 dual Athlon systems running kernel 2.4.18 (gcc 2.95.2).
> > > Two of them crash frequently and the other two run fine.
> > > We have tried replacing different hardware components and
> > > deactivating the SMP option, but the problem persists.
> > >
> > > The main difference between them is that the systems that crash
> > > (the servers) have two network interfaces, while the systems that run
> > > fine (normal nodes) have only one network interface.
> > > Can this be the cause of the problem? Would it be a good idea to use
> > > another version of gcc?
> > >
> > > The motherboard is an ASUS A7M266-D. One of the systems that
> > > crashes is running Debian 2.1 and the other Debian 2.2. The systems
> > > that don't crash run Debian 2.1.
> > >
> > > "Crash" here means that the VGA display is blank and the system has
> > > to be reset. There is no other relevant message.
> > >
> > > Thanks
> > >
> > >
> > > --
> > > ===============================================
> > > Dr. Manel Soria
> > > ETSEIT - Centre Tecnologic de Transferencia de Calor
> > > C/ Colom 11 08222 Terrassa (Barcelona) SPAIN
> > > Tf: +34 93 739 8287 ; Fax: +34 93 739 8101
> > > E-Mail: manel at labtie.mmt.upc.es
> > >
> > >
> > >
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org
> > > To change your subscription (digest mode or unsubscribe) visit
> > > http://www.beowulf.org/mailman/listinfo/beowulf
> >
> >
> >
> > --__--__--
> >
> > Message: 5
> > Date: Wed, 31 Jul 2002 16:33:43 +0200
> > To: Manel Soria <manel at labtie.mmt.upc.es>
> > Cc: Lista de correo sobre Beowulf <beowulf at beowulf.org>
> > Subject: Re: Problems with dual Athlons
> > From: Alberto Ramos <alberto at delta.ft.uam.es>
> >
> >
> > Hi,
> >
> > I don't know if this will help you, but we have a little Beowulf of
> > dual Athlons, with Tyan S2466N-4M motherboards running Debian 3.0
> > (Woody) and gcc 2.95.4, without any problem.
> >
> > Do you have "good" memory? With our motherboard we have lots of
> > problems with the memory...
> >
> > Another question is whether the systems crash when they are highly
> > loaded, or whether that has nothing to do with it.
> >
> > Alberto.
> >
> > On Wed, Jul 31, 2002 at 10:17:16AM +0200, Manel Soria wrote:
> > > Hi,
> > >
> > > Please let me report a problem that we have in our cluster with dual
> > > Athlons, in case somebody can help us.
> > >
> > > We have 4 dual Athlon systems running kernel 2.4.18 (gcc 2.95.2).
> > > Two of them crash frequently and the other two run fine.
> > > We have tried replacing different hardware components and
> > > deactivating the SMP option, but the problem persists.
> > >
> > > The main difference between them is that the systems that crash
> > > (the servers) have two network interfaces, while the systems that run
> > > fine (normal nodes) have only one network interface.
> > > Can this be the cause of the problem? Would it be a good idea to use
> > > another version of gcc?
> > >
> > > The motherboard is an ASUS A7M266-D. One of the systems that
> > > crashes is running Debian 2.1 and the other Debian 2.2. The systems
> > > that don't crash run Debian 2.1.
> > >
> > > "Crash" here means that the VGA display is blank and the system has
> > > to be reset. There is no other relevant message.
> > >
> > > Thanks
> > >
> > >
> > > --
> > > ===============================================
> > > Dr. Manel Soria
> > > ETSEIT - Centre Tecnologic de Transferencia de Calor
> > > C/ Colom 11 08222 Terrassa (Barcelona) SPAIN
> > > Tf: +34 93 739 8287 ; Fax: +34 93 739 8101
> > > E-Mail: manel at labtie.mmt.upc.es
> > >
> > >
> > >
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org
> > > To change your subscription (digest mode or unsubscribe) visit
> > > http://www.beowulf.org/mailman/listinfo/beowulf
> >
> > --__--__--
> >
> > Message: 6
> > Date: Wed, 31 Jul 2002 11:15:12 -0400 (EDT)
> > From: Mark Hahn <hahn at physics.mcmaster.ca>
> > To: Manel Soria <manel at labtie.mmt.upc.es>
> > cc: <beowulf at beowulf.org>
> > Subject: Re: Problems with dual Athlons
> >
> > > We have 4 dual athlon systems running kernel 2.4.18 (gcc 2.95.2).
> >
> > it's definitely worth your while to try a more recent kernel
> > (i.e., 2.4.19-rc3, possibly the latest -ac or -aa version).
> >
> > > Two of them crash frequently and the other two run fine.
> > > We have tried replacing different hardware components and
> > > deactivating the SMP option, but the problem persists.
> >
> > how seriously have you addressed the hardware explanation?
> > for instance, have you verified that the CPU fans are mounted
> > properly? is there any temperature correlation to when the crashes
> > happen? do you have a reason to believe the dimms are good?
> > how about bios settings (esp wrt memory timings) and/or bios versions?
> > how about power supplies? it's useful to have a "monster" 450W
> > PS from a name-brand like Enermax around that you know is good
> > but use only for testing.
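> >
> > (one way to get the temperature correlation, assuming lm_sensors is
> > installed and configured for your sensor chip: log readings with
> > timestamps and compare them against crash times, e.g.
> >
> >     while true; do date; sensors; sleep 60; done >> /var/log/temp.log
> >
> > and an overnight pass of memtest86 is the usual way to decide whether
> > the dimms are good.)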
> >
> > > The main difference between them is that the systems that crash
> > > (the servers) have two network interfaces while the systems that run
> > > fine (normal nodes) have only one network interface.
> >
> > bonding? incidentally, are you using MPS 1.4 and kernel apic support?
> >
> > > Can this be the cause of the problem? Would it be a good idea to use
> > > another version of gcc?
> >
> > 2.95.2 is still recommended for 2.4, I believe. I recall AC saying
> > that it had some trouble with 2.5, though.
> >
> > > The motherboard is an ASUS A7M266-D. One of the systems that
> > > crashes is running Debian 2.1 and the other Debian 2.2. The systems
> > > that don't crash run Debian 2.1.
> >
> > I don't see why userspace would matter.
> >
> > > "Crash" here means that the VGA display is blank and the system has
> >
> > to
> >
> > > be reseted. There is no other relevant message.
> >
> > consider first turning off the blanking console screensaver,
> > and possibly running a serial console for logging purposes.
> > I assume you also mean that magic-sysrq doesn't work either.
> > I find this normally implicates system-level HW problems
> > (heat, power, etc.).
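> >
> > (concretely -- these are stock settings, nothing exotic: console
> > blanking can be turned off per virtual console with
> >
> >     setterm -blank 0
> >
> > and a serial console set up by appending something like
> >
> >     console=ttyS0,9600n8
> >
> > to the kernel command line; Documentation/serial-console.txt in the
> > kernel source has the details.)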
> >
> > regards, mark hahn.
> >
> >
> > --__--__--
> >
> > Message: 7
> > Date: Wed, 31 Jul 2002 11:49:06 -0400 (EDT)
> > From: "Robert G. Brown" <rgb at phy.duke.edu>
> > To: Ray Schwamberger <ray at advancedclustering.com>
> > Cc: beowulf at beowulf.org
> > Subject: Re: Problems with dual Athlons
> >
> > On Wed, 31 Jul 2002, Ray Schwamberger wrote:
> > > You might try the noapic option. I'm thinking there may be some
> > > kind of issue with APIC, AMD and 2.4.18.
> >
> > We don't have ASUS systems but instead a mix of Tyan 2460 and 2466
> > systems, and see very similar things, including the bizarreness of the
> > blind crash problems appearing on one system (consistently and
> > repeatedly) but not another IDENTICAL system sitting right next to it.
> >
> > We have found that power supplies (both the power line itself and the
> > switching power supply in the chassis) can make a difference on the
> > 2466's -- a marginal power supply is an invitation to problems for
> > sure on these beasties. This is reflected in the completely outrageous
> > observation that I have some nodes that will boot and run stably when
> > plugged into certain receptacles on the power pole, but not other
> > receptacles. If I put a polarity/circuit tester on the receptacles,
> > they pass. If I check the line voltages, they are nominal (120+ VAC).
> > If I plug any 2466 into them (I tried 3), it fails to POST. If I move
> > the plug two receptacles up on the same pole and same circuit, it
> > POSTs, installs, and works fine. I haven't put an oscilloscope on the
> > line when plugging it in, but I'm sure it would be fascinating to do
> > so.
> >
> > We're also in the process of investigating kernel snapshot
> > dependencies and the SMP issues aforementioned as we continue to try
> > to stabilize our 2460's, which seem even more sensitive than the
> > 2466's (which so far seem to run stably and give decent performance
> > overall). Unfortunately, our crashes occur with a mean time of days
> > to a week or two under load in between (consistent with a rare
> > interrupt conflict or SMP issue), so it takes a long time to test a
> > potential fix. We did avoid a crash for about 9 days on a 2460
> > running 2.4.18-5 (Red Hat's build id) after experiencing crashes on
> > the node every 5-10 days, but are only just now accumulating better
> > statistics on a group of nodes instead of just the one.
> >
> > So overall, I concur -- try different SMP kernel releases and
> > snapshots, try rearranging the cards (order often seems to matter)
> > and BIOS settings, try noapic (which we should probably also do --
> > we haven't so far), and yes, try rearranging the way the nodes are
> > plugged in. Notice that this is evil and insidious -- you can pull a
> > node from a rack and bench it and it will run fine forever, but if
> > you plug it back in to the same receptacle when you put it back, it
> > has problems. Maddening.
> >
> > rgb
> >
> > Robert G. Brown http://www.phy.duke.edu/~rgb/
> > Duke University Dept. of Physics, Box 90305
> > Durham, N.C. 27708-0305
> > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
> >
> >
> >
> >
> >
> > --__--__--
> >
> > _______________________________________________
> > Beowulf mailing list
> > Beowulf at beowulf.org
> > http://www.beowulf.org/mailman/listinfo/beowulf
> >
> >
> > End of Beowulf Digest
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
--
Louis J. Romero
Email: louisr at aspsys.com
Local: (303) 431-4606
Aspen Systems, Inc.
3900 Youngfield Street
Wheat Ridge, Co 80033
Toll Free: (800) 992-9242
Fax: (303) 431-7196
URL: http://www.aspsys.com