Request on advice on which kernel? 2.2 or 2.4?

Donald Becker becker at scyld.com
Sun Oct 7 16:40:28 PDT 2001


On Wed, 3 Oct 2001, Martin Siegert wrote:
> On Wed, Oct 03, 2001 at 09:57:44AM -0400, Donald Becker wrote:
> > The biggest advantage of 2.4 kernel is the SMP improvements to the
> > network stack.  You'll see less benefit with your single processor
> > nodes, with most of the benefit on four processor nodes.
> 
> This brings up another issue: the APIC code (bugs?) in the 2.4 series
> of kernels. I encounter the following problem: when using 2.4 kernels
..
> the LAM MPI distribution some MPI programs will hang almost every time.
..
> - the program hangs when executing a r = read(sock, buf, nbytes) statement
>   over and over again. Typically: r=56 or r=696 and nbytes=116765796, i.e.,
>   if you decrease 116765796 in steps of 56 or 696, the program hangs for
>   practical purposes.
> 
> - when using mpich the program does not hang.

This is very curious.  I wouldn't expect a difference between the two.
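
For reference, the receive loop being described usually looks something
like the sketch below.  This is not LAM's code: read_full is a
hypothetical helper, and the only thing assumed is the standard behavior
that read() on a stream socket may return fewer bytes than requested.  A
return of r=56 is therefore legal in itself, but at 56 bytes per call a
116765796-byte transfer needs on the order of two million iterations.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly nbytes from fd unless EOF or an
 * error intervenes.  Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t nbytes)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < nbytes) {
        ssize_t r = read(fd, p + done, nbytes - done);
        if (r > 0) {
            done += (size_t)r;   /* short read: advance and keep going */
        } else if (r == 0) {
            break;               /* peer closed the connection */
        } else if (errno != EINTR) {
            return -1;           /* real error */
        }
        /* EINTR: fall through and retry */
    }
    return (ssize_t)done;
}
```

Logging r and the remaining count inside the loop is a cheap way to tell
a true hang (no call ever returns) from very slow progress.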

> - when using the 2.2.19 smp kernel the program does not hang.

Does the 2.2.17 kernel hang?  There have been various APIC "fixes" over time,
but 2.2.17 definitely had the APIC bug.

(The APIC bug fix cycle usually alternates between "there is no bug,
it's the device driver" and "fixed an APIC bug, it works perfectly
now".)

> From this I concluded that I cannot use a 2.4 kernel and LAM. I do not know
> with certainty what is causing the failures:
> 
> - is it a LAM bug?
> 
> - is it a 3c59x driver bug?

Possibly, but not obviously.  All of my network drivers have timer-driven
checks for failure to complete operations.  The exact error message
indicates whether the chip has simply stopped operating or is waiting for
interrupt service.

A few drivers even fall back to timer-based polling specifically to be
able to log that the APIC has failed.
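
As a rough illustration of that kind of check, here is a minimal
user-space sketch.  It is not taken from any actual driver: the struct,
the field names and the verdict codes are all hypothetical, and it
assumes the two cases can be told apart by whether the chip shows a
pending interrupt status when the transmit times out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical verdicts for a timer-driven transmit check. */
enum wd_verdict { WD_OK, WD_CHIP_STOPPED, WD_IRQ_NOT_SERVICED };

/* Hypothetical snapshot of the state a driver would keep or read. */
struct nic_state {
    bool tx_pending;        /* a transmit was queued and has not completed */
    bool irq_status_set;    /* chip reports an event awaiting service */
    unsigned long last_tx;  /* tick at which the pending transmit was queued */
    bool polling;           /* true once we have fallen back to polling */
};

/* Called from a periodic timer.  Classifies a stalled transmit and, if
 * the chip is waiting on an interrupt that never arrived, switches to
 * timer-based polling so the failure can still be logged. */
enum wd_verdict watchdog_tick(struct nic_state *nic,
                              unsigned long now, unsigned long timeout)
{
    assert(nic != NULL);

    /* Unsigned subtraction stays correct when the tick counter wraps. */
    if (!nic->tx_pending || now - nic->last_tx < timeout)
        return WD_OK;

    if (!nic->irq_status_set) {
        fprintf(stderr, "tx timeout: chip has stopped\n");
        return WD_CHIP_STOPPED;
    }

    if (!nic->polling) {
        nic->polling = true;
        fprintf(stderr, "tx timeout: interrupt pending but not serviced, "
                        "falling back to polling\n");
    }
    return WD_IRQ_NOT_SERVICED;
}
```

The point of splitting the verdicts is that only the second one points
away from the driver and toward interrupt delivery.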

> - is it a 2.4 kernel bug?
> 
> Besides this problem I have encountered by now several RedHat 7.1 machines
> on campus (UP or SMP) that had network problems which could be solved by
> including the "noapic" option in lilo.conf. Are there chances that the
> APIC problems in the 2.4 kernels are resolved soon (there seem to be changes
> to the APIC code in 2.4.10, but I still have problems)? Is there a performance
> hit related to the "noapic" option?

There is a performance hit, but slightly slower is better than randomly
broken.  We ship our 2.2 kernels with "noapic" enabled.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993




