(Fwd) Re: HELP! linux cluster with LAM-MPI

Fri Feb 9 07:46:14 PST 2001

From:           	khocha at icu.ac.kr
To:             	beowulf at beowulf.org
Date sent:      	Fri, 9 Feb 2001 20:35:21 +0900
Subject:        	HELP! linux cluster with LAM-MPI

> Dear All.
> 
> I'm a graduate student at the 'Information and Communications Univ.' in Korea.
> In our lab, we built a diskless cluster system with Intel L440GX+ boards.
> 
> Our system uses Linux kernel 2.2.13 and LAM-MPI 6.3.2.
> During testing, however, the system ran into unexpected trouble.
> 
> The MPI test program has only two communication steps (that is, it is 'EP'
> style): 1. distribute the data (at the beginning), 2. collect the result data
> (at the end). It uses only a little memory, but has many loop operations.
> 
> With a few iterations it works well, but when we increase the number of loop
> operations to solve some harder problems, a node displays the error message
> below and then goes down.
> 
> ==============================================================================
> [root at node11 root]# Unable to handle kernel paging request at virtual address e6 70e602
> current->tss.cr3 = 07591000, %cr3 = 07591000
> *pde = 00000000
> Oops: 0002
> CPU:    1
> EIP:    0010:[]
> EFLAGS: 00010246
> eax: 00000000   ebx: c7593fb4   ecx: 00000286   edx: 00000000
> esi: 00000000   edi: c7592000   ebp: c7593fbc   esp: c7593fa0
> ds: 0018   es: 0018   ss: 0018
> Process vital (pid: 424, process nr: 20, stackpage=c7593000)
> Stack: bffffe14 00000032 00000005 00000000 c7592000 00000000 1dcd6500 bffffd3c
>        c0109fb8 bffffd34 00000000 40107bec 00000000 bffffe14 bffffd3c 000000a2
>        c010002b 0000002b 000000a2 400a9f51 00000023 00000206 bffffd14 0000002b
> Call Trace: [] []
> Code: 00 b0 02 e6 70 e6 80 e4 71 e6 80 88 c1 31 d2 88 ca 89 54 24
> ==============================================================================
> 
> Please give us a hint on how to solve this problem.
> 
> P.S. Our system consists of:
> -------------------------------
> - 24 cluster nodes: L440GX+ boards, dual Pentium III 550MHz; the nodes have
>   no disks and use the server's RAID
> - Server: Compaq Proliant 1600, dual Pentium III 600MHz
> - Serial hub (Comtrol Rocketport)
> - Fast Ethernet hub (3com)
> - 108 GB RAID
> 
Sir,

This looks like a kernel fault rather than a LAM fault. Does it occur on 
one node only, or on random nodes? If it only happens on one node, I 
would assume that node has some sort of faulty hardware. Unfortunately, I 
am not qualified to analyse your Oops.

You might want to update your kernel, either to 2.2.18 or 2.4.1, and see if the
same oops occurs. If it still does, and on random machines at that, you 
will probably get better help on the linux-kernel list.
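
To settle the one-node-or-random question, it may help to tally the oopses 
per node. Below is a minimal Python sketch, assuming you have already captured 
each node's console output as text (for instance through the serial hub). The 
function names oops_counts and summarize and the sample data are made up 
for illustration; they are not part of LAM or the kernel.

```python
from collections import Counter


def oops_counts(logs_by_node):
    """Count kernel oops reports per node.

    logs_by_node maps a node name to the console text captured from that
    node. How the text is captured is an assumption left to the reader.
    A report is counted for every line that starts with "Oops:".
    """
    counts = Counter()
    for node, text in logs_by_node.items():
        counts[node] = sum(
            1 for line in text.splitlines() if line.strip().startswith("Oops:")
        )
    return counts


def summarize(counts):
    """Classify the failure pattern as 'none', 'single node' or 'multiple nodes'."""
    affected = [node for node, n in counts.items() if n > 0]
    if not affected:
        return "none"
    if len(affected) == 1:
        return "single node"
    return "multiple nodes"


if __name__ == "__main__":
    # Hypothetical sample data: only node11 has logged an oops.
    sample = {
        "node11": "Unable to handle kernel paging request\nOops: 0002\nCPU:    1\n",
        "node12": "login:\n",
    }
    counts = oops_counts(sample)
    for node in sorted(counts):
        print(node, counts[node])
    print(summarize(counts))
```

If the result is "single node" over several runs, suspect that node's 
hardware first; if it is "multiple nodes", a kernel problem is more likely.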

Good luck.

Yours,
-Simen

--
Simen Thoresen, Beowulf-cleaner and random artist.

Er det ikke rart?
The gnu RART-project on http://valinor.dolphinics.no:1080/~simentt/rart