HELP! linux cluster with LAM-MPI

Fri Feb 9 05:27:53 PST 2001

From:           	khocha at icu.ac.kr
To:             	beowulf at beowulf.org
Date sent:      	Fri, 9 Feb 2001 20:35:21 +0900
Subject:        	HELP! linux cluster with LAM-MPI

> Dear All.
> 
> I'm a graduate student at 'Information and Communications Univ.' in Korea. In
> our lab, we built a diskless cluster using Intel L440GX+ boards.
> 
> Our system uses Linux kernel 2.2.13 and LAM-MPI 6.3.2.
> During testing, however, the system ran into unexpected trouble.
> 
> The MPI test program has only two communication steps (that is, it is 'EP' style):
> 1. distribute the data (at the beginning), 2. collect the result data (at the end).
> It uses only a little memory, but has many loop operations.
> 
> With a few iterations it works well, but when we increase the number of loop
> iterations to solve harder problems, a node displays the following error
> message and then goes down.
> 
> ================================================================================
> [root at node11 root]# Unable to handle kernel paging request at virtual address e6 70e602
> current->tss.cr3 = 07591000, %cr3 = 07591000
> *pde = 00000000
> Oops: 0002
> CPU:    1
> EIP:    0010:[]
> EFLAGS: 00010246
> eax: 00000000   ebx: c7593fb4   ecx: 00000286   edx: 00000000
> esi: 00000000   edi: c7592000   ebp: c7593fbc   esp: c7593fa0
> ds: 0018   es: 0018   ss: 0018
> Process vital (pid: 424, process nr: 20, stackpage=c7593000)
> Stack: bffffe14 00000032 00000005 00000000 c7592000 00000000 1dcd6500 bffffd3c
>        c0109fb8 bffffd34 00000000 40107bec 00000000 bffffe14 bffffd3c 000000a2
>        c010002b 0000002b 000000a2 400a9f51 00000023 00000206 bffffd14 0000002b
> Call Trace: [] []
> Code: 00 b0 02 e6 70 e6 80 e4 71 e6 80 88 c1 31 d2 88 ca 89 54 24
> ================================================================================
> 
> Please give us a hint on how to solve this problem.
> 
> p.s. Our system consists of
> -------------------------------
> L440GX+ (Dual Pentium III 550MHz, 24 cluster nodes; each node has no disk and
> uses the server's RAID), Compaq Proliant 1600 server (Dual Pentium III
> 600MHz), Serial HUB (Comtrol Rocketport), Fast Ethernet Hub (3com),
> 108 GB RAID
> 
Sir,

It seems like you have a kernel fault more than a LAM fault. Does this occur 
on one node only, or on random nodes? If it only happens on one node, I 
would assume that node has some sort of faulty hardware. Unfortunately, I 
am not qualified to analyse your Oops.
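
To answer the one-node-versus-random-nodes question, it can help to tally oopses per node. The following is only a minimal sketch: it assumes you have captured each node's console output (for example over the serial hub) into a string per node, and the names `oops_counts` and `summarize` are hypothetical helpers, not part of LAM or the kernel.

```python
from collections import Counter


def oops_counts(logs):
    """Count kernel oopses per node.

    `logs` is an assumed input: a dict mapping a node name to the text
    captured from that node's console.
    """
    counts = Counter()
    for node, text in logs.items():
        # Each oops report contains exactly one "Oops:" line.
        n = sum(1 for line in text.splitlines() if "Oops:" in line)
        if n:
            counts[node] = n
    return counts


def summarize(counts):
    """Turn the per-node tally into a rough first guess."""
    if not counts:
        return "no oops found"
    if len(counts) == 1:
        node = next(iter(counts))
        return f"only {node} oopsed: suspect that node's hardware"
    return f"{len(counts)} nodes oopsed: less likely to be one bad machine"
```

Called as `summarize(oops_counts(logs))`, this gives a first guess only; a single faulty node can still be the cause if the job happens to stress it more than the others.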

You might want to update your kernel to either 2.2.18 or 2.4.1 and see if the 
same oops occurs. If it still does, and on random machines at that, you 
might get better help on the linux-kernel mailing list.

Good luck.

Yours,
-Simen
--
Simen Thoresen, Beowulf-cleaner and random artist.

Er det ikke rart? (Isn't it strange?)
The gnu RART-project on http://valinor.dolphinics.no:1080/~simentt/rart