HELP! linux cluster with LAM-MPI

khocha at khocha at
Fri Feb 9 03:35:21 PST 2001

Dear All.

I'm a graduate student at Information and Communications Univ. in Korea.
In our lab, we built a diskless cluster using Intel L440GX+ boards.

Our system uses Linux kernel 2.2.13 and LAM-MPI 6.3.2.
During testing, however, the system ran into unexpected trouble.

The MPI test program performs only two communications (that is, it is 'EP' style):
1. distribute the data (at the beginning), 2. collect the result data (at the end).
It uses only a little memory, but runs many loop iterations.
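
The structure described above can be sketched roughly as follows. This is only an illustration of the pattern (distribute once, compute independently, collect once), not the actual test program: it uses plain Python lists instead of MPI calls, and the names split_work, compute_chunk and run_ep, as well as the arithmetic inside the loop, are hypothetical.

```python
# Hypothetical stand-in for the program structure described above.
# It does not use MPI; list slicing stands in for the distribute step
# and a list of partial results stands in for the collect step.


def split_work(data, n_workers):
    """Step 1 (distribute): cut data into n_workers nearly equal chunks."""
    base, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for rank in range(n_workers):
        # The first `extra` workers get one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def compute_chunk(chunk, iterations):
    """Loop-heavy, low-memory work done by one node with no communication."""
    total = 0
    for _ in range(iterations):
        for x in chunk:
            total = (total + x * x) % 1_000_003
    return total


def run_ep(data, n_workers, iterations):
    """Distribute once, compute independently, collect once."""
    chunks = split_work(data, n_workers)
    # Each chunk is processed on its own; nothing is exchanged in this phase.
    partials = [compute_chunk(chunk, iterations) for chunk in chunks]
    # Step 2 (collect): the per-worker results are returned together.
    return partials
```

Raising the iterations argument corresponds to increasing the number of loop operations: the two communication steps stay the same and only the compute phase gets longer.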

With a few iterations it works well, but when we increase the number of loop iterations
to solve harder problems, a node displays the following error message and then
goes down.

[root at node11 root]# Unable to handle kernel paging request at virtual address e6
current->tss.cr3 = 07591000, %cr3 = 07591000
*pde = 00000000
Oops: 0002
CPU:    1
EIP:    0010:[]
EFLAGS: 00010246
eax: 00000000   ebx: c7593fb4   ecx: 00000286   edx: 00000000
esi: 00000000   edi: c7592000   ebp: c7593fbc   esp: c7593fa0
ds: 0018   es: 0018   ss: 0018
Process vital (pid: 424, process nr: 20, stackpage=c7593000)
Stack: bffffe14 00000032 00000005 00000000 c7592000 00000000 1dcd6500 bffffd3c 
       c0109fb8 bffffd34 00000000 40107bec 00000000 bffffe14 bffffd3c 000000a2 
       c010002b 0000002b 000000a2 400a9f51 00000023 00000206 bffffd14 0000002b 
Call Trace: [] [] 
Code: 00 b0 02 e6 70 e6 80 e4 71 e6 80 88 c1 31 d2 88 ca 89 54 24 

Please give us any hints for solving this problem.

P.S. Our system consists of:
L440GX+ boards (dual Pentium III 550MHz; 24 cluster nodes, each diskless and using the server's RAID),
Compaq ProLiant 1600 server (dual Pentium III 600MHz),
Serial hub (Comtrol RocketPort),
Fast Ethernet hub (3Com).

Your quick reply will be highly appreciated.
Best Regards.
