[Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster (72 CPUs)
Mark Hahn
hahn at physics.mcmaster.ca
Wed Jan 3 07:53:35 PST 2007
> " p1_8544: p4_error: Timeout in Establishing connection to remote process:
> 0 "
> rm_l_1_8667: (359.417969) net_send: could not write to fd=5, errno=104
>
> We have been trying the same for the past two days and we didn't get any
> solution for the above.
but what have you tried? I would guess that this is a simple rsh config
problem, nothing to do with mpich.
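
(errno=104 is ECONNRESET on Linux, i.e. the remote side dropped the connection.) one way to test the rsh guess before touching mpich is to run a trivial remote command on every node in the machines file. a minimal sketch in Python, assuming a file with one "host" or "host:ncpus" entry per line; the function names are made up for illustration and are not part of mpich:

```python
import subprocess
import sys


def parse_machines(text):
    """Return unique host names from machines-file text.

    Assumed format: one 'host' or 'host:ncpus' per line, '#' starts a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        host = line.split()[0].split(":", 1)[0]
        if host not in hosts:
            hosts.append(host)
    return hosts


def probe_command(host, shell="rsh"):
    """Build a command that runs 'true' on host without any prompting."""
    if shell == "ssh":
        # BatchMode makes ssh fail instead of asking for a password.
        return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
                host, "true"]
    return [shell, host, "true"]


def check_hosts(hosts, shell="rsh", run=subprocess.run, timeout=30):
    """Return {host: reason} for every host where the probe failed.

    Classic rsh does not forward the remote exit status, so this mainly
    catches connection and authentication failures.
    """
    failed = {}
    for host in hosts:
        try:
            result = run(probe_command(host, shell),
                         stdin=subprocess.DEVNULL,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            failed[host] = str(exc)
            continue
        if result.returncode != 0:
            message = result.stderr.decode(errors="replace").strip()
            failed[host] = message or "exit %d" % result.returncode
    return failed


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python check_nodes.py MACHINES_FILE [rsh|ssh]
    with open(sys.argv[1]) as handle:
        nodes = parse_machines(handle.read())
    remote_shell = sys.argv[2] if len(sys.argv) > 2 else "rsh"
    for node, reason in check_hosts(nodes, remote_shell).items():
        print("%s: %s" % (node, reason))
```

run it as the same user, and from the same node, that starts the mpi job. any host it prints needs its rsh (or ssh) setup fixed first; if it prints nothing, remote login is probably not the culprit.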
> Also we downloaded the Latest MPICH 1.2.7p1 and configured the same. now for
but why do you think the problem lies with mpich?
> The same testing with LAM/MPI and OPENMPI are working fine.
lam being mostly just a previous version of openmpi, and openmpi I think
inheriting lam's agent-based process-starting, no?
personally, I'm pretty convinced that MPI implementations should stay
out of the jobstarter business, and go with straight agentless (ssh-based)
job spawning.