[Beowulf] VASP on Clusters

Thu May 5 09:13:15 PDT 2005

We are having a problem running VASP on our clusters. (The older cluster
is 48 nodes of 1.8GHz Athlons with 1GB RAM and GigE; the slightly newer
one is 12 nodes of dual 2.6GHz Xeons with 2GB RAM and GigE.) We have
installed the program and run the benchmarks, with some compilation
adjustments on the old system to use smaller block sizes to eliminate
some network receive buffer overruns. Being a poor humble system
administrator, I don't pretend to understand the meaning of the output,
but I'm assured that the results on both clusters were acceptable.

The problem is that when we run the program on larger jobs the results
are inconsistent.  As I understand it, the older cluster will run small
systems, but gives erroneous results on medium and large jobs.  The
newer cluster gives good results on small and medium runs but fails on a
large test case. (The results differ from those of a single-processor
job run on two different machines.)  I'm sorry for the somewhat vague
description of small, medium, and large, but the experts tell me that so
many different parameters of a VASP job can increase the complexity of
the run that it is nearly impossible to explain it to a mere mortal like
myself.

I'm doing some tests to see if the network buffer overruns have returned
with these larger jobs, and if I can eliminate them by making the MPI
block size even smaller (but this reduces efficiency).  I'm hoping a VASP
expert on the list can point me to a solution, describe the minimal
hardware necessary to avoid these problems, or tell me that it is
impossible and that I should quit my job and just go get a beer.  I've
looked for but not found a VASP users list, so I'm hoping someone on
this list can help.
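
For anyone who wants to check their own nodes for the same thing, here
is a minimal sketch of one way to watch for receive overruns around a
job. It assumes Linux nodes that expose interface counters in
/proc/net/dev; the function names (parse_rx_counters, rx_delta) are made
up for illustration, and this is not necessarily the exact test being
run here.

```python
def parse_rx_counters(text):
    """Parse the contents of /proc/net/dev (assumed Linux layout).

    Returns {interface: {"errs": int, "drop": int, "fifo": int}} using the
    receive-side error, drop, and fifo (overrun) columns.
    """
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 5:
            continue
        # Receive columns: bytes packets errs drop fifo ...
        counters[name.strip()] = {
            "errs": int(fields[2]),
            "drop": int(fields[3]),
            "fifo": int(fields[4]),
        }
    return counters


def rx_delta(before, after):
    """Return per-interface counter increases between two snapshots."""
    delta = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # interface appeared between snapshots; skip it
        delta[name] = {key: new[key] - old[key] for key in new}
    return delta


# Hypothetical usage on a compute node:
#   before = parse_rx_counters(open("/proc/net/dev").read())
#   ... run the job ...
#   after = parse_rx_counters(open("/proc/net/dev").read())
#   print(rx_delta(before, after))
```

A nonzero increase in "fifo" or "drop" on the GigE interface during a
run would suggest that the overruns are back for that job size.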

Thanks,

Dan