More on cluster hang problem....
Cris Rhea crhea at mayo.edu
Thu Jun 7 13:10:37 PDT 2001
- Previous message: dual/quadruple ports on a NICs
- Next message: PBS on beowulf cluster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
First, let me thank the folks who offered suggestions on how to diagnose this problem and suggestions for solutions...

---------------------------------
Jon Tegner - Pointed me at a note on the VA Linux tech list about RH7.1 messing up disk partitioning.
David van der Spoel - Pointer to a web site discussing flaky hardware (esp. memory).
Tony Skjellum - Suggested using his company's commercial version of MPI.
Patrick Lesher - Suggested overheating and using a software package called "sensors" to read MB temps.
Mark Hahn - There are known problems with the KT133A-based systems.
John LaBounty - How to force power off with an ATX power supply.
Robert G. Brown - In his experience, this points to a memory leak and/or swap space issue.
Kevin Simpson - Script for monitoring for memory leak, etc.
Jacobs - Out of RAM issue.
---------------------------------

Where we are now.......

After reading all the suggestions, nothing jumped out as the cause of our problem. John LaBounty's comments on ATX supplies were very helpful, as they allowed us to power cycle a stuck node without rebooting the next node (on a RackSaver RS1200, there are 2 systems in the same 1U box with a single power cord).

Some more data points and ideas-

1. Went to the 2.4.4 kernel on all 4 nodes. No change in the behavior.

2. Built a similar mini-cluster on 2 Dell 2450's that arrived for a different project (1GHz PIII's). One system has a single CPU, the other has two CPUs. The application runs perfectly on the Dells and runs to completion reliably (it is set in a loop to re-run after it finishes, and so far it has completed eight 15-hour runs without a problem).

3. Issue with RAM and swap- swap was config'ed as 2X RAM (1GB physical RAM in each system). The application does NOT leak memory (as measured by xosview [cool little tool!]).

4. Nodes are named "rsnode1" ... "rsnode4". If we run only on nodes 3 and 4, things run fine without crashing (again, no memory leak over a ~15 hour run). If I run on rsnode2, rsnode3 and rsnode4, it will crash rsnode2 after an hour or so. If I run on rsnode1 and rsnode2, it will crash rsnode1 after ~10 mins.

5. No messages at all in /var/log/messages around the crash time.

I think I'm back to flaky hardware in rsnode1 and rsnode2. Any time I involve rsnode1, things crash within 10-15 minutes. Any time I involve rsnode2, things go longer, but still crash.

With two out of four systems involved in the issue, I had assumed it was a code/kernel issue rather than just a simple hardware one. Since these two nodes (rsnode1 and rsnode2) are physically in the same 1U box, I suspect a batch of bad parts somewhere along the way.

I think my next experiment will be to configure MPI to use rsnode3 and rsnode4 as well as the two Dell 2450's....

Stay tuned to the soap opera....

--- Cris

---
Cristopher J. Rhea
Mayo Foundation
Research Computing Facility
Pavilion 2-25
crhea at Mayo.EDU
Rochester, MN 55905
Fax: (507) 266-4486 (507) 284-0587
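The memory check in item 3 was done by eye with xosview, and item 5 notes that nothing reaches /var/log/messages before a node hangs. As a rough sketch only (not the monitoring script mentioned in the list of suggestions above), something like the following could append free-memory and swap figures to a file on each node, so the last readings survive a hang. It assumes a Linux /proc/meminfo that exposes "MemFree", "SwapTotal" and "SwapFree" values in kB; the log path, the interval, and all function names are hypothetical.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: first integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines or anything that is not "Name: <number> ...".
        if not sep or not fields or not fields[0].isdigit():
            continue
        info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Return one log line with free RAM and swap in use (assumed to be kB)."""
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return "MemFree=%d kB SwapUsed=%d kB" % (info["MemFree"], swap_used)


def log_forever(path="/var/tmp/memwatch.log", interval=60):
    """Append a timestamped summary every `interval` seconds (hypothetical path)."""
    while True:
        with open("/proc/meminfo") as f:
            line = summarize(parse_meminfo(f.read()))
        with open(path, "a") as out:
            out.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), line))
            out.flush()
            # Push the line to disk so it is less likely to be lost in a hang.
            os.fsync(out.fileno())
        time.sleep(interval)
```

Calling log_forever() on each node before starting the MPI job would leave a per-node record to compare after a hang. Flat numbers right up to the crash on rsnode1 and rsnode2 would be consistent with the hardware suspicion rather than a RAM or swap problem.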
More information about the Beowulf mailing list
