[Beowulf] Two problems related to slowness and TASK_UNINTERRUPTIBLE process

Tahir Malas tmalas at ee.bilkent.edu.tr
Tue Jun 12 00:25:37 PDT 2007

Hi all,
We have an 8-node HP cluster connected via InfiniBand; each node has two
quad-core processors. We use Voltaire DDR cards and a 24-port switch, with
OFED 1.1 and MVAPICH 0.9.7. We have two puzzling problems that we have not
yet been able to resolve:

1. In our test program, which mimics the communication in our code, the
nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We
perform one-to-one communication between these pairs of nodes
simultaneously. We use blocking MPI send and receive calls to transfer
integer arrays of various sizes. In addition, we consider different
numbers of processes:
(a) 1 process per node, 8 processes overall: one link is established between
each pair of nodes.
(b) 2 processes per node, 16 processes overall: two links are established
between each pair of nodes.
(c) 4 processes per node, 32 processes overall: four links are established
between each pair of nodes.
(d) 8 processes per node, 64 processes overall: eight links are established
between each pair of nodes.
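
The pairing in cases (a) to (d) can be sketched as below. This is only an
illustration, not the actual test program: it assumes ranks are placed in
blocks (ranks 0..ppn-1 on node 0, the next ppn on node 1, and so on), which
the description above does not state, and the function names are made up.

```python
# Sketch only. Assumption: block rank placement, i.e. ranks
# [node * ppn, (node + 1) * ppn) all run on the same node.
NODES = 8


def node_of(rank, ppn):
    """Node index hosting `rank` when `ppn` processes run per node."""
    return rank // ppn


def partner_rank(rank, ppn):
    """Peer rank on the paired node: nodes 0<->1, 2<->3, 4<->5, 6<->7."""
    local = rank % ppn                    # position of the rank within its node
    partner_node = node_of(rank, ppn) ^ 1  # flip the lowest bit of the node index
    return partner_node * ppn + local


def links_per_node_pair(ppn, nodes=NODES):
    """Count how many rank pairs (links) connect each pair of nodes."""
    links = {}
    for rank in range(nodes * ppn):
        peer = partner_rank(rank, ppn)
        if rank < peer:                   # count every link once
            key = (node_of(rank, ppn), node_of(peer, ppn))
            links[key] = links.get(key, 0) + 1
    return links


if __name__ == "__main__":
    for ppn in (1, 2, 4, 8):
        print(ppn, "per node,", NODES * ppn, "overall:", links_per_node_pair(ppn))
```

Under that placement assumption, each process exchanges messages only with
the process that has the same local index on the partner node, so the number
of links between two paired nodes equals the number of processes per node.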

We obtain reasonable timings, except for the following comparison:

For 32 processes (4 processes per node), the 512-byte arrays are
communicated more slowly than the 4096-byte arrays. For both sizes, we
send/receive 1,000,000 arrays and take the average to find the time per
package. Only the package size changes. We have made many trials and
confirmed that this anomaly is persistent. More specifically, communicating
4096-byte packages is 2 times faster than communicating 512-byte packages.

The OSU bandwidth and latency tests around these sizes show:

Size (bytes)    Bandwidth (MB/s)
256             417.53
512             592.34
1024            691.02
2048            857.35
4096            906.04
8192            1022.52

Size (bytes)    Latency (usec)
256             4.79
512             5.48
1024            6.60
2048            8.30
4096            11.02
So this behavior does not seem reasonable to us.
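
To make the comparison concrete, the OSU figures above can be turned into an
expected per-message cost. The sketch below is only arithmetic on the quoted
numbers; it assumes 1 MB = 10^6 bytes in the bandwidth figures, and the
function name is made up for illustration.

```python
# OSU figures quoted above, keyed by message size in bytes.
BANDWIDTH_MB_S = {512: 592.34, 4096: 906.04}
LATENCY_USEC = {512: 5.48, 4096: 11.02}


def streaming_time_usec(size_bytes, mb_per_s):
    """Per-message time implied by a streaming bandwidth figure.

    Assumption: 1 MB = 10**6 bytes, so
    size / (mb_per_s * 1e6) seconds == size / mb_per_s microseconds.
    """
    return size_bytes / mb_per_s


if __name__ == "__main__":
    for size in (512, 4096):
        t_bw = streaming_time_usec(size, BANDWIDTH_MB_S[size])
        print(f"{size:5d} bytes: {t_bw:.2f} usec/message from bandwidth, "
              f"{LATENCY_USEC[size]:.2f} usec latency")
    ratio_lat = LATENCY_USEC[4096] / LATENCY_USEC[512]
    ratio_bw = (streaming_time_usec(4096, BANDWIDTH_MB_S[4096])
                / streaming_time_usec(512, BANDWIDTH_MB_S[512]))
    print(f"4096 vs 512 bytes: {ratio_lat:.2f}x by latency, {ratio_bw:.2f}x by bandwidth")
```

By either measure a 4096-byte message should cost more than a 512-byte one
(about 2x by latency and about 5x by bandwidth), which is the opposite of
what we measure with 32 processes.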

2. SOMETIMES, after the test with 32 processes overall, one of the four
processes on node3 hangs in the TASK_UNINTERRUPTIBLE ("D") state. As a
result, the test program prints "done." and then waits for some time. We can
neither kill the process nor soft-reboot the node. We have to wait for that
process to terminate, which can take a long time.

Does anybody have any comments on these issues?
Thanks in advance,
Tahir Malas
Bilkent University 
Electrical and Electronics Engineering Department
