[Beowulf] Two problems related to slowness and
hahn at mcmaster.ca
Tue Jun 12 08:14:55 PDT 2007
> For 32 processes (4 process per node), the arrays with 512-Byte size are
> communicated slower than the 4096-Byte size arrays. For both of them, we
do you mean that this is not the case in other configurations?
an interconnect _should_ have some steep rise in effective bandwidth
as packet size is increased. it's a useful metric to know the packet
size at which half-peak bandwidth is achieved, since this offers some
"sense of scale" to programmers judging whether their own packet sizes
> this abnormal case is persistent. More specifically, communication of
> 4k-Byte packages are 2 times faster than the communication of 512-Byte
perhaps I'm dense this morning, but what's unexpected about that?
> The OSU bandwidth and latency test around these points shows:
> Byte MB/s
> 256 417.53
> 512 592.34
> 1024 691.02
> 2048 857.35
> 4096 906.04
> 8192 1022.52
the osu_bw test is a streaming, fire-and-forget one which strongly
rewards message aggregation. (this is not necessarily deceptive -
it's measuring a real communication pattern, though it's not the
only way to quantify bandwidth.) you can see that it's aggregating
because the reported bandwidth for small packets is much higher than
you'd expect if each packet took the latency reported below.
(unless my math is wrong, 256/(2*4.79e-6) = 26.7 MB/s)
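that arithmetic can be checked directly (the round-trip assumption is mine,
matching the 2x in the formula above):

```python
# If every 256-byte message paid a full round trip at the osu_latency
# time quoted below, effective bandwidth would be far lower than the
# 417.53 MB/s that osu_bw reports -- evidence of aggregation/pipelining.
size = 256          # bytes
lat = 4.79e-6       # one-way latency at 256 bytes, seconds

bw_naive = size / (2 * lat) / 1e6   # MB/s, one message per round trip
print(f"{bw_naive:.1f} MB/s")       # 26.7 MB/s
```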
> Time (usec)
> 256 4.79
> 512 5.48
> 1024 6.60
> 2048 8.30
> 4096 11.02
> So this behavior does not seem reasonable to us.
> 2. SOMETIMES, after the test with overall 32 processes, one of the four
> processes at node3 hangs in TASK_UNINTERRUPTIBLE "D" state. Hence, the test
> program shows a "done." and waits for sometime. We can neither kill the
> process nor soft reboot the node. We have to wait for that process to
> terminate, which can last long.
does /proc/$pid/wchan (on the 'D' state process) tell you anything?
do all the ranks return from MPI_Finalize?
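those /proc checks are easy to script; a minimal sketch (the pid of the
stuck rank is of course whatever you observe, os.getpid() here is just a
placeholder so the snippet runs):

```python
import os

def proc_state(pid):
    """Return the one-letter state ('R', 'S', 'D', ...) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        # the second field, (comm), may contain spaces, so split on the
        # closing paren rather than naively on whitespace
        return f.read().rsplit(")", 1)[1].split()[0]

def proc_wchan(pid):
    """Return the kernel symbol the process is sleeping in, if readable."""
    try:
        with open(f"/proc/{pid}/wchan") as f:
            return f.read().strip()
    except OSError:
        return ""

pid = os.getpid()  # substitute the pid of the 'D'-state process
print(pid, proc_state(pid), proc_wchan(pid))
```

a 'D'-state process with a wchan inside an interconnect or filesystem
driver would point at where the hang actually lives.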
regards, mark hahn.