[Beowulf] emergent behavior - correlation of job end times

Christopher Samuel chris at csamuel.org
Tue Jul 24 19:08:58 PDT 2018

On 25/07/18 04:52, David Mathog wrote:

> One possibility is that at the "leading" edge the first job that
> reads a section of data will do so slowly, while later jobs will take
> the same data out of cache.  That will lead to a "peloton" sort of
> effect, where the leader is slowed and the followers accelerated.
> iostat didn't show very much disk IO though.

I have to admit that was my first thought too. I also started to
speculate about power saving, but I couldn't see how that would let
later jobs catch up enough.

One fun thing would be to turn hyperthreading (HT) off, set the
scheduler to run 20 jobs at a time, and see if it still happens then.
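
As a rough sketch of the "20 at a time" run, assuming plain Python
stands in for the real batch scheduler (the names run_one and
run_limited are made up for illustration, and the job commands are
placeholders), something like this caps concurrency and logs when each
job starts and ends so the end times can be compared afterwards:

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_one(index, cmd):
    """Run one job and record its start/end on a monotonic clock."""
    start = time.monotonic()
    returncode = subprocess.run(cmd).returncode
    end = time.monotonic()
    return {"job": index, "start": start, "end": end,
            "returncode": returncode}


def run_limited(cmds, max_parallel=20):
    """Run all commands with at most max_parallel jobs in flight.

    Hypothetical helper: a real cluster would set this limit in the
    batch scheduler's configuration instead.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_one, i, cmd)
                   for i, cmd in enumerate(cmds)]
        # Results come back in submission order, not completion order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Placeholder jobs; substitute the real job command lines.
    demo = [[sys.executable, "-c", "pass"]] * 4
    for rec in sorted(run_limited(demo, max_parallel=2),
                      key=lambda r: r["end"]):
        print(rec["job"], round(rec["end"] - rec["start"], 3))
```

The timestamps are only meaningful relative to each other within one
run. If the end times still bunch up with HT off and 20 jobs in
flight, that would point away from sibling-thread contention.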

Perhaps run this step under "perf record" to capture profile data,
and then see whether you can spot differences across the runs? I'm
not sure whether there are scripts to do that, or how easy it would
be to rig up (and of course the extra I/O from recording the traces
will perturb the system).
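
One way this might be rigged up, as a minimal sketch only: wrap each
job in "perf record" with its own output file via -o, then compare
pairs of runs with "perf diff". The helper names, the perf-data
directory and the ./step command below are all invented for
illustration; only the perf subcommands and the -o option are real.

```python
import shlex


def perf_record_cmd(job_id, job_cmd, out_dir="perf-data"):
    """Build a 'perf record' command writing to a per-job data file."""
    out_file = f"{out_dir}/perf.{job_id}.data"
    return ["perf", "record", "-o", out_file, "--"] + list(job_cmd)


def perf_diff_cmd(baseline_id, other_id, out_dir="perf-data"):
    """Build a 'perf diff' command comparing two recorded jobs."""
    return ["perf", "diff",
            f"{out_dir}/perf.{baseline_id}.data",
            f"{out_dir}/perf.{other_id}.data"]


if __name__ == "__main__":
    # Print the command lines rather than running them, so this works
    # on a machine without perf installed. "./step" is a placeholder.
    for job_id in range(3):
        cmd = perf_record_cmd(job_id, ["./step", f"chunk{job_id}.dat"])
        print(shlex.join(cmd))
    print(shlex.join(perf_diff_cmd(0, 2)))
```

Writing the per-job data files to local disk or tmpfs rather than the
shared filesystem should reduce, though not remove, the perturbation
from the extra I/O.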

A very interesting problem!

All the best,
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

