[Beowulf] emergent behavior - correlation of job end times

Tue Jul 24 11:52:49 PDT 2018

Hi all,

Thought some of you might find this interesting.

Using the WGS (aka CA aka Celera) genome assembler there is a step which 
runs a large number (in this instance, 47634) of overlap comparisons.  
There are N sequences (many millions, of three different types) and it 
makes many sequence ranges and compares them pairwise, like 100-200 vs. 
1200-1300.  There is a job scheduler that keeps 40 jobs going at all 
times.  However, during a run the jobs are independent: they do not 
communicate with each other or with the job controller.

The initial observation was that "top" showed a very nonrandom 
distribution of elapsed times.  Large numbers of jobs (20 or 30) 
appeared to have correlated elapsed times.  So the end times for the 
jobs were determined and these were stored in a histogram with 1 minute 
wide bins.  When plotted it shows the job end times clumping up, and 
what could be beat frequencies.  I did not run this through any sort of 
autocorrelation analysis but the patterns are easily seen by eye when 
plotted.  See for instance the region around 6200-6400.  The patterns 
evolve over time, possibly because of differences in the regions of 
data.  (Note, a script was changed around minute 2738, so don't compare 
patterns before that with patterns after it.)  The jobs were all running 
single threaded and they were pretty much nailed at 99.9% CPU usage 
except when they started up or shut down.  Each wrote its output through 
a gzip process to a compressed file, and they all seemed to be writing 
more or less all the time.  However the gzip processes used a negligible 
fraction of the CPU time.
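
For anyone who wants to try the autocorrelation step that was skipped 
above, here is a minimal sketch in Python.  It assumes the job end times 
are available as seconds since the start of the run (an assumption; the 
posted file already holds binned counts), and the function names are 
made up for illustration.  A peak at lag k in the result would suggest 
that clumps of job endings recur every k minutes.

```python
from collections import Counter


def minute_histogram(end_times_sec):
    """Count job end times in 1-minute-wide bins (dense list, bin 0 first)."""
    counts = Counter(int(t // 60) for t in end_times_sec)
    if not counts:
        return []
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]


def autocorrelation(counts, max_lag):
    """Normalized autocorrelation of the bin counts for lags 0..max_lag."""
    n = len(counts)
    mean = sum(counts) / n
    dev = [c - mean for c in counts]
    var = sum(d * d for d in dev)
    if var == 0:
        # A perfectly flat histogram has no structure to correlate.
        return [0.0] * (max_lag + 1)
    return [
        sum(dev[i] * dev[i + lag] for i in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]
```

Since the patterns evolve over the run, it probably makes more sense to 
apply this to windows of a few hundred minutes than to the whole 
histogram at once.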

That histogram data is in end_times_histo.txt.gz on the 6th or so post 
here:

    https://github.com/alekseyzimin/masurca/issues/45

The subrange data for the jobs is in ovlopt.gz.

So, the question is, what might be causing the correlation of the job 
run times?

The start times were also available and these do not indicate any 
induced "binning".  That is, the controlling process isn't waiting for a 
long interval to pass and then starting a bunch of jobs all at once.  
Probably it is looping on a wait() with a 1 second sleep() [because it 
uses no CPU time] and starts the next job as soon as one exits.
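
To make that guess concrete, here is a rough Python sketch of a 
controller of this kind.  It is not the assembler's actual scheduler 
code, just an illustration of the assumed behavior; run_pool and its 
parameters are hypothetical names, and it polls each child with 
Popen.poll() rather than calling wait().

```python
import subprocess
import sys
import time


def run_pool(commands, max_jobs=40, poll_interval=1.0):
    """Keep up to max_jobs children running; return how many finished."""
    pending = list(commands)
    running = []
    finished = 0
    while pending or running:
        # Drop children that have exited since the last poll.
        still_running = []
        for proc in running:
            if proc.poll() is None:
                still_running.append(proc)
            else:
                finished += 1
        running = still_running
        # Refill the pool as soon as slots are free.
        while pending and len(running) < max_jobs:
            running.append(subprocess.Popen(pending.pop(0)))
        if running:
            # Sleeping between polls is why the controller uses no CPU.
            time.sleep(poll_interval)
    return finished


if __name__ == "__main__":
    # Tiny demo: six trivial jobs, at most two at a time.
    jobs = [[sys.executable, "-c", "pass"]] * 6
    print(run_pool(jobs, max_jobs=2, poll_interval=0.1))
```

A loop like this delays each start by at most one poll interval, which 
is far too small to create structure in 1 minute wide bins.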

One possibility is that at the "leading" edge the first job that reads a 
section of data will do so slowly, while later jobs will take the same 
data out of cache.  That will lead to a "peloton" sort of effect, where 
the leader is slowed and the followers accelerated.  iostat didn't show 
very much disk IO though.

Another possibility is that the jobs are fighting for memory cache (each 
is many GB in size) and that somehow or other also syncs them.

My last guess is that the average run times in a given section of data 
may be fairly constant, and that with a bit of drift in some parts of 
the run they became synchronized by chance.
The extent of synchronization seems too high though: around minute 6500 
half the jobs were ending at about the same time, and it was like that 
for around 1000 minutes.
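
One way to gauge how much clumping chance alone might produce is a small 
Monte Carlo null model.  The sketch below is only that: it assumes 40 
fully independent slots running jobs back to back, Gaussian run times, 
and random starting phases.  The 30 minute mean and 1 minute jitter are 
placeholder values, not measurements from this run, and the function 
names are hypothetical.

```python
import random
from collections import Counter


def simulate_end_times(n_slots=40, n_jobs=47634, mean_minutes=30.0,
                       jitter_minutes=1.0, seed=0):
    """Return job end times (in minutes) for independent job slots."""
    rng = random.Random(seed)
    # Time at which each slot next becomes free, with random phases.
    slot_free = [rng.uniform(0.0, mean_minutes) for _ in range(n_slots)]
    ends = []
    for _ in range(n_jobs):
        # The next job goes to whichever slot frees up first.
        slot = min(range(n_slots), key=slot_free.__getitem__)
        duration = max(0.1, rng.gauss(mean_minutes, jitter_minutes))
        slot_free[slot] += duration
        ends.append(slot_free[slot])
    return ends


def peak_bin_count(end_times_min):
    """Largest number of jobs ending within any single 1-minute bin."""
    counts = Counter(int(t) for t in end_times_min)
    return max(counts.values())
```

If peak_bin_count() on simulated runs stays well below 20 for plausible 
jitter values, that would argue against pure chance and for some real 
coupling between the jobs (cache, memory or IO).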

Is this sort of thing common? What else could cause it?

System info: Dell PowerEdge T630, CentOS 6.9, CPU Xeon E5-2650 as 2 CPUs 
with 10 cores/CPU and 2 threads/core for 40 "CPUs", NUMA with even cpus 
on node0 and odd on node1, 512 GB RAM, RAID5 with 4 disks for 11.7 TB.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech