[Beowulf] emergent behavior - correlation of job end times
mathog at caltech.edu
Tue Jul 24 11:52:49 PDT 2018
Thought some of you might find this interesting.
Using the WGS (aka CA aka Celera) genome assembler there is a step which
runs a large number (in this instance, 47634) of overlap comparisons.
There are N sequences (many millions, of three different types) and it
makes many sequence ranges and compares them pairwise, like 100-200 vs.
1200-1300. There is a job scheduler that keeps 40 jobs going at all
times. However, during a run the jobs are independent: they do not
communicate with each other or with the job controller.
The initial observation was that "top" showed a very nonrandom
distribution of elapsed times. Large numbers of jobs (20 or 30)
appeared to have correlated elapsed times. So the end times for the
jobs were determined and these were stored in a histogram with 1 minute
wide bins. When plotted it shows the job end times clumping up, and
what could be beat frequencies. I did not run this through any sort of
autocorrelation analysis but the patterns are easily seen by eye when
plotted. See for instance the region around 6200-6400. The patterns
evolve over time, possibly because of differences in the regions of
data. (Note, a script was changed around minute 2738, so don't compare
patterns before that with patterns after it.) The jobs were all running
single threaded and they were pretty much nailed at 99.9% CPU usage
except when they started up or shut down. Each wrote its output through
a gzip process to a compressed file, and they all seemed to be writing
more or less all the time. However the gzip processes used a negligible
fraction of the CPU time.
That histogram data is in end_times_histo.txt.gz on the 6th or so post
The subrange data for the jobs is in ovlopt.gz.
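For anyone who wants to go beyond eyeballing the plot, here is a minimal Python sketch of the binning described above plus the autocorrelation step that was not run. It assumes the end times are available as plain numbers in seconds. The function names and the input format are illustrative assumptions, not the layout of end_times_histo.txt.gz.

```python
from collections import Counter

def minute_histogram(end_times_sec, bin_sec=60):
    """Count job end times in fixed-width bins (1 minute by default).

    Returns a dense list: index = bin number, value = jobs ending in that bin.
    """
    if not end_times_sec:
        return []
    counts = Counter(int(t // bin_sec) for t in end_times_sec)
    last_bin = max(counts)
    return [counts.get(b, 0) for b in range(last_bin + 1)]

def autocorrelation(series, max_lag):
    """Normalized autocorrelation of series for lags 0..max_lag.

    Lag 0 is 1.0 by construction. A peak at lag k suggests the end times
    clump with a period of about k bins. Requires max_lag < len(series).
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev)
    if var == 0:
        # Undefined for a perfectly flat series; report zeros instead.
        return [0.0] * (max_lag + 1)
    return [
        sum(dev[i] * dev[i + lag] for i in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]

if __name__ == "__main__":
    # Synthetic example: 5 jobs end together every 10 minutes.
    ends = [600 * k + j for k in range(20) for j in range(5)]
    hist = minute_histogram(ends)
    acf = autocorrelation(hist, 20)
    print("lag with strongest repeat:", max(range(1, 21), key=lambda k: acf[k]))
```

Two clump periods that are close but not equal would show up as two nearby peaks, which is one way to check the beat-frequency impression.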
So, the question is, what might be causing the correlation of the job
end times?
The start times were also available and these do not indicate any
induced "binning". That is, the controlling process isn't waiting for a
long interval to pass and then starting a bunch of jobs all at once.
Probably it is spinning on a wait() with 1 second sleep() [because it
uses no CPU time] and starts the next job as soon as one exits.
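To make the guess about the controller concrete, here is a minimal Python sketch of a poll-and-sleep loop that keeps a fixed number of jobs alive. It is not the WGS scheduler's actual code. The name run_pool and its parameters are hypothetical, and the real controller may well block in wait() rather than poll.

```python
import subprocess
import sys
import time

def run_pool(commands, max_jobs=40, poll_sec=1.0):
    """Run commands, keeping up to max_jobs alive at once.

    Polls for finished children, refills free slots, then sleeps poll_sec.
    Returns a list of (command_index, return_code) in the order the
    exits were noticed.
    """
    pending = list(commands)
    running = []    # list of (index, Popen)
    finished = []   # list of (index, return_code)
    next_idx = 0
    while pending or running:
        # Reap any jobs that have exited since the last poll.
        still_running = []
        for idx, proc in running:
            rc = proc.poll()
            if rc is None:
                still_running.append((idx, proc))
            else:
                finished.append((idx, rc))
        running = still_running
        # Refill free slots right away.
        while pending and len(running) < max_jobs:
            running.append((next_idx, subprocess.Popen(pending.pop(0))))
            next_idx += 1
        if running:
            time.sleep(poll_sec)  # uses essentially no CPU while waiting
    return finished

if __name__ == "__main__":
    cmds = [[sys.executable, "-c", "pass"]] * 5
    print(run_pool(cmds, max_jobs=2, poll_sec=0.05))
```

A loop like this delays each start by at most one poll interval after a slot frees up, so with a 1 second sleep it could not by itself produce clumps that are visible in 1 minute bins.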
One possibility is that at the "leading" edge the first job that reads a
section of data will do so slowly, while later jobs will take the same
data out of cache. That will lead to a "peloton" sort of effect, where
the leader is slowed and the followers accelerated. iostat didn't show
very much disk IO though.
Another possibility is that the jobs are fighting for memory cache (each
is many GB in size) and that somehow or other this also syncs them.
My last guess is that the average run times in a given section of data
may be fairly constant, and that with a bit of drift in some parts of
the run they became synchronized by chance.
The extent of synchronization seems too high for that, though. Around
6500 minutes half the jobs are ending at about the same time, and it was
like that for around 1000 minutes.
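One way to put a number on "too high" is a toy null model: 40 independent slots, each running jobs back to back with nearly constant run times, and no coupling of any kind. The Python sketch below is only that. The 30 minute mean, the 0.5 minute jitter, the job count and the function names are all made-up assumptions, not values measured from this run.

```python
import random
from collections import Counter

def simulate_end_minutes(n_slots=40, n_jobs=2000, mean_min=30.0,
                         jitter_min=0.5, stagger=True, seed=0):
    """End times (in minutes) for jobs run back to back on n_slots slots.

    Each job takes mean_min plus uniform jitter in [-jitter_min, jitter_min].
    With stagger=True the slots first become free at random offsets, so the
    jobs do not start in lockstep. All numbers are illustrative assumptions.
    """
    rng = random.Random(seed)
    if stagger:
        free_at = [rng.uniform(0.0, mean_min) for _ in range(n_slots)]
    else:
        free_at = [0.0] * n_slots
    ends = []
    for _ in range(n_jobs):
        # The next job goes to whichever slot frees up first.
        slot = min(range(n_slots), key=free_at.__getitem__)
        end = free_at[slot] + mean_min + rng.uniform(-jitter_min, jitter_min)
        free_at[slot] = end
        ends.append(end)
    return ends

def peak_clump_fraction(ends, n_slots=40):
    """Largest number of jobs ending in any 1-minute bin, as a fraction
    of n_slots."""
    counts = Counter(int(e // 1) for e in ends)
    return max(counts.values()) / n_slots

if __name__ == "__main__":
    print("independent slots:", peak_clump_fraction(simulate_end_minutes()))
    lockstep = simulate_end_minutes(jitter_min=0.0, stagger=False)
    print("forced lockstep:  ", peak_clump_fraction(lockstep))
```

The forced-lockstep case gives 1.0 by construction. If the independent case stays well below the roughly 0.5 seen around minute 6500 for any plausible choice of jitter, that would argue against pure chance and for some real coupling between the jobs, such as the cache effects guessed at above.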
Is this sort of thing common? What else could cause it?
System info: Dell PowerEdge T630, CentOS 6.9, two Xeon E5-2650 CPUs
with 10 cores/CPU and 2 threads/core for 40 "CPUs", NUMA with even cpus
on node0 and odd on node1, 512 GB RAM, RAID5 with 4 disks for 11.7 TB.
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech