[Beowulf] IEEE 1588 (PTP) - a better cluster clock?

Andrew Shewmaker agshew at gmail.com
Tue Jul 24 15:51:19 PDT 2007

On 7/18/07, Patrick Ohly <patrick.ohly at intel.com> wrote:
>       * Have you heard of PTP and considered to use it in clusters?
>       * How would applications or clusters benefit from a better
>         cluster-wide clock?
>       * What obstacles did or could prevent using PTP(d) for that
>         purpose?

I wasn't aware of PTPd, and neither was my team leader Josip Loncaric.
Now that I am, I'll try it out and compare it to a solution Josip came up
with a while ago.  He was satisfied with NTP at one time, but started
writing BTime (http://btime.sf.net) in 2004.  He listed some of his reasons
in a talk at SC2005 (slides are in the BTime tarball):

Benefits of precision global timekeeping
 - Rapid/frequent performance measurements of parallel applications
    · Local gettimeofday() better than communicating with a global clock
 - Synchronous system activity possible without communication
    · Local timer triggered events could be globally synchronous
    · Potential for reducing the impact of system noise

BTime synchronizes client clocks to server broadcasts (not multicast),
and uses a kernel module to provide more precise time-relevant data.
The current version of BTime works with Linux kernels 2.6.13 through
2.6.17, but Josip hasn't had time to port it to the new clocksource
infrastructure of newer kernels.
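To make the broadcast scheme concrete, here is a minimal Python sketch of a server sending its current time to the broadcast address and a client computing a raw offset on receipt. The wire format (one 64-bit big-endian microsecond count), the port number and the function names are all hypothetical, not BTime's. Because it timestamps in user space rather than in a kernel module, it will also see more noise than BTime does.

```python
import socket
import struct
import time

# Hypothetical wire format: server send time as an unsigned 64-bit
# big-endian count of microseconds since the Unix epoch (not BTime's format).
PACKET = struct.Struct("!Q")
PORT = 31337  # hypothetical port


def now_us() -> int:
    """Local wall-clock time in microseconds (user-space timestamp)."""
    return time.time_ns() // 1000


def make_packet(t_us: int) -> bytes:
    return PACKET.pack(t_us)


def parse_packet(data: bytes) -> int:
    return PACKET.unpack(data)[0]


def send_once(broadcast_addr: str = "255.255.255.255", port: int = PORT) -> None:
    """Server side: broadcast one timestamp (one small UDP packet)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(make_packet(now_us()), (broadcast_addr, port))


def receive_raw_offset(port: int = PORT, timeout: float = 2.0) -> int:
    """Client side: return (client receive time - server send time) in us.

    This raw value is the true clock offset plus the one-way delay.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", port))
        s.settimeout(timeout)
        data, _addr = s.recvfrom(PACKET.size)
        t_recv = now_us()
        return t_recv - parse_packet(data)
```

The raw offset always includes the network delay, which is why the filtering and delay compensation discussed in the README are needed.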

More details from the README:

TUNING: BTime assumes that a certain fraction of timestamps will make it with
minimal delays, and that those minimal delays are exponentially distributed.
Over a high performance local network using UDP protocol, this characteristic
noise is empirically about 3 microseconds (it would be about 10 microseconds
for TCP), but if the network path has several hops, timing noise could be
higher (e.g. 25 us UDP).  BTW, BTime adaptively estimates the probability of
receiving timestamps without extra delays; but it currently requires a fixed
timing noise estimate.
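One simple way to exploit the assumption that some fraction of timestamps arrive with minimal delay is to keep the minimum raw offset over a sliding window. Extra delays can only inflate the raw offset, so the minimum tracks the true offset plus the base delay. This is a hypothetical stand-in, not BTime's adaptive estimator. The window size, the 40 us base delay, the 30% rate of delayed packets and the 3 us exponential noise are assumptions taken loosely from the numbers above.

```python
import random
from collections import deque


class MinDelayFilter:
    """Sliding-window minimum of raw offsets (client receive time minus
    server send time). Hypothetical illustration, not BTime's estimator."""

    def __init__(self, window: int = 16):
        self.samples = deque(maxlen=window)

    def update(self, raw_offset_us: float) -> float:
        self.samples.append(raw_offset_us)
        # Delays only add to the raw offset, so the smallest recent sample
        # is the one closest to (true offset + base delay).
        return min(self.samples)


def simulate(true_offset_us=500.0, base_delay_us=40.0, noise_us=3.0,
             p_delayed=0.3, n=64, seed=1):
    rng = random.Random(seed)
    f = MinDelayFilter()
    estimate = None
    for _ in range(n):
        # Assumed model: exponential noise on top of a fixed base delay...
        delay = base_delay_us + rng.expovariate(1.0 / noise_us)
        # ...plus occasional large extra delays (e.g. queueing).
        if rng.random() < p_delayed:
            delay += rng.uniform(20.0, 200.0)
        estimate = f.update(true_offset_us + delay)
    return estimate


print(f"estimated raw offset: {simulate():.2f} us")
```

The printed value should sit slightly above 540 us (500 us of clock offset plus the 40 us base delay). Removing that leftover base delay is exactly the broadcast delay compensation problem described next.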

TO DO: broadcast delay compensation...  for now, BTime synchronizes
all clients to server time minus uncompensated broadcast delay B.  This delay
(about 35-50 microseconds) can be measured more precisely to improve
compensation. Otherwise, the server will remain B microseconds ahead
of all clients, which will be synchronized with each other.  BTime
currently applies the same compensation constant at each client.
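The arithmetic behind the uncompensated delay B can be checked with a few lines. If each client's minimum raw offset is (client - server) + B, subtracting it from the local clock lands every client on server time minus B. Clients therefore agree with each other but trail the server until B is added back. B = 40 us is an assumed value within the quoted 35-50 us range, and `client_time` is a hypothetical helper.

```python
B_US = 40  # assumed broadcast delay in microseconds


def client_time(local_us, min_raw_offset_us, compensation_us=0):
    """Corrected client clock reading.

    min_raw_offset_us is assumed to equal (client - server) + B, i.e. the
    smallest observed receive-minus-send difference.
    """
    return local_us - min_raw_offset_us + compensation_us


server_now = 1_000_000
for skew in (500, -250):      # two clients with different clock errors
    local = server_now + skew
    raw = skew + B_US         # what a minimum-delay filter would report
    print(client_time(local, raw), client_time(local, raw, B_US))
# Each line printed: 999960 1000000
```

Without compensation both clients read 999960 (synchronized with each other, B behind the server). Applying the same constant B at each client shifts them all onto server time, which is why a more precise measurement of B matters.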

Quality of this synchronization depends on the consistency in the minimum
broadcast time, but client clocks usually remain within 10 microseconds of
each other.  Even over a noisy 4-hop network, 10 us tolerance was reached
with 99.75% confidence in my tests at 1 timestamp per second.

Finally, btime-server is set to send a small UDP packet once per second.  This
imposes very low overhead (<0.001% of CPU and network) but the interval could
be increased up to 30 seconds or so, at the expense of increasing the width of
the confidence interval.  In my tests (using TCP with about 10 us timing noise
over private GigE), client clocks track the median offset within about

  10 microseconds * sqrt(seconds between good timestamps)

as the 99.9% confidence interval.  This assumes that the master clock is
synchronized to wall clock time once per day; but if NTP is running, it
applies >100 times larger adjustments roughly every 17 minutes, and while
BTime quickly compensates, these transients reduce the confidence of staying
within the same tracking interval.
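As a worked example of the tracking-interval rule of thumb, the following evaluates 10 us * sqrt(interval) for a few intervals between good timestamps. The 10 us coefficient is the empirical TCP-over-GigE figure quoted above, not a general constant. The interval counts only timestamps that arrived with minimal delay, so with delayed or lost packets the effective interval is longer than the broadcast period.

```python
import math


def tracking_bound_us(interval_s: float, noise_us: float = 10.0) -> float:
    """99.9% tracking interval from the README's rule of thumb:
    noise * sqrt(seconds between good timestamps)."""
    return noise_us * math.sqrt(interval_s)


for interval in (1, 5, 10, 30):
    print(f"{interval:>2} s -> {tracking_bound_us(interval):.1f} us")
# prints 10.0, 22.4, 31.6 and 54.8 us respectively
```

So stretching the broadcast period from 1 to 30 seconds widens the interval roughly 5.5-fold, from about 10 us to about 55 us.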

Andrew Shewmaker
