[Beowulf] IEEE 1588 (PTP) - a better cluster clock?

Wed Jul 18 08:35:36 PDT 2007

Hello folks,

I would like to take this opportunity to discuss with a broader audience
one of the long-standing problems in building Beowulf clusters and to
present a potential solution for it: using NTP to synchronize the system
times of different nodes just doesn't cut it. To be fair, it was never
designed to achieve the accuracy that you would like to have in a cluster.

For clusters it's time to replace it with a solution that works better
in a LAN. The good news is that there is already one: the Precision Time
Protocol (PTP) IEEE 1588 standard [1]. David Lombard and I have
investigated the applicability of that standard and its existing open
source implementation PTPd [2] for HPC clusters; so far it looks very
promising.

[1] http://ieee1588.nist.gov/
[2] http://ptpd.sourceforge.net/

Those of you who were at this year's LCI conference might remember my
speaker's corner presentation and the ensuing discussions that we had
there. For those who were not there, let me summarize:
     1. NTP only achieves an accuracy of ~1ms, which is several orders
        of magnitude higher than the latency of messages that one might
        want to measure, so the measuring error for individual events is
        way too high.
     2. The frequency of clocks in typical hardware varies in a
        non-linear way that linear clock correction methods cannot
        compensate for; as one of the main developers of an MPI tracing
        library (Vampirtrace/Intel Trace Collector) I have struggled
        with that for a long time. There is a non-linear clock
        correction in the tracing library which helps, but it only works
        for applications that regularly invoke an API call - not very
        user-friendly.
     3. PTP as implemented by PTPd works in user space, requires no
        special hardware and achieves +/-1us accuracy; this is from a
        real measurement with NTP running on a head node and PTPd
        running on two compute nodes.
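
To illustrate point 2 with made-up numbers (this is not a measurement):
the sketch below models a clock offset with a small quadratic drift term
and corrects it by interpolating linearly between an offset taken at the
start and one taken at the end of a run. The function names, the skew of
1e-6 and the drift of 1e-9 s/s^2 are assumptions chosen only for the
example.

```python
def true_offset(t, skew=1e-6, drift=1e-9):
    """Hypothetical clock offset (seconds) after t seconds of run time.

    skew  : constant frequency error (s/s), handled well by linear correction
    drift : change of the frequency error over time (s/s^2), the non-linear part
    Both values are made up for illustration.
    """
    return skew * t + drift * t * t


def linear_correction(t, t_start, t_end, off_start, off_end):
    """Offset predicted by interpolating linearly between two measurements."""
    frac = (t - t_start) / (t_end - t_start)
    return off_start + frac * (off_end - off_start)


def residual_at(t, t_end=100.0):
    """Error left over after linear correction, for a run of t_end seconds."""
    predicted = linear_correction(
        t, 0.0, t_end, true_offset(0.0), true_offset(t_end)
    )
    return true_offset(t) - predicted


if __name__ == "__main__":
    for t in (0.0, 25.0, 50.0, 75.0, 100.0):
        print(f"t={t:6.1f}s  residual={residual_at(t) * 1e6:+.3f} us")
```

The correction is exact at both measurement points, but in between the
quadratic term is left over: at the midpoint of a run of length T its
magnitude is drift*T^2/4, i.e. 2.5us for these example values. That
already exceeds the +/-1us mentioned in point 3.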

At LCI we asked the audience a few questions to get a feeling for what
the community was thinking about the problem and I'd like to repeat the
questions here:
      * Have you heard of PTP and considered using it in clusters?
      * How would applications or clusters benefit from a better
        cluster-wide clock?
      * What obstacles did or could prevent using PTP(d) for that
        purpose?

It turned out that no-one had heard of PTP before; so my hope is that it
is new to you also and I'm not boring you to death with something that
is old hat...

Not having an accurate cluster clock was considered annoying by most
people. Comparing log files from different nodes and performance
analysis obviously benefit from higher accuracy, whereas applications
typically do not depend on it. This is perhaps a chicken-and-egg
problem: because timing is inaccurate, no-one considers algorithms which
depend on it, so no-one builds clusters which have a better clock
because the demand is not there.

Regarding obstacles we had a very good discussion with Don Becker which
went into the details of how PTPd implements PTP: an essential part of
PTP is the time stamping of multicast packets as close as possible to
the actual point in time when they go onto the wire or are received by a
host. In PTPd this is done by asking the IP stack to time stamp incoming
packets in the kernel using a standard interface for that. Outgoing
packets are looped back and the "incoming" time stamp on the loopback
device is assumed to be close to the point in time when the same packet
also left the node. This assumption can be off by a considerable delta
and, worse, the delta can vary due to other traffic and queues in the
Ethernet hardware/drivers. This is expected to lead to noise in the
measurements and reduced accuracy.
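
To make the effect of such a time stamp error concrete, here is a small
sketch. It uses the usual two-way offset/delay calculation from four
time stamps (t1: master sends, t2: client receives, t3: client sends,
t4: master receives) and assumes a symmetric network delay. The
simulation around it, including the function names and all numbers, is
hypothetical and is not taken from PTPd.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Two-way time transfer, assuming the same delay in both directions.

    t1: master send time (master clock), t2: client receive time (client clock)
    t3: client send time (client clock), t4: master receive time (master clock)
    Returns (client clock minus master clock, one-way delay).
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


def simulate_exchange(true_offset, one_way_delay, master_tx_error, client_tx_error):
    """Simulate one exchange; all values are in nanoseconds.

    *_tx_error is the (hypothetical) time a packet spends in queues between
    its loopback time stamp and the moment it really goes onto the wire.
    """
    t1 = 0  # send time as recorded by the master
    # The packet leaves late by master_tx_error and arrives one_way_delay later;
    # the client reads its own clock, which is ahead by true_offset.
    t2 = t1 + master_tx_error + one_way_delay + true_offset
    t3 = t2 + 1_000_000  # client replies 1 ms later (recorded send time)
    # Convert to master time, then add the client's queueing and the wire delay.
    t4 = (t3 - true_offset) + client_tx_error + one_way_delay
    return ptp_offset_and_delay(t1, t2, t3, t4)


if __name__ == "__main__":
    for errors in ((0, 0), (10_000, 0), (10_000, 10_000)):
        offset, delay = simulate_exchange(5_000, 50_000, *errors)
        print(f"tx errors {errors}: offset={offset:.0f} ns, delay={delay:.0f} ns")
```

In this model a send-side error that is the same in both directions only
inflates the computed delay, whereas any difference between the two
errors shows up in the computed offset at half its size. So it is mainly
the variation of the delta that hurts the accuracy.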

Another problem with PTP is scaling: all communication in the original
2002 revision of the standard (which PTPd is currently based on) uses
multicasts, even for the client to master communication which could be
done with point-to-point communication. This was apparently a deliberate
decision to simplify the implementation of the standard in embedded
devices. The effect is that every PTP packet sent in a subnet is
received by all nodes and wakes up the PTPd daemon on each of them, even
if that daemon just discards the packet. On the bright side one can
extrapolate from the frequency of these packets (master to clients: two
packets every two seconds; each client to master: two packets at random
intervals of 4 to 60 seconds) that the packet rate is still <1000/s for
10000 nodes - this should be fairly scalable.
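
The extrapolation can be reproduced with a few lines. This is only a
back-of-the-envelope sketch: it assumes that the random client intervals
are uniformly distributed between 4 and 60 seconds (so 32 seconds on
average) and it counts all 10000 nodes as clients. The function and
parameter names are made up for the example.

```python
def ptp_packets_per_second(
    clients,
    master_packets=2,
    master_interval_s=2.0,
    client_packets=2,
    client_min_s=4.0,
    client_max_s=60.0,
):
    """Average number of PTP multicast packets per second in one subnet.

    Assumes client intervals are uniformly distributed between
    client_min_s and client_max_s, so their mean is the midpoint.
    """
    master_rate = master_packets / master_interval_s
    mean_client_interval = (client_min_s + client_max_s) / 2
    client_rate = clients * client_packets / mean_client_interval
    return master_rate + client_rate


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(f"{n:6d} clients: {ptp_packets_per_second(n):8.1f} packets/s")
```

Under these assumptions 10000 clients generate about 626 packets per
second on average. Because everything is multicast, that is also the
rate at which each node's daemon gets woken up.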

There are obvious ideas for avoiding some of these shortcomings (using
point-to-point communication for client->master packets; putting the PTP
implementation into the kernel) but it's not clear who will pick up that
work and what the benefit would be. The PTPd source is available and
although I don't know for sure, I'd expect that the author would be
happy to accept patches. I mailed him today and told him that we are
looking at PTPd in an HPC context.

There are several next actions that I can imagine besides a general
discussion of this technology:
      * PTP has not been tested at scale yet. I wrote an MPI-based
        benchmark program which continuously measures clock offsets
        between all nodes and would be happy to assist anyone who wants
        to try out PTPd on a larger cluster.
      * The author of PTPd is working towards a first stable 1.0
        release. Testing the release candidate might help to get 1.0 out
        and clear the way for including future patches. It might also
        provide more insights about which kind of systems it works or
        doesn't work on.
      * If someone has the time, there are some gaps in PTPd which could
        be filled: it has no init scripts yet; there is a more recent
        IEEE 1588 specification that might address some of the issues
        outlined above; etc.
      * Ideally PTPd should come pre-packaged by Linux distributions or
        at least be added to HPC installations.

Regarding the last point it is worth noting that there are patents on
some of the technology. This probably has to be sorted out before
redistributing binaries will be considered by Linux distributions. The
patents can be found when looking for IEC 61588, which is the same as
IEEE 1588:
      * http://www.iec.ch/tctools/patent_decl.htm
      * http://www.iec.ch/tctools/patents/agilent.htm

Please note that I am not speaking for Intel or any of the involved
companies in this matter and in particular when it comes to patents I
cannot provide any advice.

That's all for now (and probably enough stuff, too ... although perhaps
you prefer detailed emails over bullet items on a PowerPoint
presentation). So what do you think?

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.