[Beowulf] precise synchronization of system clocks

Mon Sep 29 15:00:04 PDT 2008

On Sep 29, 2008, at 4:10 PM, Prentice Bisbal wrote:

> In the previous thread I instigated about running services in cluster
> nodes, there was some mentioning of precisely synchronizing the system
> clocks and this issue is also mentioned in this paper:
>
> "The Case of Missing Supercomputer Performance: Achieving Optimal
> Performance on the 8,192 processor ASCI Q" (Petrini, Kerbyson and Pakin)
> http://hpc.pnl.gov/people/fabrizio/papers/sc03_noise.pdf
>
> I've also read a few other papers on the topic, and it seems you need to
> sync the system clocks to ~1 uS. On top of that, I imagine you also need
> to synch the activities of each system so they all stop to do the same
> system-level tasks at the same time.
>
> The papers I read all mentioned different OSes, or at least specialized
> hardware. Can this level of synchronization be achieved in Linux on
> commodity hardware?  I imagine NTP doesn't have the resolution needed
> for this, and Don Becker has some strong feelings against NTP.

The SiCortex systems I work on are not commodity, but they do run  
Linux.  All the node chips in the machine are frequency locked to the  
same oscillator, so the core cycle counters (MIPS standard) advance at  
the same rate, but because the cores are released from reset at  
different times, they are not initially synchronized. We recently  
added a global clock synchronization step to booting the system by  
timestamping messages sent over an out-of-band channel of the  
interconnect. After some futzing around, we're able to synchronize all  
the cycle counters to within about 50 nanoseconds.  The timer  
interrupts then happen at the same counter values system wide, which  
naturally synchronizes most of the daemons that wake up.  I don't  
think we've gone to the trouble of gang scheduling them as well, which  
would also be a good idea.

We tried reducing the standard 1000 Hz timer interrupts to 100 Hz, but
a bunch of stuff in the IP network stack reacted badly, slowing down
IP communications.  We haven't tracked it all down yet.

As one would expect from the papers you cite, the clock
synchronization has had a very dramatic effect on large-scale
collectives: a 5800-rank, 8-byte allreduce is now down to 36
microseconds, where it was something like 170 microseconds before the
clock project.
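
To see why lining up the timer interrupts helps a collective, here is a toy model in Python. It is only a sketch under made-up assumptions, not a model of the SiCortex machine: the rank count, step time, tick cost and tick period are all hypothetical, and each step is treated as ending in a barrier that waits for the slowest rank.

```python
import random

def total_time(ranks, steps, work_us, tick_cost_us, period, synchronized, seed=0):
    """Total time for `steps` barrier-separated steps in a toy noise model.

    Every rank takes one timer tick every `period` steps. A step costs
    `work_us`, plus `tick_cost_us` if at least one rank takes its tick in
    that step, because the barrier waits for the slowest rank.
    """
    rng = random.Random(seed)
    if synchronized:
        phases = {0}  # every rank ticks in the same steps
    else:
        # each rank ticks at its own random phase within the period
        phases = {rng.randrange(period) for _ in range(ranks)}
    delayed_steps = sum(
        1 for step in range(steps)
        if any((step + phase) % period == 0 for phase in phases)
    )
    return steps * work_us + delayed_steps * tick_cost_us

if __name__ == "__main__":
    for sync in (True, False):
        t = total_time(ranks=1000, steps=100, work_us=30.0,
                       tick_cost_us=10.0, period=10, synchronized=sync)
        print("synchronized" if sync else "unsynchronized", t)
```

With synchronized ticks only one step in every `period` is delayed. With random phases and many ranks, nearly every step has some rank taking its tick, so nearly every step pays the cost.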

Since clusters built from commodity servers run on independent
oscillators, it is much harder to synchronize them.  NTP will do a
very good job estimating the relative frequencies, but all those
oscillators will drift independently with temperature and aging, so
you have to run NTP continually.
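
The arithmetic behind "run it continually" is simple: a relative frequency error of f parts per million opens up an offset of f microseconds every second. A small Python sketch follows; the function name and the ppm values in the loop are illustrative assumptions, not measurements of any particular hardware.

```python
def seconds_until_offset(budget_s, freq_error_ppm):
    """Time for two clocks to drift apart by budget_s, given their
    relative frequency error in parts per million."""
    if freq_error_ppm <= 0:
        raise ValueError("freq_error_ppm must be positive")
    return budget_s / (freq_error_ppm * 1e-6)

if __name__ == "__main__":
    # Hypothetical residual frequency errors, chosen only for illustration.
    for ppm in (0.01, 0.1, 1.0, 10.0):
        t = seconds_until_offset(1e-6, ppm)
        print(f"{ppm:6.2f} ppm -> 1 us of drift after {t:.3g} s")
```

So holding two free-running clocks within a microsecond for any length of time means either a very small residual frequency error or frequent resynchronization.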

However, the problem to solve, synchronizing local clocks with each
other, is different from the one NTP is intended to solve.  You don't
really care what the wall clock time is; you only care that all the
systems have the same time.

I've seen some other papers on the subject of using LAN timestamps to  
provide much more accurate local synchronization.  Here's one that
cites 10 microsecond results:

High-Precision Relative Clock Synchronization Using Time Stamp Counters
Guo-Song Tian; Yu-Chu Tian; Fidge, C.
Engineering of Complex Computer Systems, 2008. ICECCS 2008. 13th IEEE
International Conference on
March 31 2008-April 3 2008, Page(s): 69-78
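
For reference, the generic two-way timestamp exchange that this kind of relative synchronization builds on can be sketched in a few lines of Python. This is the textbook round-trip estimate, not the algorithm from the paper above, and the `Exchange` type and function names are made up for the example. It assumes the forward and return delays are roughly equal, and it keeps the exchange with the shortest round trip because that one saw the least queuing.

```python
from typing import Iterable, NamedTuple, Tuple

class Exchange(NamedTuple):
    t0: float  # request sent, read on the local clock
    t1: float  # request received, read on the remote clock
    t2: float  # reply sent, read on the remote clock
    t3: float  # reply received, read on the local clock

def round_trip(e: Exchange) -> float:
    """Time spent on the wire: total elapsed minus remote turnaround."""
    return (e.t3 - e.t0) - (e.t2 - e.t1)

def offset(e: Exchange) -> float:
    """Estimated remote-minus-local clock offset, assuming symmetric delays."""
    return ((e.t1 - e.t0) + (e.t2 - e.t3)) / 2.0

def best_offset(exchanges: Iterable[Exchange]) -> Tuple[float, float]:
    """Return (offset, round_trip) for the exchange with the shortest round trip."""
    best = min(exchanges, key=round_trip)
    return offset(best), round_trip(best)

if __name__ == "__main__":
    # Made-up timestamps in microseconds: the remote clock is 250 us ahead,
    # and the second exchange was delayed 40 us extra on the way out.
    samples = [Exchange(0.0, 255.0, 256.0, 11.0),
               Exchange(100.0, 395.0, 396.0, 151.0)]
    print(best_offset(samples))
```

The error of each estimate is half the difference between the forward and return delays, so it can never exceed half the round-trip time. That is why short, low-jitter LAN paths and timestamps taken close to the hardware matter so much.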

Incidentally, a good way to measure the effects of OS noise locally is
to write a program that reads the core cycle counter in a tight loop,  
and keeps statistics on the intervals between successive samples.  You  
can find out how often and for how long your OS is going out to lunch.
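
A minimal sketch of that measurement loop in Python. It swaps in `time.perf_counter_ns()` for the raw cycle counter, since Python has no portable way to read one, and the function names and the threshold are arbitrary choices for the example. The interpreter adds jitter of its own, so a C version reading the counter directly would resolve much shorter interruptions.

```python
import time

def sample_gaps(n_samples):
    """Read the clock n_samples times in a tight loop; return the gaps in ns."""
    now = time.perf_counter_ns
    gaps = []
    prev = now()
    for _ in range(n_samples - 1):
        cur = now()
        gaps.append(cur - prev)
        prev = cur
    return gaps

def summarize(gaps, threshold_ns):
    """Basic statistics, plus how many gaps exceeded threshold_ns and for how long."""
    ordered = sorted(gaps)
    long_gaps = [g for g in ordered if g > threshold_ns]
    return {
        "samples": len(ordered),
        "min_ns": ordered[0],
        "median_ns": ordered[len(ordered) // 2],
        "max_ns": ordered[-1],
        "long_gaps": len(long_gaps),
        "long_gap_total_ns": sum(long_gaps),
    }

if __name__ == "__main__":
    # 50 us is an arbitrary cutoff for "the OS went out to lunch".
    stats = summarize(sample_gaps(1_000_000), threshold_ns=50_000)
    for name, value in stats.items():
        print(f"{name:18s} {value}")
```

The minimum and median give the cost of the loop itself; the count and total of the long gaps show how often, and for how long, something else had the core.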

_larry