[Beowulf] Re: dual core (latency)

Mon Jul 18 07:38:48 PDT 2005

I've been toying with numactl on a dual-core system and it doesn't
help at all: the improvement is 0.00.

System: Ubuntu on a quad Opteron dual-core 1.8 GHz, 2.6.10-5 SMP kernel.

Latencies as measured by my own program (TLB-thrashing reads of 8 bytes,
a 250 MB buffer per CPU):

#cpu latency
1   144-147 ns
2   174 ns
4   206 ns
8   234 ns

That single-CPU figure is pretty ugly, if I may say so.
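
For readers who don't want to dig through the attachment, the measurement idea can be sketched roughly as follows. This is a minimal single-process sketch under stated assumptions, not the attached program: it swaps in a xorshift generator for RanRot, malloc for shared memory, and clock_gettime for timing, and the names ns_per_iteration and measure_random_reads are made up for illustration. What it keeps is the principle: time a loop that reads 8 bytes at random indices of a large buffer, time the same loop without the read, and subtract the two.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* clock_gettime() under strict -std=c99 */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* xorshift64: stand-in generator for this sketch, NOT the RanRot code of rasml.c */
static uint64_t rng_state = 0x92930cb295f24dabULL;
/* volatile sink so the compiler cannot discard the timed loops */
static volatile uint64_t global_sink;

static uint64_t next_random(void) {
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 7;
  rng_state ^= rng_state << 17;
  return rng_state;
}

static double now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost in ns of one loop iteration. With buf == NULL only the RNG and
 * the modulo are timed; otherwise every iteration also reads 8 bytes at a
 * random index of buf. */
static double ns_per_iteration(const uint64_t *buf, uint64_t nentries, uint64_t n) {
  uint64_t acc = 0, i;
  double t1 = now_ns(), t2;
  for (i = 0; i < n; i++) {
    uint64_t index = next_random() % nentries;
    acc ^= buf ? buf[index] : index;
  }
  global_sink = acc;  /* force the loop result to be computed before stopping the clock */
  t2 = now_ns();
  return (t2 - t1) / (double)n;
}

/* Returns 0 on success, -1 on bad arguments or allocation failure.
 * *loop_ns is the RNG-only cost, *read_ns the cost including the read;
 * read_ns - loop_ns approximates the latency of one random 8-byte read. */
int measure_random_reads(uint64_t nbytes, uint64_t n, double *loop_ns, double *read_ns) {
  uint64_t nentries = nbytes / sizeof(uint64_t), i;
  uint64_t *buf;
  if (nentries == 0 || n == 0)
    return -1;
  buf = malloc(nentries * sizeof *buf);
  if (buf == NULL)
    return -1;
  for (i = 0; i < nentries; i++)  /* touch every page before timing */
    buf[i] = i;
  *loop_ns = ns_per_iteration(NULL, nentries, n);
  *read_ns = ns_per_iteration(buf, nentries, n);
  free(buf);
  return 0;
}
```

Calling measure_random_reads(250u * 1024 * 1024, 50000000, &loop_ns, &read_ns) and printing read_ns - loop_ns gives a single-process figure in the spirit of the first row of the table. It will not match the attached program exactly, since the generator, the allocation and the timer all differ.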

All kinds of NUMA calls didn't help at all. I've tried, for example:

  if(numa_available() < 0 ) {
    setitnuma = 0;
  }  
  else {
    int i,back;
    nodemask_t nt,n2,rnm;
    maxnodes = numa_max_node()+1; // numa_max_node() returns 3 when there are 4 memory controllers
    printf("numa=%i maxnodes=%i\n",setitnuma,maxnodes);      

    nt = numa_get_interleave_mask();
    for( i = 0 ; i < maxnodes ; i++ ) {
      printf("node = %i mask = %i\n",i,nt.n[i]);
      nt.n[i] = 0;
      n2.n[i] = 0;
    }
    numa_set_interleave_mask(&nt);        
    nt = numa_get_interleave_mask();
    for( i = 0 ; i < maxnodes ; i++ ) 
      printf("checking memory interleave node = %i mask = %i\n",i,nt.n[i]);

    rnm = numa_get_run_node_mask();
    printf("numa get run node mask = %lu\n",rnm.n[0]);
    back = numa_run_on_node(0);
    if( !back ) 
      printf("set to run on node 0\n");
    else 
      printf("failed to set run on node 0\n");

  }

Whatever I try, single-CPU latency stays at 144-147 ns.
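
One thing worth checking in a snippet like the one above is what nt.n[i] refers to. Assuming the libnuma 1.x layout, nodemask_t is a bitmask: n[] is an array of unsigned long words, and node k is bit k % (bits per long) of word k / (bits per long). Under that assumption n[i] is not "the entry for node i". The sketch below uses a stand-in struct so that it compiles without libnuma; demo_nodemask, DEMO_NUM_NODES and the demo_* helpers are hypothetical names for illustration only, not the library's code.

```c
#include <limits.h>
#include <string.h>

#define DEMO_NUM_NODES 128  /* assumption: stand-in for libnuma's NUMA_NUM_NODES */
#define DEMO_BITS (sizeof(unsigned long) * CHAR_BIT)

/* stand-in with the same shape as the assumed libnuma 1.x nodemask_t */
typedef struct {
  unsigned long n[DEMO_NUM_NODES / (sizeof(unsigned long) * CHAR_BIT)];
} demo_nodemask;

/* clear every node bit */
void demo_mask_clear(demo_nodemask *m) {
  memset(m, 0, sizeof *m);
}

/* set the bit that belongs to 'node' (0 <= node < DEMO_NUM_NODES) */
void demo_mask_set(demo_nodemask *m, int node) {
  m->n[node / DEMO_BITS] |= 1UL << (node % DEMO_BITS);
}

/* returns 1 when the bit of 'node' is set, 0 otherwise */
int demo_mask_isset(const demo_nodemask *m, int node) {
  return (int)((m->n[node / DEMO_BITS] >> (node % DEMO_BITS)) & 1UL);
}

/* number of nodes set among nodes 0 .. maxnodes-1 */
int demo_mask_count(const demo_nodemask *m, int maxnodes) {
  int node, count = 0;
  for (node = 0; node < maxnodes; node++)
    count += demo_mask_isset(m, node);
  return count;
}
```

With this layout, nodes 0 and 3 of a four-node machine both live in n[0] (which then holds 9). A loop that touches n[1] to n[3] looks at unrelated words and, depending on the array size, may index past the end of the array.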

A dual Opteron dual-core system with 2.2 GHz dual-core processors shows similar
latencies: 200 ns, for example, when running 4 processes with the same
test program.

This single-CPU latency behaviour of the dual-core Opteron is ugly
compared to other dual Opterons which are not dual core.

A nearly identical Tyan mainboard with dual Opteron 2.2 GHz gives, for a single CPU
with the SAME kernel and the SAME program, 115 ns latency. With ECC turned off on
that dual Opteron it even gets down to 113 ns.

The frustrating thing is that the dual Opteron 2.2 GHz has PC2700,
whereas the quad Opteron dual core has all banks filled
with PC3200 registered RAM, A-brand.

Vincent

At 07:29 PM 7/14/2005 +0400, Mikhail Kuzminsky wrote:
>In message from "Kozin, I \(Igor\)" <i.kozin at dl.ac.uk> (Thu, 14 Jul 
>2005 11:25:12 +0100):
>>> But now for 4cores/2CPUs per Opteron node to force the using of
>>> only 2 cores (from 4), by 1 for each chip, we'll need to have
>>> cpu affinity support in Linux.
>>
>>Mikhail,
>>you can use "taskset" for that purpose. 
>>For example, (perhaps not in the most elegant form)
>>	mpiexec  -n 1 taskset -c 0 $code : -n 1 taskset -c 2 $code
>>But I doubt you want to let the idle cores to do something else 
>>in the mean time. However small you will generally see an increase 
>>in performance if you use all the cores.
>
>Thanks !
>AFAIK, taskset isn't a part of numactl package. Sorry, where is 
>possible to download taskset ?
>
>What is about using of all the cores, you are, of course, in
>general right ;-)
>But there may be some cases where using of "pairs of cores" on Opteron
>is bad under parallelization. For example, test178/RHF on the same
>G03 w/Linda, gave more worse performance on 2 cores than for 1 core
>(because of extra memory traffic).
>And, again at least theoretically, I may "occupy" 2 (free) of total
>4 cores in 2-chip server for example w/some independed cache-friendly 
>tasks.
>
>Yours
>Mikhail  
>>
>>Best,
>>Igor
>>
>>
>>I. Kozin  (i.kozin at dl.ac.uk)
>>CCLRC Daresbury Laboratory
>>tel: 01925 603308
>>http://www.cse.clrc.ac.uk/disco
>>Distributed Computing Forum
>>http://www.cse.clrc.ac.uk/disco/forums/ubbthreads.php?Cat=0
>>
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit 
>>http://www.beowulf.org/mailman/listinfo/beowulf
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
 *
 * This program rasml.c measures the Random Average Shared Memory Latency (RASML)
 * Thanks to Agner Fog for his excellent random number generator.
 *
 * This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
 *
 * Created by Vincent Diepeveen who hereby releases this under GPL
 * Feel free to look at the FSF (free software foundation) for what
 * GPL is and its conditions.
 *
 * Please don't confuse the times achieved here with two times the one
 * way pingpong latency, though on ideally scaling supercomputers/clusters
 * they will be close. There are a few differences:
 *    a) this is TLB thrashing
 *    b) this test tests ALL processors at the same time and not
 *       just 2 cpu's while the rest of the entire cluster is idle.
 *    c) this test ships 8 bytes whereas one way pingpong typically also
 *       gets used to test sizes of several kilobytes, or just returns a pong.
 *    d) this doesn't use MPI but shared memory, and the way such protocols are
 *       implemented possibly matters for latency.
 *
 * Vincent Diepeveen                 diep at xs4all.nl
 * Veenendaal, The Netherlands       10 june 2003
 *
 * First a few lines about the random number generator. Note that I modified Agner Fog's
 * RanRot very slightly. Basically its initialization has been done better and some dead
 * slow FPU code rewritten to fast 64 bits integer code.
 */

#define UNIX 1  /* set to 1 when you are under unix or using gcc or a look-alike compiler */
#define IRIX 1  /* this value only matters when UNIX is set to 1. For Linux set it to 0.
                 * Allocating shared memory in linux is, in my opinion, handled in a
                 * pretty buggy way by its kernel.
                 *
                 * Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
                 * and see for yourself how much shared memory YOU can allocate in linux.
                 *
                 * If that is not enough to run this benchmark then try modifying it with:
                 *    echo <newsize> > /proc/sys/kernel/shmmax
                 * You must be root to do that, and must redo it each time the system boots.
                 */
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.

#if UNIX
  #include <pthread.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>
  #include <sys/times.h>
  #include <sys/time.h>
  #include <unistd.h>
#else
  #include <windows.h>
  #include <winbase.h> // for GetTickCount()
  #include <process.h> // _spawnl
#endif

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define SWITCHTIME      60000   /* in milliseconds. Modify this to let a test run longer or shorter.
                                 * Basically it is a good idea to use about the number of cpus times
                                 * a thousand for this. 30 seconds is fine for PCs, but a very
                                 * bad idea for supercomputers. I recommend several minutes
                                 * there, and at least a few hours for big supers if the partition
                                 * isn't started yet. If the partition is started, starting it at
                                 * 460 processors (SGI) should take 10 minutes; otherwise it takes
                                 * 3 hours to attach all.
                                 * Of course that lets a test take way, way longer.
                                 */
#define MAXPROCESSES     512    /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH  128    /* cache line length at the machine. Modify this if you want to */

#if UNIX
  #include <time.h>
 // #include <memory.h>
  #define FORCEINLINE       __inline
  /* UNIX and such this is 64 bits unsigned variable: */
  #define BITBOARD                     unsigned long long
#else
  #define FORCEINLINE       __forceinline
  /* in WINDOWS we also want to be 64 bits: */
  #define BITBOARD                     unsigned _int64
#endif

#define     STATUS_NOTSTARTED    0
#define     STATUS_ATTACH        1
#define     STATUS_GOATTACH      2
#define     STATUS_ATTACHED      3
#define     STATUS_STARTREAD     4
#define     STATUS_READ          5
#define     STATUS_MEASUREREAD   6
#define     STATUS_MEASUREDREAD  7

#define     STATUS_QUIT         10

struct ProcessState {
  volatile int status; /* holds one of the STATUS_* values defined above:
                        *  0  = not started yet (STATUS_NOTSTARTED)
                        *  ...
                        *  10 = quit (STATUS_QUIT)
                        */

  /* now the numbers each cpu gathers. The name of the first number is what
   * cpu0 is doing and the second name what all the other cpu's were doing at that
   * time
   */
  volatile BITBOARD readread; /* average number of random reads per second measured by this process */
  char dummycacheline[CACHELINELENGTH];
};

typedef struct {
  BITBOARD nentries; // number of entries of 64 bits used for cache.
  struct ProcessState ps[MAXPROCESSES];
} GlobalTree;

void     RanrotAInit(void);
float    ToNano(BITBOARD);
int      GetClock(void);
float    TimeRandom(void);

void     ParseBuffer(BITBOARD);
void     ClearHash(void);
void     DeAllocate(void);
int      DoNrng(BITBOARD);
int      DoNreads(BITBOARD);
int      DoNreadwrites(BITBOARD);
//void     TestLatency(float);
int      AllocateTree(void);
void     InitTree(int);
void     WaitForStatus(int,int);
void     PutStatus(int,int);
int      CheckStatus(int,int);
int      CheckAllStatus(int,int);
void     Slapen(int);
float    LoopRandom(void);

/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK  17
#define JJ  10
#define R1   5
#define R2   3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
 0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
 0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
 0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
 0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
 /*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2;          /* indexes into history buffer */

/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
    cpus;  // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];

 /******************************************************** AgF 1999-03-03 *
 *  Random Number generator 'RANROT' type B                               *
 *  by Agner Fog                                                          *
 *                                                                        *
 *  This is a lagged-Fibonacci type of random number generator with       *
 *  rotation of bits.  The algorithm is:                                  *
 *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b               *
 *                                                                        *
 *  The last k values of X are stored in a circular buffer named          *
 *  randbuffer.                                                           *
 *                                                                        *
 *  This version works with any integer size: 16, 32, 64 bits etc.        *
 *  The integers must be unsigned. The resolution depends on the integer  *
 *  size.                                                                 *
 *                                                                        *
 *  Note that the function RanrotAInit must be called before the first    *
 *  call to RanrotA or iRanrotA                                           *
 *                                                                        *
 *  The theory of the RANROT type of generators is described at           *
 *  www.agner.org/random/ranrot.htm                                       *
 *                                                                        *
 *************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0)
    r_p1 = KK - 1;
  if( --r_p2 < 0 )
    r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator.      */
void RanrotAInit(void) {
  int i;

  /* one can fill the randbuffer with other values here */
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;

  /* randomize */
  for( i = 0; i < 300; i++ )
    (void)RanrotA();
}

/* Now the RASML code */
char *To64(BITBOARD x) {
  static char buf[256];
  char *sb;

  sb = &buf[0];
  #if UNIX
  sprintf(buf,"%llu",x);
  #else
  sprintf(buf,"%I64u",x);
  #endif
  return sb;
}

int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
 * to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
 * for linux or unix we can only guess. Too many experts there.
 */
  #if UNIX
  struct timeval timeval;
  struct timezone timezone;
  gettimeofday(&timeval, &timezone);
  return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
  #else
  return((int)GetTickCount());
  #endif
}

float ToNano(BITBOARD nps) {
  /* convert a rate (events per second) to nanoseconds per event.
   * NOTE THAT THERE ARE COMPILER BUGS SOMETIMES WITH OLD COMPILERS,
   * SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
   * NOT MODIFY THIS CODE */
  float tn;
  tn = 1000000000/(float)nps;
  return tn;
}

float TimeRandom(void) {
  /* timing the random number generator is very easy of course. Returns
   * number of random numbers a second that can get generated
   */
  BITBOARD bb=0,i,value,nps;
  float ns_rng;
  int t1,t2,took;

  printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
  printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
  printf("Please wait a few seconds.. "); fflush(stdout);
  value = 100000;
  took  = 0;
  while( took < 3000 ) {
    value <<= 2; //  x4
    t1 = GetClock();

    for( i = 0; i < value; i++ ) {
      bb ^= RanrotA();
    }
    t2 = GetClock();
    took = t2-t1;
  }

  nps = (1000*value)/(BITBOARD)took;

  #if UNIX
  printf("..took %i milliseconds to generate %llu numbers\n",took,value);
  printf("Speed of RNG = %llu numbers a second\n",nps);
  #else
  printf("..took %i milliseconds to generate %I64u numbers\n",took,value);
  printf("Speed of RNG = %I64u numbers a second\n",nps);
  #endif

  ns_rng = ToNano(nps);
  printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);

  return ns_rng;
}

void ParseBuffer(BITBOARD nbytes) {
  tree->nentries = nbytes/sizeof(BITBOARD);
  #if UNIX
  printf("Trying to allocate %llu entries. ",tree->nentries);
  printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
  #else
  printf("Trying to allocate %s entries. ",To64(tree->nentries));
  printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
  #endif
}

void ClearHash(void) {
  BITBOARD *hi,i,nentries = tree->nentries;
  /* clearing hashtable */
  printf("Clearing hashtable for processor %i\n",ProcessNumber);
  fflush(stdout);
  hi = hashtable[ProcessNumber];
  for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
    hi[i] = i;
}

void DeAllocate(void) {
  int i;
  #if UNIX
  shmctl(shm_tree,IPC_RMID,0);
  for( i = 0; i < cpus; i++ ) {
    shmctl(shm_hash[i],IPC_RMID,0);
  }
  #else
  UnmapViewOfFile(tree);
  for( i = 0; i < cpus; i++ ) {
    UnmapViewOfFile(hashtable[i]);
  }
  #endif
}

int DoNrng(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= (index+(BITBOARD)i2);
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;
  return(t2-t1);
}

int DoNreads(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2,ncpu;

  ncpu     = cpus;
  nents    = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD rani=RanrotA(),index=rani%nents;
    unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
    dummyres ^= hashtable[i2][index];
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

#if 0
int DoNreadwrites(BITBOARD n) {
  BITBOARD i=1,dummyres,nents;
  int t1,t2;

  nents = nentries; /* hopefully this gets into a register */
  dummyres = globaldummy;

  t1 = GetClock();
  do {
    BITBOARD index = RanrotA()%nents;
    dummyres ^= hashtable[index];
    hashtable[index] = dummyres;
  } while( i++ < n );
  t2 = GetClock();

  globaldummy = dummyres;

  return(t2-t1);
}

void TestLatency(float ns_rng) {
  BITBOARD n,nps_read,nps_rw,nps_rng;
  float ns,fns;
  int timetaken;

  printf("Doing random RNG test. Please wait..\n");
  n = 50000000; // 50 mln
  timetaken = DoNrng(n);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);

  /* READING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random read tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreads(n);
  nps_read = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_read);
  printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);

  /* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
  printf("Doing random readwrite tests single cpu. Please wait..\n");
  n = 100000000; // 100 mln
  timetaken = DoNreadwrites(n);
  nps_rw = (1000*n) / (BITBOARD)timetaken;
  ns  = ToNano(nps_rw);
  printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
  printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);

  printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif

int AllocateTree(void) { /* initialize the tree. returns 0 if error */
  #if UNIX
  shm_tree = shmget(
              ftok(".",'t'),
              sizeof(GlobalTree),IPC_CREAT|0777);
  if( shm_tree == -1 )
    return 0;
  tree = (GlobalTree *)shmat(shm_tree,0,0);
  if( tree == (GlobalTree *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  if( !ProcessNumber ) {
    HANDLE TreeFileMap;
    TreeFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0,
     (DWORD)sizeof(GlobalTree),"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  else { /* Slaves also try to attach to the tree */
    HANDLE TreeFileMap;
    TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
    if( TreeFileMap == NULL )
      return 0;
    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( tree == NULL )
      return 0;
  }
  #endif
  return 1;
}

int AttachAll(void) {
  #if UNIX
  #else
  HANDLE HashFileMap;
  #endif
  char hashname2[32] = {"RASM_Hash00"},hashname[32];
  int i,r;
  for( r = 0; r < cpus; r++ ) {
    i = ProcessNumber+r;
    i %= cpus;
    if( i == ProcessNumber )
      continue;
    #if UNIX
    shm_hash[i] = shmget(
              #if IRIX
              ftok(".",200+i),
              #else
              ftok(".",(char)i),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
    if( shm_hash[i] == -1 )
      return 0;
    hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
    if( hashtable[i] == (BITBOARD *)-1 )
      return 0;
    #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */

    strcpy(hashname,hashname2);
    hashname[9] += (i/10);
    hashname[10] += (i%10);

    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[i] == NULL )
      return 0;
    #endif
  }
  return 1;
}

int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
  char hashname[32] = {"RASM_Hash00"};
  #if UNIX
  shm_hash[ProcessNumber] = shmget(
              #if IRIX
              ftok(".",200+ProcessNumber),
              #else
              ftok(".",(char)ProcessNumber),
              #endif
              tree->nentries*8,IPC_CREAT|0777);
  if( shm_hash[ProcessNumber] == -1 )
    return 0;
  hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
  if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
    return 0;
  #else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
  //if( !ProcessNumber ) {
    HANDLE HashFileMap;

    hashname[9] += (ProcessNumber/10);
    hashname[10] += (ProcessNumber%10);

    HashFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0,
     (DWORD)tree->nentries*8,hashname);
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;
  //}
  //else { /* Slaves attach also try to attach to the tree */
 /*   HANDLE HashFileMap;
    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
    if( HashFileMap == NULL )
      return 0;
    hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
    if( hashtable[ProcessNumber] == NULL )
      return 0;*/
  //}
  #endif
  return 1;
}

int StartProcesses(int ncpus) {
  char buf[256];
  int i;
  /* returns 1 if ncpus-1 started ok */
  if( ncpus == 1 )
    return 1;

  for( i = 1 ; i < ncpus ; i++ ) {
    sprintf(buf,"%i_%i",i+1,ncpus);
    #if UNIX
    if( !fork() ) {
      execl(rasmexename,rasmexename,buf,NULL);
      _exit(1); /* only reached when execl failed; the child must not continue this loop */
    }
    #else
    (void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
     #endif
  }
  return 1;
}

void InitTree(int ncpus) {
  int i;

  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status   = STATUS_NOTSTARTED;
    tree->ps[i].readread = 0;
  }
}

void WaitForStatus(int ncpus,int waitforstate) {
  /* wait for all processors to have the same state */
  int i,badluck=1;

  while( badluck ) {
    badluck = 0;
    for( i = 0 ; i < ncpus ; i++ ) {
      if( tree->ps[i].status != waitforstate )
        badluck = 1;
    }
  }
}

void PutStatus(int ncpus,int statenew) {
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    tree->ps[i].status = statenew;
  }
}

int CheckStatus(int ncpus,int statenew) {
  /* returns false when not all cpu's are in the new state */
  int i;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != statenew )
      return 0;
  }
  return 1;
}

int CheckAllStatus(int ncpus,int status) {
  /* Tries with a single loop to determine whether the other cpu's also finished
   *
   * returns:
   *     true  ==> when all the processes have this status
   *     false ==> when 1 or more are still busy measuring
   */
  int i,badluck=1;
  for( i = 0 ; i < ncpus ; i++ ) {
    if( tree->ps[i].status != status ) {
      badluck = 0;
      break;
    }
  }
  return badluck;
}

void Slapen(int ms) {
  #if UNIX
  usleep(ms*1000); /* usleep() takes microseconds */
  #else
  Sleep(ms);     /* Sleep() takes milliseconds */
  #endif
}

float LoopRandom(void) {
  BITBOARD n,nps_rng;
  float fns;
  int timetaken;
  printf("Benchmarking random RNG test. Please wait..\n");
  n = 25000000; // doubled before the first run, so the first test uses 50 mln
  timetaken = 0;
  while( timetaken < 500 ) {
    n += n;
    timetaken = DoNrng(n);
  }
  printf("timetaken=%i\n",timetaken);
  nps_rng = (1000*n) / (BITBOARD)timetaken;
  fns  = ToNano(nps_rng);
  printf("Machine needs %f ns for RND loop\n",fns);
  return fns;
}

/* main: parses the arguments, sets up the shared memory and runs the latency test */
int main(int argc,char *argv[]) {
  /* allocate a big memory buffer parameter is in bytes.
   * don't hesitate to MODIFY this to how many gigabytes
   * you want to try.
   * The more the better i keep saying to myself.
   *
   * Note that under linux your maximum shared memory limit can be set with:
   *
   * echo <size> > /proc/sys/kernel/shmmax
   *
   * and under IRIX it is usually 80% of the total onboard RAM that can get allocated
   */

  BITBOARD nbytes,firstguess;
  float ns_rng,f_loop;
  int tottimes,t1,t2;

  if( argc <= 1 ) {
    printf("Latency test usage is: latency <buffer> <cpus>\n");
    printf("Where 'buffer' is the buffer in number of bytes to allocate PER PROCESSOR\n");
    printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
    return 1;
  }

  /* parse the input */
  nbytes = 0;
  cpus   = 1; // default

  if( strchr(argv[1],'_') == NULL ) { /* main startup process */
    int np = 0;
    #if UNIX
     #if FREEBSD
     nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
     #else
     nbytes = (BITBOARD)atoll(argv[1]);
     #endif
    #else
    nbytes = (BITBOARD)_atoi64(argv[1]);
    #endif

    printf("Welcome to RASM Latency!\n");
    printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");

    if( argc > 2 ) {
      cpus = 0;
      do {
        cpus *= 10;
        cpus += (int)(argv[2][np]-'1')+1;
        np++;
      } while( argv[2][np] >= '0' && argv[2][np] <= '9' );
    }
    //printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
    ProcessNumber = 0;

    /* check whether we are not getting out of bounds */
    if( cpus > MAXPROCESSES ) {
      printf("Error: Recompile with a bigger value for MAXPROCESSES. %i processors is too many\n",cpus);
      return 1;
    }

    /* find out the file name */
    #if UNIX
    strcpy(rasmexename,argv[0]);
    #else
    GetModuleFileName(NULL,rasmexename,2044);
    #endif
    printf("Stored in rasmexename = %s\n",rasmexename);
  }
  else { //   latency 2_452  ==>  means processor 2 out of 452.
    int np = 0;

    ProcessNumber = 0;
    do {
      ProcessNumber *= 10;
      ProcessNumber += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );

    ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]

    np++; // skip underscore

    cpus = 0;
    do {
      cpus *= 10;
      cpus += (argv[1][np]-'1')+1;      // n
      np++;
    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );
    //printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
  }

  /* first we setup the random number generator. */
  RanrotAInit();

  /* initialize shared memory tree; it gets used for communication between the processes */
  if( !AllocateTree() ) {
    printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
    return 1;
  }

  if( !ProcessNumber )
    ParseBuffer(nbytes);

  nentries = tree->nentries;

  /* Now some stuff only the Master has to do */
  if( !ProcessNumber ) {
    /* Master: now let's time the pseudo random generators speed in nanoseconds a call */
    ns_rng = TimeRandom();
    f_loop = LoopRandom();

    printf("Trying to Allocate Buffer\n");
    t1 = GetClock();
    if( !AllocateHash() ) {
      printf("Error: Could not allocate buffer!\n");
      return 1;
    }
    t2 = GetClock();
    printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
    ClearHash(); // local hash
    t1 = GetClock();
    printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);

    /* so now hashtable is setup and we know quite some stuff. So it is time to
     * start all other processes */
    InitTree(cpus);

    printf("Starting Other processes\n");
    t1 = GetClock();
    if( !StartProcesses(cpus) ) {
      printf("Error: Could not start processes\n");
      DeAllocate();
      return 1;
    }
    t2 = GetClock();
    printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();
  }
  else { /* all Slaves do this */
    if( !AllocateHash() ) {
      printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
      return 1;
    }
    ClearHash(); // local hash
  }

  tree->ps[ProcessNumber].status = STATUS_ATTACH;
  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACH);
    t2 = GetClock();
    printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
    t1 = GetClock();

    /* now we can continue with the next phase that is attaching all the segments */
    PutStatus(cpus,STATUS_GOATTACH);
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
      Slapen(500);
    }
  }

  if( !AttachAll() ) {
    printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
    return 1;
  }
  tree->ps[ProcessNumber].status = STATUS_ATTACHED;

  if( ! ProcessNumber ) {
    WaitForStatus(cpus,STATUS_ATTACHED);
    t2 = GetClock();
    printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
    PutStatus(cpus,STATUS_STARTREAD);
    printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
     (SWITCHTIME/1000),(SWITCHTIME%1000));
  }
  else {
    while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
      Slapen(500);
    }
  }

  tree->ps[ProcessNumber].status = STATUS_READ;

  firstguess = 200000;
  tottimes   = 0;
  for( ;; ) {
    int timetaken = 0;
    if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
      /* this really MEASURES the readread */
      BITBOARD ntried = 0,avnumber;
      int totaltime=0;
      while( totaltime < SWITCHTIME ) { /* measure for around SWITCHTIME milliseconds */
        totaltime += DoNreads(firstguess);
        ntried += firstguess;
      }
      /* now put the average number of readreads into the shared memory */
      avnumber = (ntried*1000) / (BITBOARD)totaltime;
      tree->ps[ProcessNumber].readread = avnumber;

      /* show that it is finished */
      tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;

      /* now keep doing the same thing until status gets modified */
      while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
        (void)DoNreads(firstguess);
        if( !ProcessNumber ) {
          if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
            PutStatus(cpus,STATUS_QUIT);
            break;
          }
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
      BITBOARD nextguess;
      /* now the software must try to determine how many reads a second are possible for that
       * process
       */
      //printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
      timetaken = DoNreads(firstguess);
      /* try to guess such that next test takes 1 second, or if test was too inaccurate
       * then double the number simply. also prevents divide by zero error ;)
       */
      if( timetaken < 400 )
        nextguess = firstguess*2;
      else
        nextguess = (firstguess*1000)/(BITBOARD)timetaken;
      firstguess = nextguess;
      if( !ProcessNumber ) {
        tottimes += timetaken;
        if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
          tottimes = 0;
          if( CheckStatus(cpus,STATUS_READ) ) {
            PutStatus(cpus,STATUS_MEASUREREAD);
          } /* waits another SWITCH time before starting to measure */
        }
      }
    }
    else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
      break;
  }

  /* now do the latency tests
   */
  //TestLatency(ns_rng);
  tree->ps[ProcessNumber].status = STATUS_QUIT;
  if( !ProcessNumber ) {
    BITBOARD averagereadread;
    int i;
    averagereadread = 0;
    WaitForStatus(cpus,STATUS_QUIT);
    printf("the raw output\n");
    for( i = 0; i < cpus ; i++ ) {
      BITBOARD tr=tree->ps[i].readread;
      averagereadread += tr;
      printf("%llu ",tr);
    }
    printf("\n");
    averagereadread /= (BITBOARD)cpus;
    printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
    printf("Now for the final calculation it gets compensated:\n");
    printf("  Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
  }

  DeAllocate();
  return 0;
}

/* EOF latencyC.c */