[Beowulf] Opteron memory rank limits with DDR-400
Vincent Diepeveen
diep at xs4all.nl
Wed Jul 27 14:12:54 PDT 2005
Quad Opteron, dual core, 1.8 GHz.
dmesg gives:
"AMD Opteron(tm) Processor 865 stepping 00"
All 16 banks are filled with 256 registered+ECC PC3200 memory.
How do I check what clock the memory runs at?
Latency timings as measured with 250 MB of RAM per CPU (so 2 GB in total
with 8 cores):
1 CPU  : 144-147 ns
2 CPUs : 174 ns
4 CPUs : 206 ns
8 CPUs : 234 ns
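As a rough way to read these numbers: if each figure is taken as the time one process spends per random read, the total number of random reads per second over all processes can be estimated from it. The sketch below is only that arithmetic. The helper names are made up for illustration and are not part of the attached program, and 145.5 ns is simply the midpoint of the 144-147 ns range.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the attached program:
 * estimated random reads per second summed over all processes,
 * assuming each process completes one read every `ns_per_read` ns. */
double aggregate_reads_per_sec(int processes, double ns_per_read)
{
    if (processes < 1 || ns_per_read <= 0.0)
        return 0.0;               /* reject meaningless input */
    return processes * (1e9 / ns_per_read);
}

/* Print the estimate for the four measurements quoted above. */
void print_estimates(void)
{
    const int    procs[]   = { 1, 2, 4, 8 };
    const double latency[] = { 145.5, 174.0, 206.0, 234.0 }; /* 145.5 = midpoint of 144-147 */
    int i;

    for (i = 0; i < 4; i++)
        printf("%d process(es): about %.1f million reads/s in total\n",
               procs[i], aggregate_reads_per_sec(procs[i], latency[i]) / 1e6);
}
```

Calling print_estimates() from a small main() prints one line per measurement. The point is only that the per-read latency rises with the process count while the estimated total number of reads per second still goes up.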
To test it with this program, do:
gcc -O2 -o lat latencylinux.c
./lat 250000000 // single CPU using 250 MB
./lat 250000000 2 // two processes using 500 MB in total
./lat 250000000 4 // four processes
./lat 250000000 8 // eight processes
etc.
Confirmed working up to 500 CPUs.
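Each process started by the commands above allocates its own System V shared memory segment of the given size, so on Linux that size has to fit under the shmmax limit mentioned in the attached source. A minimal sketch of such a check follows. The helper names are hypothetical and not part of the attached program, and it assumes the limit can be read as a plain number from /proc/sys/kernel/shmmax.

```c
#include <stdio.h>

/* Hypothetical helper: does one per-process buffer of `request` bytes
 * fit under a segment-size limit of `limit` bytes? */
int fits_in_limit(unsigned long long request, unsigned long long limit)
{
    return request > 0 && request <= limit;
}

/* Read a single unsigned number from a text file such as
 * /proc/sys/kernel/shmmax. Returns 1 on success, 0 on failure. */
int read_limit(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return 0;
    ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok;
}

/* Report whether the buffer size passed to ./lat fits. Returns 1 if it does. */
int check_buffer(unsigned long long request)
{
    unsigned long long shmmax;

    if (!read_limit("/proc/sys/kernel/shmmax", &shmmax)) {
        printf("could not read shmmax\n");
        return 0;
    }
    printf("shmmax = %llu bytes, requested %llu bytes per process\n",
           shmmax, request);
    return fits_in_limit(request, shmmax);
}
```

For the runs above one would call check_buffer(250000000ULL) before starting the test. If it returns 0, the limit has to be raised as root first.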
At 10:26 AM 7/27/2005 -0600, Josip Loncaric wrote:
>Hello,
>
>Can anyone confirm that Opteron processors Rev. E and later can operate
>four dual-rank 2GB memory modules (8 ranks total) at full DDR-400 speed?
>
>AMD used to recommend no more than 4 ranks of DDR-400 memory. See
>http://forums.amd.com/lofiversion/index.php/t39745.html where the
>relevant quote from AMD technical service reads:
>
>"AMD does recommend to downclock the memory of the system to 333MHz,
>if more than 4 ranks is used in the DIMM slots. What this means is
>that only 2 sticks of 2 rank memory is recommended to run at the full
>400MHz or 4 sticks of 1 rank memory. There is a memory timing issue
>with more than 4 ranks of memory, which is a limitation of the memory
>controller on the Opteron chips."
>
>In the past, this downclocking was automatically enforced by some
>BIOSes, but supposedly there is no need to do so with currently shipping
>Opteron Rev. E and later, provided that the motherboard also allows full
>8 ranks at DDR-400.
>
>I'd just like to be sure... Also, has anyone observed increased memory
>latency with dual-rank modules?
>
>Sincerely,
>Josip
-------------- next part --------------
/*-----------------10-6-2003 3:48-------------------*
*
* This program rasml.c measures the Random Average Shared Memory Latency (RASML)
* Thanks to Agner Fog for his excellent random number generator.
*
* This testset is using a 64 bits optimized RNG of Agner Fog's ranrot generator.
*
* Created by Vincent Diepeveen who hereby releases this under GPL
* Feel free to look at the FSF (free software foundation) for what
* GPL is and its conditions.
*
* Please don't confuse the times achieved here with two times the one-way
* ping-pong latency, though on ideally scaling supercomputers/clusters they
* will be close. There are a few differences:
* a) this test thrashes the TLB
* b) this test tests ALL processors at the same time and not
* just 2 CPUs while the rest of the entire cluster is idle.
* c) this test ships 8 bytes, whereas one-way ping-pong is typically also
* used to test sizes of several kilobytes, or just returns a pong.
* d) this doesn't use MPI but shared memory, and the way such protocols are
* implemented possibly matters for latency.
*
* Vincent Diepeveen diep at xs4all.nl
* Veenendaal, The Netherlands 10 june 2003
*
* First a few lines about the random number generator. Note that I modified Agner Fog's
* RanRot very slightly. Basically its initialization has been done better and some dead
* slow FPU code rewritten to fast 64 bits integer code.
*/
#define UNIX 1 /* set to 1 when you are under Unix or using a gcc-like compiler */
#define IRIX 1 /* this value only matters when UNIX is set to 1. For Linux set it to 0.
* Basically, shared memory allocation is done in a pretty buggy way in
* the Linux kernel.
*
* Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
* and see for yourself how much shared memory YOU can allocate in Linux.
*
* If that is not enough to run this benchmark, then try raising it with:
* echo <newsize> > /proc/sys/kernel/shmmax
* You must be root to do that, and it has to be redone each time the system boots.
*/
#define FREEBSD 0 // be sure to not use more than 2 GB memory with freebsd with this test. sorry.
#if UNIX
#include <pthread.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/times.h>
#include <sys/time.h>
#include <unistd.h>
#else
#include <windows.h>
#include <winbase.h> // for GetTickCount()
#include <process.h> // _spawnl
#endif
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define SWITCHTIME 60000 /* in milliseconds. Modify this to let a test run longer or shorter.
* Basically it is a good idea to use about the number of CPUs times
* a thousand for this. 30 seconds is fine for PCs, but a very
* bad idea for supercomputers. I recommend several minutes
* there, and at least a few hours for big supers if the partition isn't started yet.
* If the partition is started, starting at 460 processors (SGI) should
* take 10 minutes; otherwise it takes 3 hours to attach all.
* Of course that lets a test take way, way longer.
*/
#define MAXPROCESSES 512 /* this test can go up to this amount of processes to be tested */
#define CACHELINELENGTH 128 /* cache line length at the machine. Modify this if you want to */
#if UNIX
#include <time.h>
// #include <memory.h>
#define FORCEINLINE static __inline /* static avoids undefined references under C99 inline rules when a call is not inlined */
/* UNIX and such this is 64 bits unsigned variable: */
#define BITBOARD unsigned long long
#else
#define FORCEINLINE __forceinline
/* in WINDOWS we also want to be 64 bits: */
#define BITBOARD unsigned _int64
#endif
#define STATUS_NOTSTARTED 0
#define STATUS_ATTACH 1
#define STATUS_GOATTACH 2
#define STATUS_ATTACHED 3
#define STATUS_STARTREAD 4
#define STATUS_READ 5
#define STATUS_MEASUREREAD 6
#define STATUS_MEASUREDREAD 7
#define STATUS_QUIT 10
struct ProcessState {
volatile int status; /* one of the STATUS_* values defined above,
* from 0 = not started yet
*
* up to 10 = quit
* */
/* now the numbers each cpu gathers. The name of the first number is what
* cpu0 is doing and the second name what all the other cpu's were doing at that
* time
*/
volatile BITBOARD readread; /* */
char dummycacheline[CACHELINELENGTH];
};
typedef struct {
BITBOARD nentries; // number of entries of 64 bits used for cache.
struct ProcessState ps[MAXPROCESSES];
} GlobalTree;
void RanrotAInit(void);
float ToNano(BITBOARD);
int GetClock(void);
float TimeRandom(void);
void ParseBuffer(BITBOARD);
void ClearHash(void);
void DeAllocate(void);
int DoNrng(BITBOARD);
int DoNreads(BITBOARD);
int DoNreadwrites(BITBOARD);
//void TestLatency(float);
int AllocateTree(void);
void InitTree(int);
void WaitForStatus(int,int);
void PutStatus(int,int);
int CheckStatus(int,int);
int CheckAllStatus(int,int);
void Slapen(int);
float LoopRandom(void);
/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK 17
#define JJ 10
#define R1 5
#define R2 3
/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2; /* indexes into history buffer */
/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
cpus; // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];
/******************************************************** AgF 1999-03-03 *
* Random Number generator 'RANROT' type B *
* by Agner Fog *
* *
* This is a lagged-Fibonacci type of random number generator with *
* rotation of bits. The algorithm is: *
* X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b *
* *
* The last k values of X are stored in a circular buffer named *
* randbuffer. *
* *
* This version works with any integer size: 16, 32, 64 bits etc. *
* The integers must be unsigned. The resolution depends on the integer *
* size. *
* *
* Note that the function RanrotAInit must be called before the first *
* call to RanrotA or iRanrotA *
* *
* The theory of the RANROT type of generators is described at *
* www.agner.org/random/ranrot.htm *
* *
*************************************************************************/
FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}
/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
/* generate next random number */
BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
/* rotate list pointers */
if( --r_p1 < 0)
r_p1 = KK - 1;
if( --r_p2 < 0 )
r_p2 = KK - 1;
return x;
}
/* this function initializes the random number generator. */
void RanrotAInit(void) {
int i;
/* one can fill the randbuffer with other values here */
randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);
/* initialize pointers to circular buffer */
r_p1 = 0;
r_p2 = JJ;
/* randomize */
for( i = 0; i < 300; i++ )
(void)RanrotA();
}
/* Now the RASML code */
char *To64(BITBOARD x) {
static char buf[256];
char *sb;
sb = &buf[0];
#if UNIX
sprintf(buf,"%llu",x);
#else
sprintf(buf,"%I64u",x);
#endif
return sb;
}
int GetClock(void) {
/* The accuracy is measured in milliseconds. The function used is very accurate according
* to the NT team, way more accurate nowadays than mentioned in the MSDN manual. The accuracy
* for Linux or Unix we can only guess. Too many experts there.
*/
#if UNIX
struct timeval timeval;
struct timezone timezone;
gettimeofday(&timeval, &timezone);
return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
#else
return((int)GetTickCount());
#endif
}
float ToNano(BITBOARD nps) {
/* convert something from times a second to nanoseconds.
* NOTE THAT THERE ARE SOMETIMES COMPILER BUGS IN OLD COMPILERS,
* SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
* NOT MODIFY THIS CODE */
float tn;
tn = 1000000000/(float)nps;
return tn;
}
float TimeRandom(void) {
/* timing the random number generator is very easy of course. Returns
* number of random numbers a second that can get generated
*/
BITBOARD bb=0,i,value,nps;
float ns_rng;
int t1,t2,took;
printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
printf("Please wait a few seconds.. "); fflush(stdout);
value = 100000;
took = 0;
while( took < 3000 ) {
value <<= 2; // x4
t1 = GetClock();
for( i = 0; i < value; i++ ) {
bb ^= RanrotA();
}
t2 = GetClock();
took = t2-t1;
}
nps = (1000*value)/(BITBOARD)took;
#if UNIX
printf("..took %i milliseconds to generate %llu numbers\n",took,value);
printf("Speed of RNG = %llu numbers a second\n",nps);
#else
printf("..took %i milliseconds to generate %I64u numbers\n",took,value);
printf("Speed of RNG = %I64u numbers a second\n",nps);
#endif
ns_rng = ToNano(nps);
printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);
return ns_rng;
}
void ParseBuffer(BITBOARD nbytes) {
tree->nentries = nbytes/sizeof(BITBOARD);
#if UNIX
printf("Trying to allocate %llu entries. ",tree->nentries);
printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
#else
printf("Trying to allocate %s entries. ",To64(tree->nentries));
printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
#endif
}
void ClearHash(void) {
BITBOARD *hi,i,nentries = tree->nentries;
/* clearing hashtable */
printf("Clearing hashtable for processor %i\n",ProcessNumber);
fflush(stdout);
hi = hashtable[ProcessNumber];
for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
hi[i] = i;
}
void DeAllocate(void) {
int i;
#if UNIX
shmctl(shm_tree,IPC_RMID,0);
for( i = 0; i < cpus; i++ ) {
shmctl(shm_hash[i],IPC_RMID,0);
}
#else
UnmapViewOfFile(tree);
for( i = 0; i < cpus; i++ ) {
UnmapViewOfFile(hashtable[i]);
}
#endif
}
int DoNrng(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= (index+(BITBOARD)i2);
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
int DoNreads(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2,ncpu;
ncpu = cpus;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD rani=RanrotA(),index=rani%nents;
unsigned int i2 = (unsigned int)(rani>>32)%ncpu;
dummyres ^= hashtable[i2][index];
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
#if 0
int DoNreadwrites(BITBOARD n) {
BITBOARD i=1,dummyres,nents;
int t1,t2;
nents = nentries; /* hopefully this gets into a register */
dummyres = globaldummy;
t1 = GetClock();
do {
BITBOARD index = RanrotA()%nents;
dummyres ^= hashtable[index];
hashtable[index] = dummyres;
} while( i++ < n );
t2 = GetClock();
globaldummy = dummyres;
return(t2-t1);
}
void TestLatency(float ns_rng) {
BITBOARD n,nps_read,nps_rw,nps_rng;
float ns,fns;
int timetaken;
printf("Doing random RNG test. Please wait..\n");
n = 50000000; // 50 mln
timetaken = DoNrng(n);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
/* READING SINGLE CPU RANDOM ENTRIES */
printf("Doing random read tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreads(n);
nps_read = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_read);
printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);
/* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
printf("Doing random readwrite tests single cpu. Please wait..\n");
n = 100000000; // 100 mln
timetaken = DoNreadwrites(n);
nps_rw = (1000*n) / (BITBOARD)timetaken;
ns = ToNano(nps_rw);
printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);
printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
}
#endif
int AllocateTree(void) { /* initialize the tree. returns 0 if error */
#if UNIX
shm_tree = shmget(
ftok(".",'t'),
sizeof(GlobalTree),IPC_CREAT|0777);
if( shm_tree == -1 )
return 0;
tree = (GlobalTree *)shmat(shm_tree,0,0);
if( tree == (GlobalTree *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
if( !ProcessNumber ) {
HANDLE TreeFileMap;
TreeFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0, /* (HANDLE)0xFFFFFFFF is not the invalid handle on 64-bit Windows */
(DWORD)sizeof(GlobalTree),"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
else { /* Slaves also try to attach to the tree */
HANDLE TreeFileMap;
TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
if( TreeFileMap == NULL )
return 0;
tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( tree == NULL )
return 0;
}
#endif
return 1;
}
int AttachAll(void) {
#if UNIX
#else
HANDLE HashFileMap;
#endif
char hashname2[32] = {"RASM_Hash00"},hashname[32];
int i,r;
for( r = 0; r < cpus; r++ ) {
i = ProcessNumber+r;
i %= cpus;
if( i == ProcessNumber )
continue;
#if UNIX
shm_hash[i] = shmget(
#if IRIX
ftok(".",200+i),
#else
ftok(".",(char)i),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[i] == -1 )
return 0;
hashtable[i] = (BITBOARD *)shmat(shm_hash[i],0,0);
if( hashtable[i] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
strcpy(hashname,hashname2);
hashname[9] += (i/10);
hashname[10] += (i%10);
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[i] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[i] == NULL )
return 0;
#endif
}
return 1;
}
int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error */
char hashname[32] = {"RASM_Hash00"};
#if UNIX
shm_hash[ProcessNumber] = shmget(
#if IRIX
ftok(".",200+ProcessNumber),
#else
ftok(".",(char)ProcessNumber),
#endif
tree->nentries*8,IPC_CREAT|0777);
if( shm_hash[ProcessNumber] == -1 )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)shmat(shm_hash[ProcessNumber],0,0);
if( hashtable[ProcessNumber] == (BITBOARD *)-1 )
return 0;
#else /* so windows NT. This might even work under win98 and such crap OSes, but not win95 */
//if( !ProcessNumber ) {
HANDLE HashFileMap;
hashname[9] += (ProcessNumber/10);
hashname[10] += (ProcessNumber%10);
HashFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0, /* (HANDLE)0xFFFFFFFF is not the invalid handle on 64-bit Windows */
(DWORD)tree->nentries*8,hashname);
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;
//}
//else { /* Slaves attach also try to attach to the tree */
/* HANDLE HashFileMap;
HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
if( HashFileMap == NULL )
return 0;
hashtable[ProcessNumber] = (BITBOARD *)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
if( hashtable[ProcessNumber] == NULL )
return 0;*/
//}
#endif
return 1;
}
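int StartProcesses(int ncpus) {
char buf[256];
int i;
/* returns 1 if the ncpus-1 additional processes started ok, 0 on error */
if( ncpus == 1 )
return 1;
for( i = 1 ; i < ncpus ; i++ ) {
sprintf(buf,"%i_%i",i+1,ncpus);
#if UNIX
{
pid_t pid = fork();
if( pid == -1 )
return 0; /* fork failed */
if( pid == 0 ) {
execl(rasmexename,rasmexename,buf,(char *)NULL);
_exit(1); /* only reached when execl failed; the child must not keep forking */
}
}
#else
if( _spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL) == -1 )
return 0;
#endif
}
return 1;
}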
void InitTree(int ncpus) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = STATUS_NOTSTARTED;
tree->ps[i].readread = 0;
}
}
void WaitForStatus(int ncpus,int waitforstate) {
/* wait for all processors to have the same state */
int i,badluck=1;
while( badluck ) {
badluck = 0;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != waitforstate )
badluck = 1;
}
}
}
void PutStatus(int ncpus,int statenew) {
int i;
for( i = 0 ; i < ncpus ; i++ ) {
tree->ps[i].status = statenew;
}
}
int CheckStatus(int ncpus,int statenew) {
/* returns false when not all cpu's are in the new state */
int i;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != statenew )
return 0;
}
return 1;
}
int CheckAllStatus(int ncpus,int status) {
/* Tries with a single loop to determine whether the other cpu's also finished
*
* returns:
* true ==> when all the processes have this status
* false ==> when 1 or more are still busy measuring
*/
int i,badluck=1;
for( i = 0 ; i < ncpus ; i++ ) {
if( tree->ps[i].status != status ) {
badluck = 0;
break;
}
}
return badluck;
}
void Slapen(int ms) { /* 'slapen' is Dutch for 'to sleep' */
#if UNIX
usleep(ms*1000); /* usleep() takes microseconds */
#else
Sleep(ms); /* Sleep() takes milliseconds */
#endif
}
float LoopRandom(void) {
BITBOARD n,nps_rng;
float fns;
int timetaken;
printf("Benchmarking random RNG test. Please wait..\n");
n = 25000000; // 25 mln, doubled to 50 mln before the first run
timetaken = 0;
while( timetaken < 500 ) {
n += n;
timetaken = DoNrng(n);
}
printf("timetaken=%i\n",timetaken);
nps_rng = (1000*n) / (BITBOARD)timetaken;
fns = ToNano(nps_rng);
printf("Machine needs %f ns for RND loop\n",fns);
return fns;
}
/* Main program: parses the arguments, sets up shared memory and runs the measurement */
int main(int argc,char *argv[]) {
/* allocate a big memory buffer parameter is in bytes.
* don't hesitate to MODIFY this to how many gigabytes
* you want to try.
* The more the better i keep saying to myself.
*
* Note that under linux your maximum shared memory limit can be set with:
*
* echo <size> > /proc/sys/kernel/shmmax
*
* and under IRIX it is usually 80% from the total RAM onboard that can get allocated
*/
BITBOARD nbytes,firstguess;
float ns_rng,f_loop;
int tottimes,t1,t2;
if( argc <= 1 ) {
printf("Latency test usage is: latency <buffer> <cpus>\n");
printf("Where 'buffer' is the buffer size in bytes to allocate PER PROCESSOR\n");
printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
return 1;
}
/* parse the input */
nbytes = 0;
cpus = 1; // default
if( strchr(argv[1],'_') == NULL ) { /* main startup process */
int np = 0;
#if UNIX
#if FREEBSD
nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
#else
nbytes = (BITBOARD)atoll(argv[1]);
#endif
#else
nbytes = (BITBOARD)_atoi64(argv[1]);
#endif
printf("Welcome to RASM Latency!\n");
printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");
if( argc > 2 ) {
cpus = 0;
do {
cpus *= 10;
cpus += (int)(argv[2][np]-'1')+1;
np++;
} while( argv[2][np] >= '0' && argv[2][np] <= '9' );
}
//printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
ProcessNumber = 0;
/* check whether we are not getting out of bounds */
if( cpus < 1 || cpus > MAXPROCESSES ) {
printf("Error: %i processes is not supported; use 1 to %i, or recompile with a bigger MAXPROCESSES\n",cpus,MAXPROCESSES);
return 1;
}
/* find out the file name */
#if UNIX
strcpy(rasmexename,argv[0]);
#else
GetModuleFileName(NULL,rasmexename,2044);
#endif
printf("Stored in rasmexename = %s\n",rasmexename);
}
else { // latency 2_452 ==> means processor 2 out of 452.
int np = 0;
ProcessNumber = 0;
do {
ProcessNumber *= 10;
ProcessNumber += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]
np++; // skip underscore
cpus = 0;
do {
cpus *= 10;
cpus += (argv[1][np]-'1')+1; // n
np++;
} while( argv[1][np] >= '0' && argv[1][np] <= '9' );
//printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
}
/* first we setup the random number generator. */
RanrotAInit();
/* initialize shared memory tree; it gets used for communication between the processes */
if( !AllocateTree() ) {
printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
return 1;
}
if( !ProcessNumber )
ParseBuffer(nbytes);
nentries = tree->nentries;
/* Now some stuff only the Master has to do */
if( !ProcessNumber ) {
/* Master: now let's time the pseudo random generators speed in nanoseconds a call */
ns_rng = TimeRandom();
f_loop = LoopRandom();
printf("Trying to Allocate Buffer\n");
t1 = GetClock();
if( !AllocateHash() ) {
printf("Error: Could not allocate buffer!\n");
return 1;
}
t2 = GetClock();
printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
ClearHash(); // local hash
t1 = GetClock();
printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);
/* so now hashtable is setup and we know quite some stuff. So it is time to
* start all other processes */
InitTree(cpus);
printf("Starting Other processes\n");
t1 = GetClock();
if( !StartProcesses(cpus) ) {
printf("Error: Could not start processes\n");
DeAllocate();
return 1;
}
t2 = GetClock();
printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
}
else { /* all Slaves do this */
if( !AllocateHash() ) {
printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
return 1;
}
ClearHash(); // local hash
}
tree->ps[ProcessNumber].status = STATUS_ATTACH;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACH);
t2 = GetClock();
printf("Took %i milliseconds to synchronize %i additional processes\n",t2-t1,cpus-1);
t1 = GetClock();
/* now we can continue with the next phase that is attaching all the segments */
PutStatus(cpus,STATUS_GOATTACH);
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACH ) {
Slapen(500);
}
}
if( !AttachAll() ) {
printf("Error: process %i Could not attach correctly!\n",ProcessNumber);
return 1;
}
tree->ps[ProcessNumber].status = STATUS_ATTACHED;
if( ! ProcessNumber ) {
WaitForStatus(cpus,STATUS_ATTACHED);
t2 = GetClock();
printf("Took %i milliseconds to ATTACH. %llu total RAM\n",t2-t1,(BITBOARD)cpus*tree->nentries*8);
PutStatus(cpus,STATUS_STARTREAD);
printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
(SWITCHTIME/1000),(SWITCHTIME%1000));
}
else {
while( tree->ps[ProcessNumber].status == STATUS_ATTACHED ) {
Slapen(500);
}
}
tree->ps[ProcessNumber].status = STATUS_READ;
firstguess = 200000;
tottimes = 0;
for( ;; ) {
int timetaken = 0;
if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
/* this really MEASURES the readread */
BITBOARD ntried = 0,avnumber;
int totaltime=0;
while( totaltime < SWITCHTIME ) { /* measure for about SWITCHTIME milliseconds */
totaltime += DoNreads(firstguess);
ntried += firstguess;
}
/* now put the average number of readreads into the shared memory */
avnumber = (ntried*1000) / (BITBOARD)totaltime;
tree->ps[ProcessNumber].readread = avnumber;
/* show that it is finished */
tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;
/* now keep doing the same thing until status gets modified */
while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
(void)DoNreads(firstguess);
if( !ProcessNumber ) {
if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
PutStatus(cpus,STATUS_QUIT);
break;
}
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
BITBOARD nextguess;
/* now the software must try to determine how many reads a second are possible for that
* process
*/
//printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
timetaken = DoNreads(firstguess);
/* try to guess such that next test takes 1 second, or if test was too inaccurate
* then double the number simply. also prevents divide by zero error ;)
*/
if( timetaken < 400 )
nextguess = firstguess*2;
else
nextguess = (firstguess*1000)/(BITBOARD)timetaken;
firstguess = nextguess;
if( !ProcessNumber ) {
tottimes += timetaken;
if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
tottimes = 0;
if( CheckStatus(cpus,STATUS_READ) ) {
PutStatus(cpus,STATUS_MEASUREREAD);
} /* waits another SWITCH time before starting to measure */
}
}
}
else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
break;
}
/* now do the latency tests
*/
//TestLatency(ns_rng);
tree->ps[ProcessNumber].status = STATUS_QUIT;
if( !ProcessNumber ) {
BITBOARD averagereadread;
int i;
averagereadread = 0;
WaitForStatus(cpus,STATUS_QUIT);
printf("the raw output\n");
for( i = 0; i < cpus ; i++ ) {
BITBOARD tr=tree->ps[i].readread;
averagereadread += tr;
printf("%llu ",tr);
}
printf("\n");
averagereadread /= (BITBOARD)cpus;
printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
printf("Now for the final calculation it gets compensated:\n");
printf(" Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
}
DeAllocate();
return 0;
}
/* EOF latencyC.c */
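The final figure printed by main() comes from a short piece of arithmetic: the per-process read rates are averaged, the average is converted from reads per second to nanoseconds per read, and the cost of one iteration of the RNG-only loop is subtracted. The sketch below isolates that calculation under a few assumptions: the helper names are hypothetical, double is used instead of the program's float, and a rate of zero is reported as 0 rather than dividing by zero.

```c
/* Hypothetical helpers mirroring the arithmetic at the end of main(). */

/* Convert a rate in events per second to nanoseconds per event. */
double ns_per_event(unsigned long long per_second)
{
    if (per_second == 0)
        return 0.0;   /* assumption: report 0 instead of dividing by zero */
    return 1e9 / (double)per_second;
}

/* Average the per-process read rates (integer division, as in main()),
 * convert to ns per read and subtract the time of one RNG-only loop
 * iteration. */
double compensated_latency_ns(const unsigned long long *reads_per_sec,
                              int processes, double loop_ns)
{
    unsigned long long sum = 0;
    int i;

    if (processes < 1)
        return 0.0;
    for (i = 0; i < processes; i++)
        sum += reads_per_sec[i];
    return ns_per_event(sum / (unsigned long long)processes) - loop_ns;
}
```

With two processes measuring 4000000 and 6000000 reads a second and a loop cost of 50 ns, the average is 5000000 reads a second, which is 200 ns per read and 150 ns after compensation.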