[Beowulf] Accelerator for data compressing

Tue Oct 7 11:07:14 PDT 2008

Dmitri Chubarov wrote:
> Hello,
>
> we have got a VX50 down here as well. We have observed very different
> scalability on different applications. With an OpenMP molecular
> dynamics code we have got over 14 times speedup while on a 2D finite
> difference scheme I could not get far beyond 3 fold.
>   
2D finite difference can be comm intensive if the mesh is too small for 
each processor to have a fair amount of work to do before needing the 
neighboring values from a "far" node.
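
As a rough back-of-the-envelope sketch of that effect (my own 
simplification, not a measurement from the VX50): assume a square n x n 
mesh split into p x p square tiles and a 5-point stencil, so an interior 
tile needs one row or column of neighbor values per edge. The function 
name and parameters below are hypothetical, and NUMA placement is ignored.

```c
/* Ratio of neighbor (halo) values needed to cells updated locally, per
 * time step, for one interior tile of a 2D 5-point stencil.
 * Assumptions (illustrative only): square n x n mesh, p x p square tiles,
 * one halo row/column per edge. */
double halo_to_work_ratio(int n, int p)
{
    double tile = (double)n / p;   /* mesh points per side of one tile */
    double work = tile * tile;     /* cells updated locally */
    double halo = 4.0 * tile;      /* neighbor values on the four edges */
    return halo / work;            /* algebraically equal to 4p/n */
}
```

With 16 cores (p = 4), a 64 x 64 mesh gives a ratio of 0.25, i.e. one 
remote value for every four local updates, while a 4096 x 4096 mesh 
brings it down to about 0.004.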
> On Tue, Oct 7, 2008 at 10:45 PM, Eric Thibodeau <kyron at neuralbs.com> wrote:
>   
>> PS: Interesting figures, I couldn't resist compressing the same binary DB on
>> a 16Core Opteron (Tyan VX50) machine and was dumbfounded to get horrible
>> results given the same context. The processing speed only came up to 6.4E6
>> bytes/sec ...for 16 cores, and they were all at 100% during the entire run
>> (FWIW, I tried different block sizes and it does have an impact but this
>> also changes the problem parameters).
>>     
>
> Reading your message in the Beowulf list I should say that it looks
> interesting and probably shows something happening with the memory
> access on the NUMA nodes. Did you try to run the archiver with
> different affinity settings?
>   
I don't have affinity control over the app per se; I would have to 
look at and modify pbzip's code. Note, though, that the assignment of a 
PID to a processor is governed by the kernel and is thus a scheduler 
issue. I have also noticed that the kernel doesn't move the processes 
around the cores just for fun.
> We have observed that the memory architecture shows some strange
> behaviour. For instance the latency for a read from the same NUMA node
> for different nodes varies significantly.
>   
This is the nature of NUMA. Furthermore, if you have to cross to a far 
CPU, the latency is also dependent on the CPU's load.
> Also on the profiler I often see that x86 instructions that have one
> of the operands in memory may take disproportionally long. I believe
> that could explain the 100% CPU load reported by the kernel.
>   
How do you identify the specific instruction using a profiler? This is 
something that interests me.
> From the very little knowledge of this platform that we have got, I
> tend to advise the users not to expect good speedup on their
> multithreaded applications. 
Using OpenMP (from GCC 4.3.x) and an embarrassingly parallel problem 
(computing K-Means on a large database), I do get a significant speedup 
(15-16x).
> Yet it would be interesting to get a
> better understanding of the programming techniques for this sedecimus
> and the similar machines.
OpenMP is IMHO the easiest approach, and it will get you the most 
performance out of about 3 lines of #pragma directives. If you manage to 
get a cluster of VX50s, then learn a bit of MPI to glue all of this 
together ;)
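
To give an idea of how little it takes, here is a minimal sketch of the 
nearest-centroid assignment step of K-Means with a single OpenMP 
directive. This is not my actual code: the function name, the row-major 
data layout and the parameter names are assumptions made for 
illustration, and it assumes k >= 1.

```c
#include <float.h>
#include <stddef.h>

/* Assign each of n points (dim values each, row-major) to the nearest of
 * k centroids and return the total squared distance.
 * Hypothetical helper for illustration; assumes k >= 1. */
double assign_clusters(const double *points, size_t n, size_t dim,
                       const double *centroids, size_t k, int *label)
{
    double total = 0.0;

    /* Each iteration reads shared data but writes only label[i], so the
     * loop needs one directive; the reduction clause combines the
     * per-thread partial sums. Built without OpenMP support, the pragma
     * is ignored and the loop simply runs serially. */
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long i = 0; i < (long)n; i++) {
        double best = DBL_MAX;
        int best_c = 0;
        for (size_t c = 0; c < k; c++) {
            double d = 0.0;
            for (size_t j = 0; j < dim; j++) {
                double diff = points[i * dim + j] - centroids[c * dim + j];
                d += diff * diff;
            }
            if (d < best) {
                best = d;
                best_c = (int)c;
            }
        }
        label[i] = best_c;
        total += best;
    }
    return total;
}
```

With GCC this would be built with -fopenmp. The points are independent 
of one another, which is why this kind of loop scales so well compared 
to a stencil that needs neighbor values.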
> Even more so due to the QPI systems becoming
> commercially available very soon.
Don't know that one (QPI)...oh...new Intel stuff...no matter how much I 
try to stay ahead, I'm always years behind!
>  At the moment we have got a few
> small kernels written in C and Fortran with OpenMP that we use to
> evaluate different parallelization strategies. Unfortunately, there
> are no tools I would know of that could help to explain what's going
> on inside the memory of this machine.
>   
Check out TAU 
( http://www.cs.uoregon.edu/research/tau/home.php ); it will at least 
help you identify bottlenecks, and it gives you an impressive profiling 
infrastructure.
> I am very much interested to hear more about your experience with VX50.
>
> Best regards,
>   Dima Chubarov
>
> --
>   Dmitri Chubarov
>   junior researcher
>   Siberian Branch of the Russian Academy of Sciences
>   Institute of Computational Technologies
>   http://www.ict.nsc.ru/indexen.php
>   

Eric