[Beowulf] single machine with 500 GB of RAM
Ellis H. Wilson III
ellis at cse.psu.edu
Wed Jan 9 16:47:36 PST 2013
On 01/09/2013 05:53 PM, Mark Hahn wrote:
> given the "slicing" methodology that this application uses for
> decomposition, I wonder whether it's actually closer to sequential
> in its access patterns, rather than random. the point here is that
> you absolutely must have ram if your access is random, since your
> only constraint is latency. if a lot of your accesses are sequential,
> then they are potentially much more IO-like - specifically disk-like.
This is a great point, but I'd caution against (not suggesting you are)
jumping straight to the conclusion that because a data space can be
sliced it must be sequential in nature. I worked a bit with Orca (a
Gaussian-type chem app) in my undergrad and found that some of the
sub-applications in the project were near-EP and sliced up cleanly to be
merged later, but were still random in nature during the bulk of the
execution. They just split the data space into "cubes" of sorts and
distribute those bounds (and sometimes the associated data) to the slave
procs. Within each process the accesses to its "cube" of data were very
close to completely random.
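To make that concrete, here's a toy Python sketch (nothing to do with
Orca's actual code, and all the sizes are made up) of the shape of that
pattern: the global grid is carved into cubes that get handed out to
workers, yet every touch inside a cube is random, so each process still
looks latency-bound to whatever sits underneath it.

import random

N = 512          # hypothetical global grid edge length
CUBE = 128       # hypothetical cube edge length handed to each worker

def cube_bounds(rank, cubes_per_dim):
    """Map a worker rank to the (x, y, z) bounds of its cube."""
    x = rank % cubes_per_dim
    y = (rank // cubes_per_dim) % cubes_per_dim
    z = rank // (cubes_per_dim ** 2)
    return tuple((d * CUBE, (d + 1) * CUBE) for d in (x, y, z))

def worker_accesses(rank, n_accesses=10):
    """Random touches confined to this worker's cube -- tidy slicing
    from a bird's-eye view, random access within each slice."""
    (x0, x1), (y0, y1), (z0, z1) = cube_bounds(rank, N // CUBE)
    return [(random.randrange(x0, x1),
             random.randrange(y0, y1),
             random.randrange(z0, z1)) for _ in range(n_accesses)]

print(worker_accesses(rank=5))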
This is a great example of why one should trace the application with a
wide array of real-life parameters before committing to the purchase.
Even something as trivial as running dstat while the application chugs
along would give you a good hint at where the bottlenecks are and where
you want to spend your cash.
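For instance, something along these lines (a rough Python/psutil
stand-in for dstat -- psutil assumed installed, dstat itself is the
one-liner version) run next to the application will tell you pretty
quickly whether you're burning your time in swap or on the disks:

import time
import psutil

def trace(interval=5, samples=120):
    prev_io = psutil.disk_io_counters()
    prev_swap = psutil.swap_memory()
    for _ in range(samples):
        time.sleep(interval)
        io, swap = psutil.disk_io_counters(), psutil.swap_memory()
        print("cpu %5.1f%%  read %8.1f MB/s  write %8.1f MB/s  "
              "swap-in %6.1f MB/s  swap-out %6.1f MB/s" % (
            psutil.cpu_percent(),
            (io.read_bytes - prev_io.read_bytes) / interval / 2**20,
            (io.write_bytes - prev_io.write_bytes) / interval / 2**20,
            (swap.sin - prev_swap.sin) / interval / 2**20,
            (swap.sout - prev_swap.sout) / interval / 2**20))
        prev_io, prev_swap = io, swap

if __name__ == "__main__":
    trace()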
>> have proper memory, it isn't optimized, and as a result you're
>> constantly swapping. Merges are a good example of what /should/ work
>
> if the domain is sliced in the "right" direction, merging should be
> very efficient. even if sliced in the wrong direction, merging should
> at least be block-able (and thus not terrible.)
Ah, excellent point!
>> merging on just one of them that is also outfitted with a ramdisk'd 0.5
>> TB Fusion-IO PCI-E flash device. If I am not wildly off the mark on the
>
> I wouldn't bother with PCI-E flash, myself. they tend to have
> dumb/traditional raid controllers on them. doing raid0 across a
> handful of cheap 2.5" SATA SSDs is ridiculously easy to do and will
> scale up fairly well (with some attention to the PCIE topology
> connecting the controllers, of course.)
My attention to them has revolved more around the latency issues that
arise when you add a HW RAID controller and the SATA protocol between
the CPU and the SSD. This is exacerbated if the access patterns do not
remain completely random but converge over the lifetime of the run -- in
that case the accesses may localize onto a single SSD in the RAID array
and bottleneck performance towards the tail end of the execution. This
could happen to a certain extent on a PCI-E device as well, but it's
somewhat less likely, since those fancy devices tend to migrate data if
a channel/package/die becomes massively overworked. Maybe there are RAID
controllers that do this as well; I'm unsure.
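A toy illustration of that localization worry (stripe unit, disk count,
and capacity all made up): with a 4-disk RAID0, a workload that
converges onto a narrow hot region lands almost entirely on one SSD.

import random
from collections import Counter

N_DISKS = 4
STRIPE = 128 * 1024          # assumed stripe unit in bytes
DEVICE_SPAN = 100 * 2**30    # 100 GiB of addressable space

def disk_for(offset):
    """Which member of the RAID0 array serves this byte offset."""
    return (offset // STRIPE) % N_DISKS

uniform = Counter(disk_for(random.randrange(DEVICE_SPAN))
                  for _ in range(100000))

hot_base = random.randrange(DEVICE_SPAN // STRIPE) * STRIPE  # one hot stripe
localized = Counter(disk_for(hot_base + random.randrange(STRIPE))
                    for _ in range(100000))

print("uniform access spread:  ", dict(uniform))
print("localized access spread:", dict(localized))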
I saw the bit on Tom's Hardware about scaling SSDs for throughput and
IOPS, but is there a scaling study anywhere that measures individual
request latencies and CDFs them? That would be really interesting to me,
because although IOPS is a nice number to throw around, if your
application is waiting on a particular request to come back it doesn't
really matter how many other IOs are completing in the interim -- you
need that specific request to complete, and to complete quickly, to keep
chugging along.
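Something like the following back-of-envelope Python is roughly what I
have in mind for a single device (the path and sizes are placeholders,
and for a real study you'd want O_DIRECT or fio's latency percentiles so
the page cache doesn't flatter the numbers):

import os
import random
import time

PATH = "/mnt/ssd/testfile"      # placeholder test file on the device
IO_SIZE = 4096
N_REQUESTS = 10000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
lat = []
for _ in range(N_REQUESTS):
    off = random.randrange(0, size - IO_SIZE, IO_SIZE)
    t0 = time.perf_counter()
    os.pread(fd, IO_SIZE, off)          # time each individual read
    lat.append((time.perf_counter() - t0) * 1e6)   # microseconds
os.close(fd)

lat.sort()
for pct in (50, 90, 99, 99.9):
    idx = min(len(lat) - 1, int(len(lat) * pct / 100))
    print("p%-5s %8.1f us" % (pct, lat[idx]))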
Best,
ellis