[Beowulf] Pretty Big Data
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Mon Jan 25 07:13:11 PST 2016
Comment interspersed below.
On 1/25/16, 6:41 AM, "Beowulf on behalf of Jason Riedy"
<beowulf-bounces at beowulf.org on behalf of jason at lovesgoodfood.com> wrote:
>And Christopher Samuel writes:
>> The rest of us will carry on as before I suspect...
>Using libraries that hide the (sometimes proprietary) API behind
>sufficient POSIX semantics... Pretty much what the linked article says.
>The "new" architecture is just the old architecture with the fastest
>components at the relevant data sizes. The host CPU is faster than the
>storage CPU (in general), so move FS logic there when possible. Limit
>meta data scaling needs by splitting the storage. eh.
>But I suspect the point is that people think they're willing to give up
>the POSIX semantics they cannot even specify (and often already have
>given up) to say they're using faster hardware. Kinda like computational
>accelerators. Those started with lighter semantics: single user, no
>double precision, no atomics, no...
>The big, open question that's terribly difficult to address in research
>space: How do you efficiently mix multiple massive storage allocations
>that need high performance for a three to five year funding period and
>then archival storage afterward? I suspect much commercial data has a
>similar time horizon for immediate usefulness. Health care data is
>interestingly different. CPU allocations are relatively short, so
>inefficiencies from splitting that usage are relatively short-lived.
>Storage lasts longer.
That's an interesting point. Your funding is short lived, so "get results
now" might be more important than "get more results cheaply later". That
pushes towards standardized familiar access methods (POSIX) rather than custom
APIs, because I suspect that in most of these cases, the rate limiting
resource is the software development, not the computational/storage
And it has that interesting "process it now, but save everything for
later" aspect. Much of "the web" is ephemeral, and is clearly not
intended for long-term archiving and retrieval; given the large number of
dollars being spent on tools to use "the web", they're going to be tailored
for that "retrieve recent data" kind of model.
(After all, I don't find myself searching through 5 year old emails very
often.. I do, but not very often compared to "what did X send me a month