[Beowulf] Large amounts of data to store and process
Ellis H. Wilson III
ellis at ellisv3.com
Mon Mar 4 08:06:36 PST 2019
On 3/4/19 1:38 AM, Jonathan Aquilina wrote:
> Good Morning all,
>
> I am working on a project that I sadly can't go into much detail on, but
> there will be quite large amounts of data ingested by this system that
> would need to be efficiently returned as output to the end user in
> around 10 minutes or so. I am in discussions with another partner
> involved in this project about the best way forward on this.
>
> For me, given the amount of data (and it is a huge amount of data), an
> RDBMS such as PostgreSQL would be a major bottleneck. Another thing
> that was considered was flat files, and I think the best fit for that
> would be a Hadoop cluster with HDFS. But in the case of HPC, how can
> such an environment help in terms of ingesting and analyzing large
> amounts of data? Would said flat files be put on a SAN/NAS or something
> and accessed through an NFS share for computational purposes?
There has been a lot of good discussion about various tools (databases,
filesystems, processing frameworks, etc.) on this thread, but I fear
we're putting the cart before the horse in many respects. A few key
questions/concerns that need to be answered/considered before you begin
the tool selection process:
1. Is there existing storage already, and if so, in what ways does it
fail to meet this project's needs? This will give you key clues as to
what your new storage needs to deliver, or how you might ideally improve
or expand the existing storage system to meet those needs.
2. Remember that every time you create a distinct storage pool for a
distinct project, you are creating a nightmare down the road, and are
dis-aggregating your capacity and performance. Especially with the
extremely thin pipes into hard drives today, the more you can keep them
working in concert, the better. Hadoop, for all of its benefits, is a
typical example of storage isolation (usually from existing
more-POSIX-compliant NAS storage) that can create problems when future
projects come up and can't be ported to run atop HDFS.
3. Run the software in question against your sample dataset and collect
block traces, then do some analysis. Is it predominantly random I/O,
sequential I/O, or mixed? Is it metadata-heavy or data-heavy? What
does the file distribution look like? What semantics does the
application expect, or can these be adapted to meet what the storage
can provide? (A rough sketch of this kind of first-pass trace analysis
follows the list.) You may or may not be able to share some of these
stats depending on your NDA.
4. Talk with the application designers -- are there discrete phases to
their application(s)? This will help you intelligently block trace
those phases rather than the entire run, which would be quite onerous.
Do they expect additional phases down the line, or will the software
"always" act this way, roughly speaking? If you hyper-tune for a highly
sequential, metadata-light workload today but future workloads attempt
to index into it, you end up with another difficult project and another
discrete storage pool, which is unfortunate.
5. What future consumers of the data in question might there be down the
line? The more you purpose-build this system for just this project, the
bigger the headache you create for whatever future project wants to use
this data as well in a slightly different way. Some balance must be
struck here.
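To make point 3 a bit more concrete, here is a minimal sketch (Python) of
the kind of first-pass trace summary I have in mind. It assumes
blkparse-style default text output; the field positions, the 'Q' event
filter, the "sequential" gap threshold, and the script name are my own
assumptions and will need adjusting to whatever tracer you actually use.
The optional time window is there for the per-phase tracing mentioned in
point 4, and similar quick scripts can answer the file-size-distribution
and metadata-vs-data questions.

#!/usr/bin/env python3
# Rough first-pass summary of a block trace: read vs. write and
# sequential vs. random request counts, plus bytes moved.
#
# Assumes blkparse default text output, one event per line, roughly:
#   8,0  3  11  12.005082384  4162  Q  WS 312725480 + 8 [fio]
# (device, cpu, seq, time, pid, action, RWBS, sector, '+', nsectors,
# process).  Only 'Q' (queue) events are counted, and a request is
# called "sequential" if it starts within SEQ_GAP sectors of where the
# previous request on the same device ended -- both heuristics you may
# want to tune.
import sys
from collections import defaultdict

SEQ_GAP = 8  # sectors of slack still counted as sequential

def summarize(lines, t_start=None, t_end=None):
    last_end = {}               # device -> end sector of previous request
    counts = defaultdict(int)   # (read/write/other, seq/rand) -> requests
    bytes_moved = defaultdict(int)
    for line in lines:
        f = line.split()
        # Skip anything that is not a well-formed queue event.
        if len(f) < 10 or f[5] != 'Q' or f[8] != '+':
            continue
        try:
            t, sector, nsect = float(f[3]), int(f[7]), int(f[9])
        except ValueError:
            continue
        # Optional time window, e.g. to isolate one application phase.
        if (t_start is not None and t < t_start) or \
           (t_end is not None and t > t_end):
            continue
        dev, rwbs = f[0], f[6]
        kind = 'write' if 'W' in rwbs else 'read' if 'R' in rwbs else 'other'
        prev = last_end.get(dev)
        seq = prev is not None and 0 <= sector - prev <= SEQ_GAP
        counts[(kind, 'seq' if seq else 'rand')] += 1
        bytes_moved[kind] += nsect * 512
        last_end[dev] = sector + nsect
    return counts, bytes_moved

if __name__ == '__main__':
    # Usage (hypothetical filename):
    #   blkparse -i trace | python3 tracesum.py [t_start] [t_end]
    t0 = float(sys.argv[1]) if len(sys.argv) > 1 else None
    t1 = float(sys.argv[2]) if len(sys.argv) > 2 else None
    counts, bytes_moved = summarize(sys.stdin, t0, t1)
    for (kind, pattern), n in sorted(counts.items()):
        print(f"{kind:>6} {pattern:>4}: {n} requests")
    for kind, b in sorted(bytes_moved.items()):
        print(f"{kind:>6} total: {b / 2**20:.1f} MiB")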
I think if you provided answers/responses to the above (and to many of
Joe's points) we could give you better advice. Trying to understand
what a wooden jointer plane is and how to use it prior to fully
understanding not only the immediate task at hand but also potential
future ways the data might be used is a recipe for disaster.
Best,
ellis