[Beowulf] Large amounts of data to store and process
Ellis H. Wilson III
ellis at ellisv3.com
Mon Mar 4 08:06:36 PST 2019
On 3/4/19 1:38 AM, Jonathan Aquilina wrote:
> Good Morning all,
>
> I am working on a project that I sadly can't go into much detail on, but
> there will be quite large amounts of data ingested by this system that
> would need to be efficiently returned as output to the end user in
> around 10 minutes or so. I am in discussions with another partner
> involved in this project about the best way forward on this.
>
> For me, given the amount of data (and it is a huge amount of data), an
> RDBMS such as PostgreSQL would be a major bottleneck. Another thing
> that was considered was flat files, and I think the best fit for that
> would be a Hadoop cluster with HDFS. But in the case of HPC, how can
> such an environment help in terms of ingesting and analyzing large
> amounts of data? Would said flat files be put on a SAN/NAS or something
> and accessed through an NFS share for computational purposes?
There has been a lot of good discussion about various tools (databases,
filesystems, processing frameworks, etc.) on this thread, but I fear
we're putting the cart before the horse in many respects. A few key
questions/concerns that need to be answered/considered before you begin
the tool selection process:
1. Is there existing storage already, and if so, in what ways does it
fail to meet this project's needs? This will give you key clues as to
what your new storage needs to deliver, or how you might ideally improve
or expand the existing storage system to meet those needs.
2. Remember that every time you create a distinct storage pool for a
distinct project, you are creating a nightmare down the road, and are
dis-aggregating your capacity and performance. Especially with the
extremely thin pipes into hard drives today, the more you can keep them
working in concert, the better. Hadoop, for all of its benefits, is a
typical example of storage isolation (usually from existing
more-POSIX-compliant NAS storage) that can create problems when future
projects come up and can't be ported to run atop HDFS.
3. Run the software in question against your sample dataset and collect
block traces, then do some analysis. Is it predominantly random I/O,
sequential I/O, or mixed? Is it metadata-heavy or data-heavy? What
does the file distribution look like? What semantics does the
application expect, or can these be adapted to meet what the storage
can provide? (A rough sketch of this kind of first-pass trace analysis
follows the list.) You may or may not be able to share some of these
stats depending on your NDA.
4. Talk with the application designers -- are there discrete phases to
their application(s)? This will help you intelligently block trace
those phases rather than the entire run, which would be quite onerous.
Do they expect additional phases down the line, or will the software
"always" act this way, roughly speaking? If you hyper-tune for a highly
sequential, metadata-light workload today but future workloads attempt
to index into it, you end up with another difficult project and another
discrete storage pool, which is unfortunate.
5. What future consumers of the data in question might there be down the
line? The more you purpose-build this system for just this project, the
bigger the headache you create for whatever future project wants to use
this data as well in a slightly different way. Some balance must be
struck here.
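To make point 3 a bit more concrete, here is a minimal sketch (Python) of
the kind of first-pass trace summary I have in mind. It assumes
blkparse-style default text output; the field positions, the 'Q' event
filter, the "sequential" gap threshold, and the script name are my own
assumptions and will need adjusting to whatever tracer you actually use.
The optional time window is there for the per-phase tracing mentioned in
point 4, and similar quick scripts can answer the file-size-distribution
and metadata-vs-data questions.

#!/usr/bin/env python3
# Rough first-pass summary of a block trace: read vs. write and
# sequential vs. random request counts, plus bytes moved.
#
# Assumes blkparse default text output, one event per line, roughly:
#   8,0  3  11  12.005082384  4162  Q  WS 312725480 + 8 [fio]
# (device, cpu, seq, time, pid, action, RWBS, sector, '+', nsectors,
# process).  Only 'Q' (queue) events are counted, and a request is
# called "sequential" if it starts within SEQ_GAP sectors of where the
# previous request on the same device ended -- both heuristics you may
# want to tune.
import sys
from collections import defaultdict

SEQ_GAP = 8  # sectors of slack still counted as sequential

def summarize(lines, t_start=None, t_end=None):
    last_end = {}               # device -> end sector of previous request
    counts = defaultdict(int)   # (read/write/other, seq/rand) -> requests
    bytes_moved = defaultdict(int)
    for line in lines:
        f = line.split()
        # Skip anything that is not a well-formed queue event.
        if len(f) < 10 or f[5] != 'Q' or f[8] != '+':
            continue
        try:
            t, sector, nsect = float(f[3]), int(f[7]), int(f[9])
        except ValueError:
            continue
        # Optional time window, e.g. to isolate one application phase.
        if (t_start is not None and t < t_start) or \
           (t_end is not None and t > t_end):
            continue
        dev, rwbs = f[0], f[6]
        kind = 'write' if 'W' in rwbs else 'read' if 'R' in rwbs else 'other'
        prev = last_end.get(dev)
        seq = prev is not None and 0 <= sector - prev <= SEQ_GAP
        counts[(kind, 'seq' if seq else 'rand')] += 1
        bytes_moved[kind] += nsect * 512
        last_end[dev] = sector + nsect
    return counts, bytes_moved

if __name__ == '__main__':
    # Usage (hypothetical filename):
    #   blkparse -i trace | python3 tracesum.py [t_start] [t_end]
    t0 = float(sys.argv[1]) if len(sys.argv) > 1 else None
    t1 = float(sys.argv[2]) if len(sys.argv) > 2 else None
    counts, bytes_moved = summarize(sys.stdin, t0, t1)
    for (kind, pattern), n in sorted(counts.items()):
        print(f"{kind:>6} {pattern:>4}: {n} requests")
    for kind, b in sorted(bytes_moved.items()):
        print(f"{kind:>6} total: {b / 2**20:.1f} MiB")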
I think if you provided answers/responses to the above (and to many of
Joe's points) we could give you better advice. Trying to understand
what a wooden jointer plane is and how to use it prior to fully
understanding not only the immediate task at hand but also potential
future ways the data might be used is a recipe for disaster.
Best,
ellis