[Beowulf] Large amounts of data to store and process

Mon Mar 4 07:04:07 PST 2019

I think you are asking more than one question. I think you need real-time
communication, fast and reliable storage, analytics, and presentation for
investors.
Making your needs clear will help people help you.

On Mon, Mar 4, 2019, 6:28 AM Joe Landman <joe.landman at gmail.com> wrote:

>
> On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
> > Hi Tony,
> >
> > Sadly I can't go into much detail because I am under an NDA. At this
> > point with the prototype we have around 250gb of sample data, but again
> > this data is dependent on the type of aircraft. Larger aircraft and longer
> > flights will generate a lot more data, as they have more sensors and will
> > log more data than the sample data that I have. The sample data is 250gb
> > for 35 aircraft of the same type.
>
>
> You need to return your answers in ~10 minutes (600s), with an assumed
> data set size of 250GB or more (assuming you meant GB and not Gb).  Whether
> that is achievable depends upon the nature of the calculation: whether you
> can perform the calculations on subsets, or whether they require multiple
> passes through the data.
>
> I've noticed some recommendations popping up ahead of understanding what
> the rate limiting factors are for returning the results from calculations
> based upon this data set.  I'd suggest focusing on the analysis needs to
> start, as this will provide some level of guidance on the system(s)
> design required to meet your objectives.
>
> First off, do you know whether your code will meet this 600s response
> time with this 250GB data set?  I am assuming this is unknown at this
> moment, but if you have response time data for smaller data sets, you
> could construct a rough scaling study and build a simple predictive model.
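A rough scaling study of the kind described above could be sketched as follows. This is only an illustration: the timing numbers are made up (they are not from the thread), the function names are hypothetical, and it assumes run time follows a simple power law t = a * size**b, which may not hold for the real code.

```python
import math

def fit_power_law(sizes_gb, times_s):
    """Least-squares fit of t = a * size**b, done as a straight line in log-log space."""
    xs = [math.log(s) for s in sizes_gb]
    ys = [math.log(t) for t in times_s]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # scaling exponent
    a = math.exp(mean_y - b * mean_x)    # seconds per GB**b
    return a, b

def predict_seconds(a, b, size_gb):
    """Extrapolate the fitted model to a larger data set."""
    return a * size_gb ** b

# Hypothetical measurements on small subsets (assumed, for illustration only).
sizes_gb = [1, 5, 10, 25]
times_s = [3.0, 15.0, 30.0, 75.0]

a, b = fit_power_law(sizes_gb, times_s)
estimate = predict_seconds(a, b, 250)
print(f"exponent b = {b:.2f}, predicted time for 250GB = {estimate:.0f}s")
print("meets 600s target" if estimate <= 600 else "misses 600s target")
```

An exponent near 1 suggests the work grows linearly with data size; an exponent well above 1 suggests the response time will degrade faster than the data grows.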
>
> Second, do you need the entire bolus of data, all 250GB, in order to
> generate a response to within required accuracy?  If not, great, and
> what size do you need?
>
> Third, will this data set grow over time (looking at your writeup, it
> looks like this is a definite "yes")?
>
> Fourth, does the code require physical access to all of the data bolus
> (what is needed for the calculation) locally in order to correctly operate?
>
> Fifth, will the data access patterns for the code be streaming,
> searching, or random?  In only one of these cases would a database (SQL
> or noSQL) be a viable option.
>
> Sixth, is your working data set size comparable to the bolus size (e.g.
> 250GB)?
>
> Seventh, can your code work correctly with sharded data (variation on
> second point)?
>
>
> Now some brief "data physics".
>
> a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once,
> assuming large block sequential read.  For a 600s response time, that
> leaves you with 350s to calculate.  Is this enough time?  Is a single
> pass (streaming) workable?
>
> b) (data in ram) 250GB @ 100GB/s -> 2.5s to walk through once in
> parallel amongst multiple cores.  If multiple/many passes through data
> are required, this strongly suggests a large memory machine (512GB or
> larger).
>
> c) if your data is shardable, and you can distribute it amongst N
> machines, the above analyses still hold, replacing the 250GB with the
> size of the shards.  If you can do this, how much information does your
> code need to share amongst the worker nodes in order to effect the
> calculation?  This will provide guidance on interconnect choices.
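The arithmetic in points a) through c) can be reproduced with a small calculator like the one below. It is a back-of-the-envelope sketch: the bandwidth figures are the ones quoted above, the function names are hypothetical, and the shard count of 5 is an assumption chosen only to show the effect of splitting the data.

```python
def read_time_s(size_gb, bandwidth_gb_per_s):
    """Seconds to walk through the data once at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s

def compute_budget_s(size_gb, bandwidth_gb_per_s, deadline_s=600.0, passes=1):
    """Seconds left for calculation after reading the data `passes` times."""
    return deadline_s - passes * read_time_s(size_gb, bandwidth_gb_per_s)

def shard_size_gb(total_gb, n_machines):
    """Per-machine data size, assuming the data splits evenly."""
    return total_gb / n_machines

# a) durable storage at 1GB/s, single streaming pass
print(read_time_s(250, 1), compute_budget_s(250, 1))

# b) data in RAM at 100GB/s aggregate memory bandwidth
print(read_time_s(250, 100))

# c) sharded across N machines (N = 5 is an assumed value)
print(read_time_s(shard_size_gb(250, 5), 1))
```

Setting `passes` above 2 at 1GB/s already exhausts the 600s deadline on a single machine, which is the case where a large memory machine or sharding becomes attractive.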
>
>
> Basically, I am advocating focusing on the analysis needs, how they
> scale/grow, and your near/medium/long term goals with this, before you
> commit to a specific design/implementation.  Avoid the "if all you have
> is a hammer, every problem looks like a nail" view as much as possible.
>
>
> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>