<div dir="auto">I think you are asking more than one question. I think you need real time communication, fast reliable storage, analytics and presentation for investors.<div dir="auto">Making your needs clear will help people help you.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 4, 2019, 6:28 AM Joe Landman <<a href="mailto:joe.landman@gmail.com">joe.landman@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
> Hi Tony,
>
> Sadly I can't go into much detail, as I am under an NDA. At this point the prototype has around 250 GB of sample data, but this figure depends on the type of aircraft. Larger aircraft and longer flights will generate far more data, as they have more sensors and log more than the sample data I have. The 250 GB of sample data covers 35 aircraft of the same type.

You need to return your answers in ~10m (600s), with an assumed data set size of 250GB or more (assuming you meant GB and not Gb). Much depends upon the nature of the calculation: whether or not you can perform it on subsets of the data, and whether it requires multiple passes through the data.

I've noticed some recommendations popping up ahead of understanding what the rate-limiting factors are for returning results from calculations on this data set. I'd suggest focusing on the analysis needs to start, as this will provide some level of guidance on the system(s) design required to meet your objectives.

First off, do you know whether your code will meet this 600s response time with this 250GB data set? I am assuming this is unknown at the moment, but if you have response time data for smaller data sets, you could construct a rough scaling study and build a simple predictive model.

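As a very rough illustration, such a scaling study can be as simple as fitting a power law to runtimes measured on smaller subsets and extrapolating to 250GB. The subset sizes, timings, and the power-law form below are placeholder assumptions, not measurements; plug in your own numbers:

    # Rough scaling study: fit t = a * size^b to measured runtimes on
    # smaller data sets, then extrapolate to the full 250GB bolus.
    # The timings below are made-up placeholders; substitute your own.
    import numpy as np

    sizes_gb   = np.array([10.0, 25.0, 50.0, 100.0])   # subset sizes (GB), hypothetical
    runtimes_s = np.array([30.0, 70.0, 160.0, 340.0])  # measured response times (s), hypothetical

    # Fit log(t) = log(a) + b*log(size), i.e. a straight line in log-log space
    b, log_a = np.polyfit(np.log(sizes_gb), np.log(runtimes_s), 1)
    a = np.exp(log_a)

    predicted_s = a * 250.0 ** b
    print(f"fit: t ~ {a:.2f} * size^{b:.2f}")
    print(f"predicted runtime at 250GB: {predicted_s:.0f}s (budget: 600s)")

If the extrapolated number lands anywhere near (or over) 600s, that tells you up front whether you are shopping for faster storage, more memory, or more nodes.
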
Second, do you need the entire bolus of data, all 250GB, in order to generate a response to within the required accuracy? If not, great, and what size do you need?

Third, will this data set grow over time (looking at your writeup, it looks like this is a definite "yes")?

Fourth, does the code require local physical access to the entire data bolus (everything it needs for the calculation) in order to operate correctly?

Fifth, will the data access patterns for the code be streaming, searching, or random? In only one of these cases would a database (SQL or noSQL) be a viable option.

Sixth, is your working data set size comparable to the bolus size (e.g. 250GB)?

Seventh, can your code work correctly with sharded data (a variation on the second point)?

Now some brief "data physics".

a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once, assuming large-block sequential reads. For a 600s response time, that leaves you with 350s to calculate. Is this enough time? Is a single pass (streaming) workable?

b) (data in RAM) 250GB @ 100GB/s -> 2.5s to walk through once in parallel amongst multiple cores. If multiple/many passes through the data are required, this strongly suggests a large-memory machine (512GB or larger).

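To make the arithmetic in a) and b) explicit, a trivial sketch; the bandwidth figures are just the assumptions above, not measurements of your hardware:

    # Back-of-envelope "data physics": time for one full pass over the data
    # at a given sustained bandwidth. Bandwidths are the assumed figures above.
    DATA_GB  = 250.0
    BUDGET_S = 600.0

    def pass_time_s(data_gb, bandwidth_gb_per_s):
        """Seconds for one sequential pass over data_gb at the given bandwidth."""
        return data_gb / bandwidth_gb_per_s

    disk_pass_s = pass_time_s(DATA_GB, 1.0)    # ~1GB/s durable storage
    ram_pass_s  = pass_time_s(DATA_GB, 100.0)  # ~100GB/s aggregate memory bandwidth

    print(f"one pass from disk: {disk_pass_s:.0f}s, leaving {BUDGET_S - disk_pass_s:.0f}s to compute")
    print(f"one pass from RAM:  {ram_pass_s:.1f}s per pass")
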
c) if your data is shardable, and you can distribute it amongst N machines, the above analyses still hold, replacing the 250GB with the size of the shards. If you can do this, how much information does your code need to share amongst the worker nodes in order to effect the calculation? This will provide guidance on interconnect choices.

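And the sharded variant of the same arithmetic; the node count and per-node bandwidth below are placeholder assumptions:

    # Sharded variant: the same pass-time arithmetic, applied per shard.
    # N_NODES and the per-node bandwidth are placeholder assumptions.
    DATA_GB  = 250.0
    N_NODES  = 8
    SHARD_GB = DATA_GB / N_NODES

    def pass_time_s(data_gb, bandwidth_gb_per_s):
        return data_gb / bandwidth_gb_per_s

    shard_pass_s = pass_time_s(SHARD_GB, 1.0)  # ~1GB/s local storage per node
    print(f"{SHARD_GB:.1f}GB per shard -> {shard_pass_s:.0f}s per streaming pass per node")
    # Whatever you save here gets partly spent exchanging intermediate results
    # between nodes; measuring that exchange volume guides the interconnect choice.
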
Basically, I am advocating focusing on the analysis needs, how they scale/grow, and your near/medium/long-term goals with this, before you commit to a specific design/implementation. Avoid the "if all you have is a hammer, every problem looks like a nail" view as much as possible.

-- 
Joe Landman
e: joe.landman@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman
