[Beowulf] Large amounts of data to store and process

Rémy Dernat remy.dernat at umontpellier.fr
Mon Mar 4 05:17:12 PST 2019


Hi,

I don't know exactly what you would like to do with these data, but if 
I were you, I would take a close look at Elasticsearch, Spark or even 
HDF5, depending on what your analysis software looks like (what it is 
coded with...), to see whether these technologies could save you some 
time. Moreover, I wouldn't dismiss NFS too easily, especially if your 
infrastructure already uses it.
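If your analysis software is Python-based, here is a minimal sketch of what an HDF5 layout could look like for this kind of numerical sensor data (the group and dataset names are hypothetical, and this assumes the h5py bindings):

```python
# Sketch: one group per aircraft/flight, one chunked, compressed
# dataset per sensor. Names and sizes are made up for illustration.
import h5py
import numpy as np

with h5py.File("flights.h5", "w") as f:
    grp = f.create_group("aircraft_001/flight_0001")
    samples = np.random.rand(10000).astype("f4")  # fake sensor readings
    grp.create_dataset("engine_temp", data=samples,
                       chunks=(4096,), compression="gzip")

with h5py.File("flights.h5", "r") as f:
    dset = f["aircraft_001/flight_0001/engine_temp"]
    print(dset.shape, dset.dtype)
```

Chunking plus compression lets you read back a slice of one sensor without touching the rest of the file, which matters at the volumes you describe.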

Elasticsearch can easily be paired with Kibana for visualization.
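For ingestion, a sketch of building an Elasticsearch _bulk payload (the index and field names here are made up, and nothing is actually sent to a server; a real loader would POST this body to the _bulk endpoint):

```python
import json

# Build an Elasticsearch _bulk request body (NDJSON) for a batch of
# numerical sensor readings; this only constructs the payload.
readings = [
    {"aircraft": "A001", "sensor": "engine_temp", "t": 0.0, "value": 712.5},
    {"aircraft": "A001", "sensor": "engine_temp", "t": 0.1, "value": 713.1},
]

lines = []
for doc in readings:
    lines.append(json.dumps({"index": {"_index": "flight-data"}}))
    lines.append(json.dumps(doc))
bulk_body = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
print(bulk_body)
```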

Best regards,


On 04/03/2019 08:04, Jonathan Aquilina wrote:
> These would be numerical data such as integers or floating point numbers.
>
> -----Original Message-----
> From: Tony Brian Albers <tba at kb.dk>
> Sent: 04 March 2019 08:04
> To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
> Subject: Re: [Beowulf] Large amounts of data to store and process
>
> Hi Jonathan,
>
> From my limited knowledge of the technologies, I would say that HBase with file pointers to the files placed on HDFS would suit you well.
>
> But if the files are log files, consider tools that are suited to analyzing those, such as Kibana.
>
> /tony
>
>
> On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
>> Hi Tony,
>>
>> Sadly I can't go into much detail as I am under an NDA. At this
>> point, with the prototype, we have around 250 GB of sample data, but
>> this data depends on the type of aircraft. Larger aircraft and longer
>> flights will generate a lot more data, as they have more sensors and
>> will log more than the sample data that I have. The sample data is
>> 250 GB for 35 aircraft of the same type.
>>
>> Regards,
>> Jonathan
>>
>> -----Original Message-----
>> From: Tony Brian Albers <tba at kb.dk>
>> Sent: 04 March 2019 07:48
>> To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
>> Subject: Re: [Beowulf] Large amounts of data to store and process
>>
>> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
>>> Good Morning all,
>>>
>>> I am working on a project that I sadly can't go into much detail
>>> about, but quite large amounts of data will be ingested by this
>>> system and will need to be efficiently returned as output to the
>>> end user in around 10 minutes or so. I am in discussions with
>>> another partner involved in this project about the best way forward
>>> on this.
>>>
>>> For me, given the amount of data (and it is a huge amount of data),
>>> an RDBMS such as PostgreSQL would be a major bottleneck. Another
>>> option that was considered is flat files, and I think the best
>>> choice for that would be a Hadoop cluster with HDFS. But in the
>>> case of HPC, how can such an environment help with ingesting and
>>> analyzing large amounts of data? Would said flat files be put on a
>>> SAN/NAS and accessed through an NFS share for computational
>>> purposes?
>>>
>>> Regards,
>>> Jonathan
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>> Computing To change your subscription (digest mode or unsubscribe)
>>> visit http://www.beowulf.org/mailman/listinfo/beowulf
>> Good morning,
>>
>> Around here, we're using HBase for similar purposes. We have a bunch
>> of smaller nodes storing the data, and all the management nodes
>> (Ambari, HDFS namenodes, etc.) are VMs.
>>
>> Our nodes are configured so that we have a maximum of 2 cores per
>> disk spindle and 4 GB of memory for each core. This seems to do the
>> trick and is pretty responsive.
>>
>> But to be able to provide better advice, you will probably need to go
>> into a bit more detail about what types of data you will be storing
>> and which kind of calculations you want to perform.
>>
>> /tony
>>
>>
>> --
>> Tony Albers - Systems Architect - IT Development Royal Danish Library,
>> Victor Albecks Vej 1, 8000 Aarhus C, Denmark
>> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142

-- 
Rémy Dernat
Ingénieur Système/Calcul
Plateforme MBB
Institut des Sciences de l'Evolution - Montpellier

