[Beowulf] Large amounts of data to store and process

John Hearns hearnsj at googlemail.com
Mon Mar 4 04:36:33 PST 2019


Jonathan, I am going to stick my neck out here. I feel that HDFS was a
'thing of its time' - people are slavishly building clusters with local
SATA drives to follow that recipe.
Current parallel filesystems have adapters which make them behave like HDFS:
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_hadoopconnector.htm
http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
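
Swapping one of these in is mostly transparent to the application code.
As a rough, untested sketch (the monitor address is a placeholder, and
the property names are the ones the cephfs-hadoop docs above describe,
so check them against your plugin version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CephHadoopSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the Hadoop filesystem layer at CephFS instead of HDFS.
            conf.set("fs.defaultFS", "ceph://mon-host:6789/");
            conf.set("fs.ceph.impl",
                     "org.apache.hadoop.fs.ceph.CephFileSystem");

            // From here on, the code is identical to the plain-HDFS case.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/sample.dat"))) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
        }
    }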

Also you all know what is coming next.  Julia.  (Sorry all!)
https://wilmott.com/big-time-series-analysis-with-juliadb/    (I guess this
is specific to finance)
https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/


On Mon, 4 Mar 2019 at 11:16, Jonathan Aquilina <jaquilina at eagleeyet.net>
wrote:

> I read, though, that Postgres can handle time-series data no problem. I am
> just concerned that the clients will want to do complex big-data analytics
> on the data. At this stage we are just prototyping and things are very up
> in the air. I am wondering, though, whether sticking with HDFS and Hadoop
> is the best way to go in terms of performance and overall analytical
> capabilities.
>
> What I am trying to understand is how Hadoop, being written in Java, is so
> performant.
>
> Regards,
> Jonathan
>
> On 04/03/2019, 12:11, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>
>     Hi Jonathan
>
>     I have used PostgreSQL for collecting data, but there's nothing there
> that would be of use to you!
>
>     A few years ago I set up a similar system (in a hurry) at a small
>     company. The bulk data was compressed and made available to the
>     applications via NFS (IPoIB); the applications were responsible for
>     decompressing and pre/post-processing it. Later, one of the developers
>     created a PostgreSQL-based system to hold all the data, using C++ for
>     all the data handling. That system was never used, even though all the
>     historical data was loaded into the database!
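>
>     For what it's worth, the consumer side was nothing fancier than the
>     usual stream-decompress-process loop; a minimal sketch in Java (the
>     NFS mount point and file name here are invented):
>
>     import java.io.BufferedReader;
>     import java.io.InputStreamReader;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>     import java.util.zip.GZIPInputStream;
>
>     public class NfsScan {
>         public static void main(String[] args) throws Exception {
>             // /mnt/bulk stands in for the NFS-over-IPoIB mount.
>             try (BufferedReader r = new BufferedReader(new InputStreamReader(
>                     new GZIPInputStream(Files.newInputStream(
>                             Paths.get("/mnt/bulk/day-001.csv.gz")))))) {
>                 // Stand-in for the real pre/post-processing.
>                 System.out.println(r.lines().count() + " records");
>             }
>         }
>     }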
>
>     Your choice of components is going to depend on how your analytics
>     software is going to access the data. If the data are read and
>     processed only once, then loading them into a database and querying it
>     once may not pay off.
>
>     Cheers,
>     Fred
>
>     On 04/03/2019 09:24, Jonathan Aquilina wrote:
>     > Hi Fred,
>     >
>     > My colleague and I did some research and found an extension for
>     > PostgreSQL called TimescaleDB, but upon further research Postgres on
>     > its own is good for such data as well. The thing is, the data are not
>     > going to be given to us as they come in, but in bulk at the end from
>     > the parent company.
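>     >
>     > From the docs, the TimescaleDB side looks simple enough. A minimal,
>     > untested sketch over JDBC (connection details, table and column
>     > names are all made up for illustration):
>     >
>     > import java.sql.Connection;
>     > import java.sql.DriverManager;
>     > import java.sql.Statement;
>     >
>     > public class TsdbSetup {
>     >     public static void main(String[] args) throws Exception {
>     >         try (Connection c = DriverManager.getConnection(
>     >                 "jdbc:postgresql://dbhost/flightdata", "user", "pass");
>     >              Statement s = c.createStatement()) {
>     >             s.execute("CREATE EXTENSION IF NOT EXISTS timescaledb");
>     >             s.execute("CREATE TABLE readings (" +
>     >                       "time timestamptz NOT NULL, " +
>     >                       "sensor_id integer, " +
>     >                       "value double precision)");
>     >             // create_hypertable() is TimescaleDB's call to turn a
>     >             // plain table into a time-partitioned hypertable.
>     >             s.execute("SELECT create_hypertable('readings', 'time')");
>     >         }
>     >     }
>     > }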
>     >
>     > Have you used PostgreSQL for such types of data, and how has it
>     > performed?
>     >
>     > Regards,
>     > Jonathan
>     >
>     > On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>     >
>     >      Hi Jonathan,
>     >
>     >      It seems you're collecting metrics and time series data.
> Perhaps a time series database (TSDB) is an option for you. There are a few
> of these out there, but I don't have any personal recommendation.
>     >
>     >      Cheers,
>     >      Fred
>     >
>     >      On 04/03/2019 07:04, Jonathan Aquilina wrote:
>     >      > These would be numerical data such as integers or
>     >      > floating-point numbers.
>     >      >
>     >      > -----Original Message-----
>     >      > From: Tony Brian Albers <tba at kb.dk>
>     >      > Sent: 04 March 2019 08:04
>     >      > To: beowulf at beowulf.org; Jonathan Aquilina <
> jaquilina at eagleeyet.net>
>     >      > Subject: Re: [Beowulf] Large amounts of data to store and
> process
>     >      >
>     >      > Hi Jonathan,
>     >      >
>     >      > From my limited knowledge of the technologies, I would say that
>     >      > HBase with file pointers to the files placed on HDFS would suit
>     >      > you well.
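>     >      >
>     >      > A very rough, untested sketch of that pointer idea with the
>     >      > HBase client API (table, column family and path are made up):
>     >      >
>     >      > import org.apache.hadoop.conf.Configuration;
>     >      > import org.apache.hadoop.hbase.HBaseConfiguration;
>     >      > import org.apache.hadoop.hbase.TableName;
>     >      > import org.apache.hadoop.hbase.client.Connection;
>     >      > import org.apache.hadoop.hbase.client.ConnectionFactory;
>     >      > import org.apache.hadoop.hbase.client.Put;
>     >      > import org.apache.hadoop.hbase.client.Table;
>     >      > import org.apache.hadoop.hbase.util.Bytes;
>     >      >
>     >      > public class StorePointer {
>     >      >     public static void main(String[] args) throws Exception {
>     >      >         Configuration conf = HBaseConfiguration.create();
>     >      >         try (Connection conn = ConnectionFactory.createConnection(conf);
>     >      >              Table t = conn.getTable(TableName.valueOf("flights"))) {
>     >      >             // Keep only metadata in HBase; the cell value is a
>     >      >             // pointer to the bulk file sitting on HDFS.
>     >      >             Put p = new Put(Bytes.toBytes("aircraft42#2019-03-04"));
>     >      >             p.addColumn(Bytes.toBytes("f"), Bytes.toBytes("hdfs_path"),
>     >      >                 Bytes.toBytes("hdfs:///flights/aircraft42/day.dat"));
>     >      >             t.put(p);
>     >      >         }
>     >      >     }
>     >      > }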
>     >      >
>     >      > But if the files are log files, consider tools suited to
>     >      > analyzing those, such as Kibana.
>     >      >
>     >      > /tony
>     >      >
>     >      >
>     >      > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
>     >      >> Hi Tony,
>     >      >>
>     >      >> Sadly I can't go into much detail, as I am under an NDA. At
>     >      >> this point with the prototype we have around 250 GB of sample
>     >      >> data, but the volume depends on the type of aircraft: larger
>     >      >> aircraft and longer flights will generate a lot more data, as
>     >      >> they have more sensors and will log more than the sample data
>     >      >> I have. The sample data is 250 GB for 35 aircraft of the same
>     >      >> type.
>     >      >>
>     >      >> Regards,
>     >      >> Jonathan
>     >      >>
>     >      >> -----Original Message-----
>     >      >> From: Tony Brian Albers <tba at kb.dk>
>     >      >> Sent: 04 March 2019 07:48
>     >      >> To: beowulf at beowulf.org; Jonathan Aquilina <
> jaquilina at eagleeyet.net>
>     >      >> Subject: Re: [Beowulf] Large amounts of data to store and
> process
>     >      >>
>     >      >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
>     >      >>> Good Morning all,
>     >      >>>
>     >      >>> I am working on a project that I sadly can't go into much
>     >      >>> detail about, but there will be quite large amounts of data
>     >      >>> ingested by this system, and the output would need to be
>     >      >>> returned efficiently to the end user in around 10 minutes or
>     >      >>> so. I am in discussions with another partner involved in this
>     >      >>> project about the best way forward.
>     >      >>>
>     >      >>> For me, given the amount of data (and it is a huge amount of
>     >      >>> data), an RDBMS such as PostgreSQL would be a major
>     >      >>> bottleneck. Another thing that was considered was flat files,
>     >      >>> and I think the best fit for that would be a Hadoop cluster
>     >      >>> with HDFS. But in the case of HPC, how can such an environment
>     >      >>> help in terms of ingesting and analyzing large amounts of
>     >      >>> data? Would said flat files be put on a SAN/NAS or something,
>     >      >>> and accessed through an NFS share for computational purposes?
>     >      >>>
>     >      >>> Regards,
>     >      >>> Jonathan
>     >      >>
>     >      >> Good morning,
>     >      >>
>     >      >> Around here, we're using HBase for similar purposes. We have a
>     >      >> bunch of smaller nodes storing the data, and all the management
>     >      >> nodes (Ambari, HDFS namenodes, etc.) are VMs.
>     >      >>
>     >      >> Our nodes are configured so that we have a maximum of 2 cores
>     >      >> per disk spindle and 4 GB of memory for each core. This seems
>     >      >> to do the trick and is pretty responsive.
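>     >      >>
>     >      >> (So a hypothetical node with 12 spindles would be capped at
>     >      >> 12 x 2 = 24 cores and 24 x 4 GB = 96 GB of memory under that
>     >      >> rule of thumb.)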
>     >      >>
>     >      >> But to be able to provide better advice, you will probably
>     >      >> need to go into a bit more detail about what types of data you
>     >      >> will be storing and what kinds of calculations you want to
>     >      >> perform.
>     >      >>
>     >      >> /tony
>     >      >>
>     >      >>
>     >      >> --
>     >      >> Tony Albers - Systems Architect - IT Development Royal
> Danish Library,
>     >      >> Victor Albecks Vej 1, 8000 Aarhus C, Denmark
>     >      >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
>     >      >
>     >      > --
>     >      > Tony Albers - Systems Architect - IT Development Royal Danish
> Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
>     >      > Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>