[Beowulf] Large amounts of data to store and process
John Hearns
hearnsj at googlemail.com
Mon Mar 4 04:36:33 PST 2019
Jonathan, I am going to stick my neck out here. I feel that HDFS was a
'thing of its time' - people are slavishly building clusters with local
SATA drives to follow that recipe.
Current parallel filesystems have adapters which make them behave like HDFS:
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_hadoopconnector.htm
http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
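To give a flavour of how thin these shims are, the CephFS one is
essentially a core-site.xml change plus the plugin jar. A rough sketch
based on the first link above; the monitor address is a placeholder, and
the exact property names vary with the Hadoop and plugin versions:

    <property>
      <name>fs.default.name</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>

MapReduce and HBase then talk to CephFS through the ordinary Hadoop
FileSystem API - no local-SATA recipe required.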
Also you all know what is coming next. Julia. (Sorry all!)
https://wilmott.com/big-time-series-analysis-with-juliadb/ (I guess this
is specific to finance)
https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/
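For a flavour of the out-of-core workflow, a minimal sketch (the data
directory, store name, and column names are invented; the keyword
arguments are the ones described in the JuliaDB docs above):

    using Distributed
    addprocs(4)                         # workers share the table chunks
    @everywhere using JuliaDB, Statistics

    # Ingest a directory of CSVs into an on-disk binary store. The output
    # keyword keeps the table out of core, split into the given chunks.
    t = loadtable("data/"; output="flights_bin", chunks=64)

    # Later sessions reopen the store; operations stream chunk by chunk.
    t = load("flights_bin")

    # Example: per-aircraft mean of a hypothetical fuel_flow column.
    groupby(mean, t, :aircraft_id; select = :fuel_flow)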
On Mon, 4 Mar 2019 at 11:16, Jonathan Aquilina <jaquilina at eagleeyet.net>
wrote:
> I read, though, that Postgres can handle time series data no problem. I
> am just concerned about whether the clients will want to do complex big
> data analytics on the data. At this stage we are just prototyping and
> things are very much up in the air. I am wondering, though, whether
> sticking with HDFS and Hadoop is the best way to go in terms of
> performance and overall analytical capabilities.
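On the Postgres point above: TimescaleDB's trick is to turn an ordinary
table into a time-partitioned "hypertable", after which plain SQL keeps
working. A minimal sketch via LibPQ.jl, to stay with the Julia theme; the
connection string, table, and column names are placeholders:

    using LibPQ

    # Placeholder connection string; adjust for your environment.
    conn = LibPQ.Connection("host=localhost dbname=telemetry")

    # TimescaleDB partitions an ordinary table by time. The extension and
    # a sensor_data(ts, aircraft_id, fuel_flow) table are assumed to exist.
    execute(conn, "SELECT create_hypertable('sensor_data', 'ts');")

    # Queries stay plain SQL; the time predicate prunes partitions.
    result = execute(conn,
        """
        SELECT aircraft_id, avg(fuel_flow)
        FROM sensor_data
        WHERE ts BETWEEN \$1 AND \$2
        GROUP BY aircraft_id;
        """,
        ["2019-03-01 00:00", "2019-03-04 00:00"])

    close(conn)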
>
> What I am trying to understand is how Hadoop, being written in Java, is
> so performant.
>
> Regards,
> Jonathan
>
> On 04/03/2019, 12:11, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>
> Hi Jonathan
>
> I have used PostgreSQL for collecting data, but there's nothing there
> that would be of use to you!
>
> A few years ago I set up a similar system (in a hurry) in a small
> company. The bulk data was compressed and it was made available to the
> applications via NFS (IPoIB). The applications were responsible
> for decompressing and pre/post-processing the data. Later, one of the
> developers created a PostgreSQL-based system to hold all the data, using
> C++ for all the data handling. That system was never used, even though
> all the historical data was loaded into the database!
>
> Your choice of components is going to depend on how your analytics
> software is going to access the data. If the data will be read and
> processed only once, loading it into a database and querying it once
> may not pay off.
>
> Cheers,
> Fred
>
> On 04/03/2019 09:24, Jonathan Aquilina wrote:
> > Hi Fred,
> >
> > My colleague and I did some research and found an extension for
> > PostgreSQL called TimescaleDB, but on further research plain PostgreSQL
> > seems good for such data as well. The thing is, the data will not be
> > given to us as it comes in, but in bulk at the end from the parent
> > company.
> >
> > Have you used PostgreSQL for such types of data, and how has it
> > performed?
> >
> > Regards,
> > Jonathan
> >
> > On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
> >
> > Hi Jonathan,
> >
> > It seems you're collecting metrics and time series data. Perhaps a
> > time series database (TSDB) is an option for you. There are a few of
> > these out there, but I don't have any personal recommendation.
> >
> > Cheers,
> > Fred
> >
> > On 04/03/2019 07:04, Jonathan Aquilina wrote:
> > > These would be numerical data such as integers or floating-point
> > > numbers.
> > >
> > > -----Original Message-----
> > > From: Tony Brian Albers <tba at kb.dk>
> > > Sent: 04 March 2019 08:04
> > > To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
> > > Subject: Re: [Beowulf] Large amounts of data to store and process
> > >
> > > Hi Jonathan,
> > >
> > > From my limited knowledge of the technologies, I would say that
> > > HBase, with file pointers to the files placed on HDFS, would suit
> > > you well.
> > >
> > > But if the files are log files, consider tools suited to analyzing
> > > them, such as Kibana.
> > >
> > > /tony
> > >
> > >
> > > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
> > >> Hi Tony,
> > >>
> > >> Sadly I can't go into much detail, as I am under an NDA. At this
> > >> point with the prototype we have around 250 GB of sample data, but
> > >> the volume depends on the type of aircraft. Larger aircraft and
> > >> longer flights will generate a lot more data, as they have more
> > >> sensors and will log more than the sample data I have. The sample
> > >> data is 250 GB for 35 aircraft of the same type.
> > >>
> > >> Regards,
> > >> Jonathan
> > >>
> > >> -----Original Message-----
> > >> From: Tony Brian Albers <tba at kb.dk>
> > >> Sent: 04 March 2019 07:48
> > >> To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
> > >> Subject: Re: [Beowulf] Large amounts of data to store and process
> > >>
> > >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
> > >>> Good Morning all,
> > >>>
> > >>> I am working on a project that I sadly can't go into much detail
> > >>> about, but quite large amounts of data will be ingested by this
> > >>> system and will need to be returned efficiently to the end user in
> > >>> around 10 minutes or so. I am in discussions with another partner
> > >>> involved in this project about the best way forward on this.
> > >>>
> > >>> For me, given the amount of data (and it is a huge amount), an
> > >>> RDBMS such as PostgreSQL would be a major bottleneck. Another
> > >>> option considered was flat files, and I think the best fit there
> > >>> would be a Hadoop cluster with HDFS. But in the case of HPC, how
> > >>> can such an environment help with ingesting and analyzing large
> > >>> amounts of data? Would said flat files be put on a SAN/NAS and
> > >>> accessed through an NFS share for computational purposes?
> > >>>
> > >>> Regards,
> > >>> Jonathan
> > >>
> > >> Good morning,
> > >>
> > >> Around here, we're using HBase for similar purposes. We have a
> > >> bunch of smaller nodes storing the data, and all the management
> > >> nodes (Ambari, HDFS namenodes, etc.) are VMs.
> > >>
> > >> Our nodes are configured so that we have a maximum of 2 cores per
> > >> disk spindle and 4 GB of memory per core; a 12-spindle node, for
> > >> example, gets at most 24 cores and 96 GB of RAM. This seems to do
> > >> the trick and is pretty responsive.
> > >>
> > >> But to be able to provide better advice, you will probably need
> > >> to go into a bit more detail about what types of data you will be
> > >> storing and what kinds of calculations you want to perform.
> > >>
> > >> /tony
> > >>
> > >>
> > >> --
> > >> Tony Albers - Systems Architect - IT Development, Royal Danish
> > >> Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
> > >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
> > >
> > > --
> > > Tony Albers - Systems Architect - IT Development, Royal Danish
> > > Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
> > > Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142