[Beowulf] Large amounts of data to store and process

Douglas Eadline deadline at eadline.org
Mon Mar 4 07:49:29 PST 2019


> I read, though, that postgres can handle time series data with no
> problem. I am just concerned that the clients may want to do complex
> big data analytics on the data. At this stage we are just prototyping
> and things are very much up in the air. I am wondering, though,
> whether sticking with HDFS and Hadoop is the best way to go in terms
> of performance and overall analytical capabilities.
>
> What I am trying to understand is how Hadoop, being written in Java,
> is so performant.

Hadoop is about optimizing the slow points when working with large
amounts of data (i.e., reading from disk). "Moving computation to
the data" is the general idea. (BTW, the Hadoop resource manager,
YARN, treats data locality as a schedulable resource.)

HDFS is a layer on top of a native FS that is designed as a
distributed (not parallel) write-once, read-many file system.
It basically provides "sliced" (block-replicated) data across
the various storage nodes for MapReduce operations.
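
To make that concrete, here is a minimal sketch (not production code)
of the write-once, read-many pattern through the Java FileSystem API.
The namenode address and file path are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally picked up from core-site.xml; this
            // namenode address is hypothetical.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path p = new Path("/data/flights/sample.csv");

            // Write once: an HDFS file is immutable after close
            // (append exists, but no in-place rewrites).
            try (FSDataOutputStream out = fs.create(p)) {
                out.writeBytes("sensor_id,reading\n");
            }

            // Read many: any number of readers, on any node, can
            // scan the replicated blocks.
            try (FSDataInputStream in = fs.open(p)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }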

Hadoop's speed comes from exploiting the SIMD-like data parallelism
of MapReduce in many operations (mostly analytics, where scanning
large amounts of data is what matters).
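
As a toy example, assume CSV lines of the form "sensor_id,reading"
(my invention, not anything from this thread). The same map logic is
applied to every HDFS block in parallel, and the reduce side collects
one maximum per sensor. No header or error handling in this sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxPerSensor {

        // Each mapper instance scans the block(s) local to its node.
        public static class MaxMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable off, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                ctx.write(new Text(f[0]),
                          new DoubleWritable(Double.parseDouble(f[1])));
            }
        }

        // Each reducer sees all values for one sensor, no matter
        // which node mapped them.
        public static class MaxReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text sensor, Iterable<DoubleWritable> vals,
                                  Context ctx)
                    throws IOException, InterruptedException {
                double max = Double.NEGATIVE_INFINITY;
                for (DoubleWritable v : vals) max = Math.max(max, v.get());
                ctx.write(sensor, new DoubleWritable(max));
            }
        }
    }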

Classic "batch" oriented multi-stage MR jobs can be slow because
intermediate data is written to disk. This ensured that huge amounts
of data could be analyzed but slowed down small to medium sized jobs.

Hadoop now has an optimization layer (called Tez) that keeps
intermediate data in memory between stages. High-level tools like
Pig and Hive use Tez by default.
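
In Hive, the engine is just a session-level setting. A rough sketch
using the HiveServer2 JDBC driver (the hostname, credentials, and the
"flights" table are all hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTezSketch {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection c = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement s = c.createStatement()) {

                // "mr" = classic write-to-disk MapReduce;
                // "tez" = DAG execution with in-memory intermediates.
                s.execute("set hive.execution.engine=tez");

                try (ResultSet r = s.executeQuery(
                        "SELECT sensor_id, max(reading) " +
                        "FROM flights GROUP BY sensor_id")) {
                    while (r.next()) {
                        System.out.println(r.getString(1) + "\t"
                                + r.getDouble(2));
                    }
                }
            }
        }
    }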

Spark puts everything in memory and is therefore fast. Unfortunately,
there were a lot of "Hadoop is slow and Spark is fast" benchmarks
comparing Hadoop batch mode (write to disk) to native Spark. Tez has
pretty much eliminated this advantage. I get asked all the time,
"What should I use: Pig, Hive, Spark?" My response is always,
"It all depends on what you want to do."


Hope that helps

--
Doug

>
> Regards,
> Jonathan
>
> On 04/03/2019, 12:11, "Beowulf on behalf of Fred Youhanaie"
> <beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>
>     Hi Jonathan
>
>     I have used PostgreSQL for collecting data, but there's nothing
>     there that would be of use to you!
>
>     A few years ago I set up a similar system (in a hurry) in a small
>     company. The bulk data was compressed and made available to the
>     applications via NFS (IPoIB). The applications were responsible
>     for decompressing and pre/post-processing the data. Later, one of
>     the developers created a PostgreSQL-based system to hold all the
>     data; he used C++ for all the data handling. That system was never
>     used, even though all the historical data was loaded into the
>     database!
>
>     Your choice of components is going to depend on how your analytics
>     software is going to access the data. If the data are being read
>     and processed only once, then loading them into a database and
>     querying them once may not pay off.
>
>     Cheers,
>     Fred
>
>     On 04/03/2019 09:24, Jonathan Aquilina wrote:
>     > Hi Fred,
>     >
>     > I and my colleague had done some research and found an extension
>     > for postgresql called TimescaleDB, but upon further research
>     > postgres on its own is good for such data as well. The thing is,
>     > these are not going to be given to us as the data comes in, but
>     > in bulk at the end from the parent company.
>     >
>     > Have you used postgresql for these types of data, and how has it
>     > performed?
>     >
>     > Regards,
>     > Jonathan
>     >
>     > On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie"
>     > <beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk>
>     > wrote:
>     >
>     >      Hi Jonathan,
>     >
>     >      It seems you're collecting metrics and time series data.
>     >      Perhaps a time series database (TSDB) is an option for you.
>     >      There are a few of these out there, but I don't have any
>     >      personal recommendation.
>     >
>     >      Cheers,
>     >      Fred
>     >
>     >      On 04/03/2019 07:04, Jonathan Aquilina wrote:
>     >      > These would be numerical data such as integers or floating
>     >      > point numbers.
>     >      >
>     >      > -----Original Message-----
>     >      > From: Tony Brian Albers <tba at kb.dk>
>     >      > Sent: 04 March 2019 08:04
>     >      > To: beowulf at beowulf.org; Jonathan Aquilina
>     >      > <jaquilina at eagleeyet.net>
>     >      > Subject: Re: [Beowulf] Large amounts of data to store and
>     >      > process
>     >      >
>     >      > Hi Jonathan,
>     >      >
>     >      > From my limited knowledge of the technologies, I would say
>     >      > that HBase with file pointers to the files placed on HDFS
>     >      > would suit you well.
>     >      >
>     >      > But if the files are log files, consider tools suited to
>     >      > analyzing those, such as Kibana.
>     >      >
>     >      > /tony
>     >      >
>     >      >
>     >      > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
>     >      >> Hi Tony,
>     >      >>
>     >      >> Sadly I can't go into much detail, as I am under an NDA.
>     >      >> At this point, with the prototype, we have around 250 GB
>     >      >> of sample data, but this data is dependent on the type of
>     >      >> aircraft. Larger aircraft and longer flights will generate
>     >      >> a lot more data, as they have more sensors and will log
>     >      >> more than the sample data that I have. The sample data is
>     >      >> 250 GB for 35 aircraft of the same type.
>     >      >>
>     >      >> Regards,
>     >      >> Jonathan
>     >      >>
>     >      >> -----Original Message-----
>     >      >> From: Tony Brian Albers <tba at kb.dk>
>     >      >> Sent: 04 March 2019 07:48
>     >      >> To: beowulf at beowulf.org; Jonathan Aquilina
>     >      >> <jaquilina at eagleeyet.net>
>     >      >> Subject: Re: [Beowulf] Large amounts of data to store and
>     >      >> process
>     >      >>
>     >      >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
>     >      >>> Good Morning all,
>     >      >>>
>     >      >>> I am working on a project that I sadly can't go into much
>     >      >>> detail about, but there will be quite large amounts of
>     >      >>> data ingested by this system, and output would need to be
>     >      >>> returned efficiently to the end user in around 10 minutes
>     >      >>> or so. I am in discussions with another partner involved
>     >      >>> in this project about the best way forward on this.
>     >      >>>
>     >      >>> For me, given the amount of data (and it is a huge amount
>     >      >>> of data), an RDBMS such as postgresql would be a major
>     >      >>> bottleneck. Another thing that was considered was flat
>     >      >>> files, and I think the best for that would be a Hadoop
>     >      >>> cluster with HDFS. But in the case of HPC, how can such an
>     >      >>> environment help in terms of ingesting and analyzing large
>     >      >>> amounts of data? Would said flat files of data be put on a
>     >      >>> SAN/NAS or something and accessed through an NFS share for
>     >      >>> computational purposes?
>     >      >>>
>     >      >>> Regards,
>     >      >>> Jonathan
>     >      >>
>     >      >> Good morning,
>     >      >>
>     >      >> Around here, we're using HBase for similar purposes. We
>     >      >> have a bunch of smaller nodes storing the data, and all
>     >      >> the management nodes (Ambari, HDFS namenodes, etc.) are
>     >      >> VMs.
>     >      >>
>     >      >> Our nodes are configured so that we have a maximum of 2
>     >      >> cores per disk spindle and 4 GB of memory per core. This
>     >      >> seems to do the trick and is pretty responsive.
>     >      >>
>     >      >> But to be able to provide better advice, you will probably
>     >      >> need to go into a bit more detail about what types of data
>     >      >> you will be storing and what kinds of calculations you
>     >      >> want to perform.
>     >      >>
>     >      >> /tony
>     >      >>
>     >      >>
>     >      >> --
>     >      >> Tony Albers - Systems Architect - IT Development
>     >      >> Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C,
>     >      >> Denmark
>     >      >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142




