[Beowulf] Large amounts of data to store and process

Douglas Eadline deadline at eadline.org
Mon Mar 4 07:45:44 PST 2019

> Good Morning all,
> I am working on a project that I sadly can't go into much detail about, but there
> will be quite large amounts of data that will be ingested by this system
> and would need to be efficiently returned as output to the end user in
> around 10 min or so. I am in discussions with another partner involved in
> this project about the best way forward on this.
> For me, given the amount of data (and it is a huge amount of data), an
> RDBMS such as PostgreSQL would be a major bottleneck. Another thing that
> was considered was flat files, and I think the best fit for that would be a Hadoop
> cluster with HDFS. But in the case of HPC how can such an environment help
> in terms of ingesting and analytics of large amounts of data? Would said
> flat files of data be put on a SAN/NAS or something and through an NFS
> share accessed that way for computational purposes?

With HDFS, think of a software-defined storage layer on top of an existing
FS. It is designed to store large amounts of data. The write step is
not necessarily fast because there is often a replication factor of
3 or more (this provides a very robust FS that can tolerate failures).
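
To put a rough number on that replication cost, here is a
back-of-the-envelope sketch in Python. The 100 TB figure and the function
name are made up for illustration; only the replication factor of 3 comes
from the discussion above.

```python
def raw_capacity_needed(logical_tb: float, replication: int = 3) -> float:
    """Raw disk (TB) consumed when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_tb * replication


logical_tb = 100.0  # hypothetical data set size, not from the original question
raw_tb = raw_capacity_needed(logical_tb)
print(f"{logical_tb:.0f} TB of data -> {raw_tb:.0f} TB of raw disk at replication 3")
```

The same multiplier applies to bytes written, which is one reason the
write step is not necessarily fast.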

The other thing about Hadoop/Spark etc. is the "data lake" concept.
Raw data are written to HDFS (or an alternative) as fast as possible,
and the eventual ETL (Extract, Transform, Load) step is often one
of the first stages of an analytics application. Indeed, this step
is often where large MapReduce resources are most useful. Once the data
are "prepped" and the feature matrix is created, the models
can usually be run with moderate resources.
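
As a toy illustration of that ETL stage, the sketch below runs a
map/reduce-style pass in plain Python. The record layout
(sensor id, timestamp, value), the sample lines, and the function names are
all assumptions for the example. On a real cluster the same steps would
run as a MapReduce or Spark job over files in HDFS rather than over an
in-memory list.

```python
from collections import defaultdict

# Hypothetical raw records, as they might be dumped into the data lake.
RAW_LINES = [
    "sensor-a,2019-03-04T07:00:00,1.5",
    "sensor-b,2019-03-04T07:00:00,4.0",
    "sensor-a,2019-03-04T07:01:00,2.5",
    "garbage line",
    "sensor-b,2019-03-04T07:01:00,6.0",
]


def extract(line):
    """Map step: parse one raw line into (key, value), or None if malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return parts[0], float(parts[2])
    except ValueError:
        return None


def transform(pairs):
    """Reduce step: group values by key and compute per-key features."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: {"count": len(vals), "mean": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in grouped.items()
    }


def load(features):
    """Load step: emit a small feature matrix as rows sorted by key."""
    return [[key, f["count"], f["mean"], f["max"]] for key, f in sorted(features.items())]


parsed = [p for p in map(extract, RAW_LINES) if p is not None]
feature_matrix = load(transform(parsed))
for row in feature_matrix:
    print(row)
```

Note that the feature matrix is much smaller than the raw input, which is
why the later modeling steps can usually get by with fewer resources.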

Running HPC-type applications on the data in HDFS depends on
your application details. Using Hadoop to "munge" data to
create an input data set available for further processing
is certainly possible. There are tools to move data between an
RDBMS and HDFS (Sqoop), or you could use some Hadoop
tools like HBase or Hive, if that fits your type of analysis.

Hope that helps.


> Regards,
> Jonathan

