[Beowulf] Large amounts of data to store and process

Michael Di Domenico mdidomenico4 at gmail.com
Mon Mar 4 05:12:55 PST 2019


even though you've alluded to this being time series data, is there a
requirement that you index into the data, or do you just read the data
end-to-end and do some calculations?

i routinely face these kinds of issues, but we're not indexing into the
data, so having things in hdfs or an rdbms doesn't give us any benefit.
we pull all the data into organized flat files and blow through them
with HTCondor.  if the researcher wants to tweak the code, they do, and
then just rerun the whole simulation.
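to make that workflow concrete, here's a minimal sketch of the kind of
HTCondor submit description it implies.  the script name, chunk naming,
and job count are hypothetical, not from the original setup:

```
# Hypothetical submit description: run one analysis job per flat-file chunk.
# analyze.py and chunk_*.dat are illustrative names, not from the post.
executable           = analyze.py
arguments            = chunk_$(Process).dat
transfer_input_files = chunk_$(Process).dat
output               = out/job_$(Process).out
error                = out/job_$(Process).err
log                  = condor.log
queue 100
```

rerunning the "whole simulation" is then just `condor_submit` again after
the researcher edits the script.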

sometimes that's minutes, sometimes days.  but in either case the time
to develop code is always much shorter, because the data is in flat
files and easier for my "non-programmer" programmers.  no need to
learn hdfs/hadoop or sql.
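the "easy for non-programmers" part is that an end-to-end pass over a
flat file is just a loop.  a minimal sketch in Python, assuming a
hypothetical whitespace-delimited format of `timestamp value` per line:

```python
def mean_value(path):
    """Stream a flat time-series file end-to-end and average column 2.

    Assumes each line is 'timestamp value'; the format is illustrative.
    """
    total = 0.0
    count = 0
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            total += float(fields[1])
            count += 1
    return total / count if count else 0.0
```

no schema, no query language, no cluster client library to learn; the
same script runs unchanged as a single HTCondor job per file.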

if you need to index the data and jump around, hdfs is probably still
not the best solution unless you want to index the files, and 250gb
isn't really big enough to warrant an hdfs cluster.  i've generally
found that unless you're dealing with multi-TB+ datasets, you can't
scale the hardware out enough to get the speedup.  (yes, i know there
are tweaks to change this, but i've found it's just simpler to buy a
bigger lustre system)



On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
<jaquilina at eagleeyet.net> wrote:
>
> Good Morning all,
>
>
>
> I am working on a project that, sadly, I can't go into in much detail, but quite large amounts of data will be ingested by this system and would need to be efficiently returned as output to the end user in around 10 minutes or so. I am in discussions with another partner involved in this project about the best way forward.
>
>
>
> For me, given the amount of data (and it is a huge amount of data), an RDBMS such as PostgreSQL would be a major bottleneck. Another thing that was considered is flat files, and I think the best fit for that would be a Hadoop cluster with HDFS. But in the case of HPC, how can such an environment help with ingesting and analyzing large amounts of data? Would said flat files be put on a SAN/NAS or something and accessed through an NFS share for computational purposes?
>
>
>
> Regards,
>
> Jonathan
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

