[Beowulf] Large amounts of data to store and process
Jonathan Engwall
engwalljonathanthereal at gmail.com
Mon Mar 4 06:10:47 PST 2019
What does your overall design look like?
On Mon, Mar 4, 2019, 5:19 AM Jonathan Aquilina <jaquilina at eagleeyet.net>
wrote:
> Hi Michael,
>
> As previously mentioned, we don't really need to have anything indexed,
> so I am thinking flat files are the way to go; my only concern is the
> performance of large flat files. Isn't that what HDFS is for, dealing
> with large flat files?
>
> On 04/03/2019, 14:13, "Beowulf on behalf of Michael Di Domenico" <
> beowulf-bounces at beowulf.org on behalf of mdidomenico4 at gmail.com> wrote:
>
> even though you've alluded to this being time series data: is there a
> requirement that you have to index into the data, or do you just read
> the data end-to-end and do some calculations?
>
> i routinely face these kinds of issues, but we're not indexing into the
> data, so having things in hdfs or an rdbms doesn't give us any benefit.
> we pull all the data into organized flat files and blow through them
> with HTCondor. if the researcher wants to tweak the code, they do, and
> then just rerun the whole simulation.
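>
> (for concreteness, a minimal sketch of that kind of end-to-end pass in
> python; the whitespace-delimited "timestamp value" record layout and the
> one-file-per-job setup are assumptions for illustration, not anyone's
> actual format:
>
>     import sys
>
>     def scan(path):
>         """Stream one flat file front to back, keeping a running sum."""
>         count, total = 0, 0.0
>         with open(path) as fh:
>             for line in fh:
>                 _timestamp, value = line.split()  # assumed two-column records
>                 total += float(value)
>                 count += 1
>         return total / count if count else 0.0
>
>     if __name__ == "__main__":
>         # one file per invocation; a scheduler such as HTCondor can fan
>         # this out as one job per file
>         print(scan(sys.argv[1]))
>
> each job just gets a different file as its argument, so rerunning after a
> code tweak is a resubmit, nothing more.)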
>
> sometimes that's minutes, sometimes days. but in either case the time
> to develop code is always much shorter, because the data is in flat
> files and easier for my "non-programmer" programmers. no need to
> learn hdfs/hadoop or sql.
>
> if you need to index the data and jump around, hdfs is probably still
> not the best solution unless you want to index the files, and 250gb isn't
> really big enough to warrant an hdfs cluster anyway. i've generally found
> that unless you're dealing with multi-TB+ datasets, you can't scale the
> hardware out enough to get the speedup. (yes, i know there are tweaks to
> change this, but i've found it's just simpler to buy a bigger lustre
> system)
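>
> (if you do end up needing to jump around, a homemade side index of byte
> offsets over the flat files goes a long way before hdfs is worth it. a
> hypothetical sketch; the sampling interval and ".idx" naming are made up:
>
>     import json
>
>     def build_index(path, every=100_000):
>         """Record the byte offset of every Nth record, keyed by timestamp."""
>         index = {}
>         with open(path, "rb") as fh:
>             n = 0
>             while True:
>                 pos = fh.tell()
>                 line = fh.readline()
>                 if not line:
>                     break
>                 if n % every == 0:
>                     index[line.split()[0].decode()] = pos
>                 n += 1
>         with open(path + ".idx", "w") as out:  # ".idx" naming is made up
>             json.dump(index, out)
>
> a reader then seek()s to the nearest recorded offset and scans forward,
> instead of reading the whole file.)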
>
>
>
> On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
> <jaquilina at eagleeyet.net> wrote:
> >
> > Good Morning all,
> >
> >
> >
> > I am working on a project that I sadly can't go into much detail about,
> > but there will be quite large amounts of data ingested by this system,
> > and the output would need to be efficiently returned to the end user in
> > around 10 minutes or so. I am in discussions with another partner
> > involved in this project about the best way forward.
> >
> >
> >
> > For me, given the amount of data (and it is a huge amount of data), an
> > RDBMS such as PostgreSQL would be a major bottleneck. Another option
> > that was considered was flat files, and I think the best fit for that
> > would be a Hadoop cluster with HDFS. But in the case of HPC, how can
> > such an environment help in terms of ingesting and analysing large
> > amounts of data? Would said flat files be put on a SAN/NAS or something
> > and accessed through an NFS share for computational purposes?
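> >
> > (for concreteness, one common shape for that: keep the flat files on the
> > shared mount and have each worker read a disjoint byte range. a hedged
> > sketch in python; the mount path, worker count, and two-column record
> > format are all hypothetical:
> >
> >     import os
> >     from multiprocessing import Pool
> >
> >     DATA = "/mnt/nfs/project/series.dat"  # hypothetical NFS-mounted path
> >     NWORKERS = 8
> >
> >     def scan_range(bounds):
> >         """Sum the records whose lines begin inside this worker's slice."""
> >         start, end = bounds
> >         count, total = 0, 0.0
> >         with open(DATA, "rb") as fh:
> >             fh.seek(start)
> >             if start:
> >                 fh.readline()  # partial line: the previous worker owns it
> >             while fh.tell() <= end:
> >                 line = fh.readline()
> >                 if not line:
> >                     break
> >                 _timestamp, value = line.split()
> >                 total += float(value.decode())
> >                 count += 1
> >         return count, total
> >
> >     if __name__ == "__main__":
> >         size = os.path.getsize(DATA)
> >         step = size // NWORKERS
> >         bounds = [(i * step, (i + 1) * step) for i in range(NWORKERS)]
> >         bounds[-1] = (bounds[-1][0], size)  # last slice absorbs the remainder
> >         with Pool(NWORKERS) as pool:
> >             counts, totals = zip(*pool.map(scan_range, bounds))
> >         print(sum(totals) / sum(counts))
> >
> > whether that outruns a single streaming pass depends entirely on what the
> > NFS server and network can sustain, which is where the SAN/NAS question
> > really gets decided.)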
> >
> >
> >
> > Regards,
> >
> > Jonathan
> >