[Beowulf] Suggestions for which DFS to use

Tony Brian Albers tba at kb.dk
Tue Feb 14 04:57:55 PST 2017

On 2017-02-13 20:45, Ellis H. Wilson III wrote:
> On 02/13/17 14:00, Greg Lindahl wrote:
>> On Mon, Feb 13, 2017 at 07:55:43AM +0000, Tony Brian Albers wrote:
>>> Hi guys,
>>> So, we're running a small (as in a small number of nodes (10), not
>>> storage (170 TB)) Hadoop cluster here. Right now we're on IBM Spectrum
>>> Scale (GPFS), which works fine and has POSIX support. On top of GPFS we
>>> have the GPFS transparency connector so that HDFS uses GPFS.
>> I don't understand the question. Hadoop comes with HDFS, and HDFS runs
>> happily on top of shared-nothing, direct-attach storage. Is there
>> something about your hardware or usage that makes this a non-starter?
>> If so, that might help folks make better suggestions.
> I'm guessing the "POSIX support" is the piece that's missing with a
> native HDFS installation.  You can kinda-sorta get a form of it with
> plug-ins, but it's not a first-class citizen as it is in most DFSs, and
> when I used it last it was not performant.  Native HDFS makes large
> datasets expensive to work with in anything but Hadoop-ready (largely
> MR) applications.  If there is a mixed workload, having a filesystem
> that can support both POSIX access and HDFS /without/ copies is
> invaluable.  With extremely large datasets (170 TB is not that huge
> anymore), copies may be a non-starter.  With dated codebases or
> applications that don't fit the MR model, complete movement to HDFS may
> also be a non-starter.
> The questions I feel need to be answered here to get good answers rather
> than a shotgun full of random DFSs are:
> 1. How much time and effort are you willing to commit to setup and
> administration of the DFS?  For many completely open source solutions
> (Lustre and HDFS come to mind), setup and, more critically, maintenance
> can become quite heavyweight, and performance tuning can grow to
> summer-grad-student-internship level.
> 2. Are you looking to replace the hardware, or just the DFS?  These
> days, 170 TB is at the fringes (IMHO) of what can fit reasonably into a
> single (albeit rather large) box.  It wouldn't be completely unthinkable
> to run all of your storage with ZFS/BTRFS, a very beefy server,
> redundant 10, 25, or 40 GbE NICs, some SSD acceleration, a UPS, and
> plain-jane NFS (or your protocol of choice out of most Linux distros).
> You could even host the HDFS daemons on that node, pointing at POSIX
> paths rather than devices.  But this falls into the category of "host it
> yourself," so that might be too much work.
> 3. How committed to HDFS are you (i.e., what features of it do your
> applications actually leverage)?  Many MapReduce applications actually
> have zero attachment to HDFS whatsoever.  You can reasonably re-point
> them at POSIX-compliant NAS and they'll "just work."  Plus you get
> cross-protocol access to the files without any wizardry, copying, etc.
> HBase is a notable example of where they've built dependence on HDFS
> into the code, but that's more the exception than the norm.
> Best,
> ellis
> Disclaimer: I work for Panasas, a storage appliance vendor.  I don't
> think I'm shamelessly plugging anywhere above as I love when people host
> themselves, but it's not for everybody.
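One thing worth noting on points 2 and 3: the Hadoop FileSystem API is
scheme-based, so hdfs://, file://, and vendor connectors like the GPFS one
are just different implementations behind the same interface, and a job can
often be re-pointed at a POSIX mount simply by changing a URI. A rough
sketch of what I mean (the mount point and paths are made-up examples, not
our setup):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RepointAtPosix {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Use a plain POSIX mount as the default filesystem instead of
            // HDFS.  "/mnt/nas" is a hypothetical NFS mount; point
            // fs.defaultFS at hdfs://namenode:8020/ instead and the same
            // code runs unchanged against HDFS.
            conf.set("fs.defaultFS", "file:///");

            FileSystem fs = FileSystem.get(URI.create("file:///mnt/nas"), conf);
            for (FileStatus status : fs.listStatus(new Path("/mnt/nas/data"))) {
                System.out.println(status.getPath() + "  "
                    + status.getLen() + " bytes");
            }
        }
    }

As far as I know, the daemon-side settings (dfs.namenode.name.dir,
dfs.datanode.data.dir) are likewise ordinary directories, which is what
makes the single-big-box option in point 2 workable.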

1) Pretty much whatever it takes. We have the cluster mentioned above, a
second one running only HBase (for now), and a third that is a storage
cluster for our DSpace installation, which will probably grow to tens of
petabytes within a couple of years. Being able to use the same FS on all
of them would be nice. (Yes, I know, there's probably no Swiss Army knife
- but we are willing to compromise.)

2) Just the DFS (we're having issues with IBM support, and not only on the
DFS).

3) HBase. Doesn't work without HDFS AFAIK.
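Or more precisely, as far as I can tell, HBase wants a Hadoop-compatible
filesystem behind hbase.rootdir, which is what the GPFS transparency
connector provides today - so whatever DFS we pick would need a connector
that speaks that interface. Roughly, the setting in question (hostname and
path are made-up examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HBaseRootdir {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // hbase.rootdir is a Hadoop FileSystem URI; HBase writes through
            // whatever implementation is registered for the URI's scheme
            // (stock HDFS, or a connector such as the GPFS one).
            conf.set("hbase.rootdir", "hdfs://namenode.example.org:8020/hbase");

            System.out.println("HBase root: " + conf.get("hbase.rootdir"));
        }
    }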


Best regards,

Tony Albers
Systems administrator, IT-development
Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark.
Tel: +45 2566 2383 / +45 8946 2316
