[Beowulf] Torrents for HPC
Bernd Schubert
bs_lists at aakef.fastmail.fm
Thu Jun 14 09:14:27 PDT 2012
On 06/13/2012 11:59 PM, Bill Broadley wrote:
> On 06/13/2012 06:40 AM, Bernd Schubert wrote:
>> What about an easy-to-set-up cluster file system such as FhGFS?
>
> Great suggestion. I'm all for a generally useful parallel file system
> instead of a torrent solution with a very narrow use case.
>
>> As one of
>> its developers I'm a bit biased of course, but then I'm also familiar
>
> I think this list is exactly the place where a developer should jump in
> and suggest/explain their solutions as they relate to use in HPC clusters.
>
>> with Lustre, and I think FhGFS is far easier to set up. We also do not
>> have the problem of running clients and servers on the same node, and
>> some of our customers make heavy use of that and use their compute
>> nodes as storage servers. That should provide the same or better
>> throughput as your torrent system.
>
> I found the wiki, the "view flyer", FAQ, and related.
>
> I had a few questions, I found this link
> http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#ha_support but was not
> sure of the details.
>
> What happens when a metadata server dies?
>
> What happens when a storage server dies?
Right, those are exactly the two issues we are actively working on at the
moment. The current release relies on hardware RAID, but later this year
there will be metadata mirroring, and data mirroring will follow after
that.
>
> If either of the above results in data loss/failure/unreadable files, is
> there a description of how to protect against this with drbd+heartbeat
> or equivalent?
Over the next few weeks we will test fhgfs-ocf scripts for an HA
(Pacemaker) installation. As we are being paid for that installation, I do
not know yet when we will make those scripts publicly available.
Generally, drbd+heartbeat as a mirroring solution is possible.
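To give a rough idea (host names, devices and addresses below are just
placeholders, not taken from our docs), a minimal drbd resource for
mirroring a metadata partition between two servers could look like:

    resource fhgfs_meta {
      protocol C;                    # synchronous replication
      on meta01 {
        device    /dev/drbd0;        # replicated device holding the metadata fs
        disk      /dev/sdb1;         # local backing partition
        address   10.0.0.1:7789;
        meta-disk internal;
      }
      on meta02 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }

You would then create the local file system on /dev/drbd0, point the
metadata daemon at it, and let heartbeat/pacemaker move the DRBD primary
together with the fhgfs-meta service between the two nodes.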
>
> Sounds like source is not available, and only binaries for CentOS?
Well, RHEL5/RHEL6-based, SLES10/SLES11, and Debian. And sorry, the
server daemons are not open source yet. I think the more people ask to
open them, the faster this process will be, especially if those people
are also going to buy support contracts :)
>
> Looks like it does need a kernel module, does that mean only old 2.6.X
> CentOS kernels are supported?
Oh, on the contrary. We basically support any kernel from 2.6.16
onwards. Even support for the most recent vanilla kernel is usually
in place within a few weeks of its release.
>
> Does it work with mainline ofed on qlogic and mellanox hardware?
It definitely works with both, using RDMA (ibverbs) transfers.
As QLogic has some problems with ibverbs, we cooperated with QLogic to
improve performance on their hardware. Recent QLogic OFED stacks do
include the performance fixes.
Please also see
http://www.fhgfs.com/wiki/wikka.php?wakka=NativeInfinibandSupport
for (QLogic) tuning advice.
>
> From a sysadmin point of view I'm also interested in:
> * Do blocks auto balance across storage nodes?
Actually, files are balanced. The default file stripe count is 4, but it
can be adjusted by the admin. So assuming you had only one target per
server, a large file would be distributed over 4 nodes. The default
chunk size is 512 kB; for files smaller than that there is no striping
overhead.
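To put numbers on it: with those defaults, a 2 MB file ends up as four
512 kB chunks, one on each of the 4 targets, while a 100 kB file lives
entirely on a single target.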
> * Is managing disk space, inodes (or equiv) and related capacity
> planning complex? Or does df report useful/obvious numbers?
Hmm, right now (unix) "df -i" does not report inode usage for fhgfs yet.
We will fix that in later releases.
At least for traditional storage servers we recommend ext4 on
metadata partitions for performance reasons. For storage partitions we
usually recommend XFS, again for performance.
Also, storage and metadata can live on the very same partition; you just
need to configure the path where to find that data in the corresponding
config files.
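(Purely from memory of the shipped config templates, so please
double-check the comments in the files themselves, the relevant keys look
something like:

    # fhgfs-meta.conf
    storeMetaDirectory    = /data/fhgfs/meta
    # fhgfs-storage.conf
    storeStorageDirectory = /data/fhgfs/storage

where both paths may well point into the same local partition.)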
If you are going to use all your client nodes as fhgfs servers and those
already have an XFS scratch partition, XFS is probably also fine. However,
due to a severe XFS performance issue, you either need a kernel that has
this issue fixed or you should disable metadata-as-xattr
(in fhgfs-meta.conf: storeUseExtendedAttribs = false).
Also, please see here for a discussion and benchmarks:
http://oss.sgi.com/archives/xfs/2011-08/msg00233.html
Christoph Hellwig later fixed the unlink issue, and the patch should be
in all recent linux-stable kernels. I have not checked RHEL5/RHEL6,
though.
Anyway, if you are going to use ext4 on your metadata partition, you need
to make sure yourself that you have sufficient inodes available. Our wiki
has recommendations for mkfs.ext4 options.
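Just as an illustration (these are not the exact values from the wiki),
inode density is set at mkfs time and can be checked afterwards with
"df -i" on the underlying partition:

    # roughly one inode per 4 kB of space instead of the ext4 default
    mkfs.ext4 -i 4096 /dev/sdb1
    # later: watch inode usage on the underlying metadata partition
    df -i /data/fhgfs/meta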
> * Can storage nodes be added/removed easily by migrating on/off of
> hardware?
Adding storage nodes on the fly works perfectly fine. Our fhgfs-ctl tool
also has a mode to migrate files off a storage node. However, right now we
really recommend not doing that while clients are writing to the file
system: we do not lock files in migration yet, so a client might write to
unlinked files, which would result in silent data loss. On-the-fly data
migration is on our todo list, but I cannot say yet when it will come.
If you are going to use your clients as storage nodes, you could specify
the local system as the preferred target to write files to. That would
make it easy to remove that node later...
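From memory, and please verify the exact mode and option names with
"fhgfs-ctl --help", draining a storage node looks roughly like:

    # hypothetical invocation: move all file chunks off the given node
    fhgfs-ctl --mode=migrate --nodeid=<storage node ID> /mnt/fhgfs

run while the clients are idle, for the reason explained above.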
> * Is FhGFS handle 100% of the distributed file system responsibilities
> or does it layer on top of xfs/ext4 or related? (like ceph)
Like ceph, it layers on top of other local file systems, such as xfs or
ext4.
> * With large files does performance scale reasonably with storage
> servers?
Yes, and you can also adjust the stripe count to your needs. The default
stripe count is 4, which provides approximately the performance of 4
storage targets.
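The striping pattern is set per directory; again from memory (check
"fhgfs-ctl --help" for the exact syntax), something like:

    # hypothetical example: 8 targets and 1 MB chunks for this directory
    fhgfs-ctl --mode=setpattern --numtargets=8 --chunksize=1m /mnt/fhgfs/bigfiles

As far as I remember, this only affects newly created files; existing
files keep their pattern.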
> * With small files does performance scale reasonably with metadata
> servers?
Striping over different metadata servers is done on a per-directory
basis. As most users and applications work in different directories,
metadata performance usually scales linearly with the number of
metadata servers.
Please note: our wiki has tuning advice for metadata performance, and
with our next major release we should also see greatly improved metadata
performance.
Hope that helps, and please let me know if you have further questions!
Cheers,
Bernd
PS: We have a GUI, which should help you try it out within a few
minutes. Please see here:
http://www.fhgfs.com/wiki/wikka.php?wakka=GUIbasedInstallation