<div dir="ltr"><div><div><div>Hi,<br><br></div>A bit late to the discussion, but I am currently setting up Ceph for our cluster storage and wanted to throw in my 2 cents.<br><br></div><div>It is important to realize that Ceph provides many layers of storage services. At the lowest level we have the object storage layer (RADOS) which can be accessed through the "RADOS Gate Way" (RGW) using an S3/Swift compatible REST API. On top of the object storage there is the block device layer "RADOS Block Device" (RBD) and the filesystem layer CephFS. Out of these, CephFS is the only one that is not widely regarded as production ready (though many people find it suitable for their needs already).<br><br></div>I am planning to use the block layer to provide storage for virtual machines that in turn provide various storage services (mostly NFS and Samba but also more specialized storage). Storage performance for any single job in our cluster is less of a concern than aggregate performance, which makes it acceptable to extract additional concurrency by splitting data across these virtual machines. We can essentially provide a storage VM for each group and thus avoid issues with one group overloading the storage system and causing problems for other users. The block devices can be thin provisioned which allows you to allocate a large XFS or ext4 file system but only store blocks that are actually used. We also get cheap COW snapshots for short term backups and as the available space goes down we can scale storage and performance by just adding nodes.</div><div><br></div><div>The big downside to this approach is that high availability in the storage services needs to be provided through traditional failover techniques. This is one of the places where CephFS would be a huge improvement.</div><div><br></div><div>Perhaps most importantly, regardless of which storage layer you are using, Ceph is extremely flexible in how data is ultimately stored. Intelligent data placement with respect to failure domains (using the CRUSH algorithm) is an integral part of Ceph. You can ensure that different replicas of each object are stored on different servers/racks/switches/etc. You can use erasure coding instead replication to boost storage efficiency, essentially providing distributed parity RAID. There are even erasure codes that allow single drive errors to be fixed within a single server/rack/etc to reduce network load. You can use a small pool of fast drives to cache large pool of slow drives. This flexibility is really the most interesting aspect of Ceph to me.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 15, 2015 at 6:56 AM, Olli-Pekka Lehto <span dir="ltr"><<a href="mailto:olli-pekka.lehto@csc.fi" target="_blank">olli-pekka.lehto@csc.fi</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 15 Apr 2015, at 06:50, Mark Hahn <<a href="mailto:hahn@mcmaster.ca">hahn@mcmaster.ca</a>> wrote:<br>
>
>>> In an environment that needs to adapt to evolving user needs, trading some
>>> performance for the flexibility that Ceph offers does not seem like a bad
>>> deal.
>>
>> it would be appreciated if you could be a bit more specific. what kind of performance, what kind of flexibility?
>>
>> thanks, mark hahn.
>
> Sure!
>
> To give some background, we have two types of environments with different granularity of funding and customer base:
>
> 1. HPC environment:
> We get a big chunk of funding every few years that needs to be invested within a limited time. The need is for fast parallel storage, hence big, enterprise-class storage boxes with Lustre. The system and SLA will remain fairly static for several years. Growth is fairly predictable.
>
> 2. Cloud environment:
> Ongoing streams of small-to-medium funding from various customers. Some of these can be sold services and some need to show an investment for the research-granting organization. The price-performance-resilience-capacity requirements might differ from customer to customer. Growth is unpredictable.
>
> For the first case the Lustre model works fine, but for the latter it can be a bit more constrained: we should be able to grow our compute and storage capacity smoothly even in cases where the funding is fine-grained, while keeping the architecture simple. The workload profiles and resiliency requirements of future workloads are also not completely clear.
>
> With Ceph we can scale storage in a way that is more akin to how we scale compute nodes: we can throw more nodes at it to make it grow in a fairly linear fashion and with fine granularity. We can also adjust resiliency parameters in software instead of having a large part of them fixed in the hardware design.
>
> I don't see Lustre going away, at least in our environments, anytime soon, and we have not yet done any real apples-to-apples performance comparisons. Initially we're not targeting huge scalability or performance; something better than NFS is good enough to start with.
>
> It will also be interesting to see how the resiliency compares. Having experienced multiple generations of expensive "invincible" arrays with issues that baffle us (and often the vendors) time after time, something built from cheaper but more decoupled hardware might turn out to be better.
<span class="HOEnZb"><font color="#888888"><br>
O-P<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
<br>
<br>
<br>
_______________________________________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
</div></div></blockquote></div><br></div>
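
Here are the sketches I mentioned above. First, the object layer: besides the S3/Swift REST API through RGW, RADOS can also be driven directly through librados. This is a minimal sketch using the python-rados bindings; the pool name "data" and the ceph.conf path are placeholders of mine, not anything from this thread.

    import rados

    # Connect using the local ceph.conf (and the keyring it points at).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')  # placeholder pool name
        try:
            # Write a named object into the pool and read it back.
            ioctx.write_full('hello_object', b'hello from librados')
            print(ioctx.read('hello_object'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()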
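
Second, the block layer I plan to put the storage VMs on. A rough sketch with the python-rbd bindings of creating a thin-provisioned image and taking a copy-on-write snapshot; the pool and image names are again made up for illustration.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # placeholder pool name

    # A 100 GiB image; RADOS objects are only allocated as blocks get
    # written, which is what makes the image thin provisioned.
    rbd.RBD().create(ioctx, 'group1-nfs', 100 * 1024 ** 3)

    # Cheap copy-on-write snapshot for short-term backups.
    image = rbd.Image(ioctx, 'group1-nfs')
    image.create_snap('before-upgrade')
    print([snap['name'] for snap in image.list_snaps()])
    image.close()

    ioctx.close()
    cluster.shutdown()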
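
Finally, on adjusting placement and resiliency in software: everything the ceph CLI does goes through the monitor command interface, which librados also exposes. The command prefixes and argument names below are my best guess and differ a bit between releases, so treat this purely as an illustration of the idea rather than something to paste into production.

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    def mon(cmd):
        # mon_command takes a JSON-encoded command and returns
        # (return code, output buffer, status string).
        ret, out, status = cluster.mon_command(json.dumps(cmd), b'')
        print(ret, status)
        return out

    # Create a replicated pool and set its replica count; which hosts,
    # racks, etc. the replicas land on is decided by the pool's CRUSH rule.
    mon({"prefix": "osd pool create", "pool": "vm-images", "pg_num": 128})
    mon({"prefix": "osd pool set", "pool": "vm-images", "var": "size", "val": "3"})

    # Erasure-coded pools and cache tiers are configured the same way,
    # through the "osd erasure-code-profile set" and "osd tier" commands.

    cluster.shutdown()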