
Diving into Ceph: Building a Distributed Storage Cluster that Doesn't Suck

Notes from trying to stand up Ceph in a 3-node lab and why future-me expects pain.

I've been running CloudStack for a while, but the storage situation has been... let's just say less than ideal. I was running plain RAID-6 on each of the servers, and on one machine a single drive failure cascaded into five drive failures during the rebuild.

I lost a bunch of data, but I had the important stuff like configs backed up to Wasabi, and I decided I needed a real storage solution or this was going to be an ongoing problem. As an added bonus, moving to a distributed, SAN-style storage layout would open the door to things like near-instant live migrations, easier maintenance on the KVM hosts, and no longer depending on finicky, slow NFS for VM storage.

For those not familiar, Ceph is an open-source distributed storage system that provides object, block, and file storage in a single unified platform. It's designed for fault tolerance and can scale to ridiculous levels, which is why big cloud providers use it or something similar. For my modest homelab with 3 Dell R730xd servers, it's absolutely overkill - which is precisely why I wanted to set it up.

The Hardware Setup

First, let me walk through what I'm working with:

  • 3x Dell R730xd servers
  • Each server has a mix of 6TB and 3TB drives
  • A dedicated 10Gb network between the servers for the Ceph traffic

This gives me a raw capacity of about 144TB, but the actual usable space depends on the replication factor and encoding schemes, which I'll get to.

For networking, 10G is plenty for a cluster this size, but I wouldn't even consider running Ceph without dedicated bandwidth for cluster traffic, or on anything less than 10G. I've played with Ceph in the past, before I had access to 10G lab gear, and it was simply unusable even in a small deployment. I've configured separate VLANs and dedicated physical interfaces for the Ceph cluster network. It's not just about bandwidth; it's about consistent latency.

Planning the Cluster

I decided to run monitors and managers on all three nodes for high availability, OSDs on all the data drives, and MDS for a shared filesystem that I might need later. I just wanted the full suite of options configured early so implementing them later wouldn't require me to dig back into the storage configuration and risk borking something and losing data.
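The daemon layout above, plus the split networks from the previous section, ends up encoded in a pretty small ceph.conf. A sketch of what that looks like, with placeholder fsid, hostnames, and subnets rather than my actual values:

```ini
# Placeholder values throughout -- the fsid, hostnames, and subnets
# are illustrative, not the cluster's real ones.
[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_initial_members = node1, node2, node3
mon_host = 10.0.0.11, 10.0.0.12, 10.0.0.13
public_network = 10.0.0.0/24
# Dedicated 10G VLAN so replication and recovery traffic never
# competes with client I/O:
cluster_network = 10.0.1.0/24
```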

Installation Hell

Installation was... interesting. I initially tried to use ceph-deploy, but quickly switched to using the native Debian packages with manual configuration. My first attempt ended with this beautiful error:

Error ENOENT: all MONs failed to join, quorum still forming

After a couple hours of troubleshooting, I realized I had a clock synchronization issue - the nodes were off by just enough to cause problems. Setting up chrony properly fixed that:

apt install chrony
echo 'server 0.pool.ntp.org iburst' >> /etc/chrony/chrony.conf
echo 'server 1.pool.ntp.org iburst' >> /etc/chrony/chrony.conf
systemctl restart chrony

Once I had the monitors up and running, I created the OSDs. I decided to use a 3x replication factor for most pools, which means each piece of data is stored on three different OSDs. This gives me fault tolerance against drive failures or even an entire node going down.

 
# For each drive on each node
ceph-volume lvm create --data /dev/sdX

Doing this 36 times (12 drives × 3 nodes) was tedious, so I wrote a quick script:

for disk in b c d e f g h i j k l m; do
    ceph-volume lvm create --data "/dev/sd$disk"
done
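Since I ended up holding some drives back (more on that below), the variant I'd actually reach for skips a reserved list and prints what it's about to do before doing it. A dry-run sketch, where the drive letters and the reserved set are illustrative:

```shell
# Dry-run variant of the loop above: skip drives held back for future
# expansion and just print the commands. Drop the echo to execute.
# Drive letters and the reserved set are illustrative.
reserved="l m"
for disk in b c d e f g h i j k l m; do
    case " $reserved " in
        *" $disk "*) continue ;;
    esac
    echo ceph-volume lvm create --data "/dev/sd$disk"
done
```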

The Moment of Truth

After setting up all the OSDs and letting the cluster rebalance, I ran ceph -s and finally saw:

cluster:
id: 17d60878-3920-11f0-99e0-90e2badecc10
health: HEALTH_OK

Never have I been so happy to see "HEALTH_OK" in my life. The full status now shows:

  • 3 monitor daemons in quorum
  • 3 manager daemons (1 active, 2 standby)
  • 2 MDS daemons active with 1 standby
  • 32 OSDs all up and in (I held 4 of the drives back for future expansion)
  • 2 RGW daemons for S3-compatible storage
  • 145TB raw capacity with 127TB free (usable space is less than that once replication overhead is taken into account)

Creating Pools and Integration with CloudStack

I didn't really need many pools, since segmenting storage is going to happen at the CloudStack level, but of course I needed to give CloudStack what it expects.

ceph osd pool create images 128 128
ceph osd pool create volumes 128 128
ceph osd pool create backups 64 64
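Recent Ceph releases also want each pool tagged with the application that will use it (otherwise `ceph -s` raises a POOL_APP_NOT_ENABLED health warning), and I like pinning the replication policy explicitly rather than trusting defaults. A dry run of the follow-up commands; drop the echo to apply them on a mon node:

```shell
# Print the follow-up settings for each replicated pool: tag the rbd
# application and make the 3-copy / 2-minimum write policy explicit.
for pool in images volumes backups; do
    echo ceph osd pool application enable "$pool" rbd
    echo ceph osd pool set "$pool" size 3
    echo ceph osd pool set "$pool" min_size 2
done
```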

And for my media collection (movies and TV shows), I decided to use erasure coding instead of replication:

ceph osd erasure-code-profile set media k=6 m=2
ceph osd pool create media 64 64 erasure media
ceph osd pool set media allow_ec_overwrites true

(Note that the erasure-code profile has to exist before the pool that references it, so the profile comes first.)

I went with a 6+2 erasure coding scheme, which means each object is split into 6 data chunks plus 2 coding chunks. This configuration can survive losing any 2 OSDs without data loss, but uses only about 1.33x the original data size instead of 3x with replication. For my media files, which are large, read-mostly, and not performance-critical, this is perfect: I get more usable space without sacrificing too much redundancy.
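The space math is easy to sanity-check: overhead for a k+m profile is (k+m)/k, and comparing that against 3x replication on this cluster's roughly 145TB of raw capacity shows why erasure coding wins for bulk media:

```shell
# Capacity overhead for a k+m erasure-code profile is (k+m)/k; compare
# the 6+2 profile against 3x replication for ~145TB raw.
awk 'BEGIN {
    k = 6; m = 2; raw = 145
    printf "6+2 EC overhead: %.2fx\n", (k + m) / k
    printf "usable at 6+2:   ~%.0f TB\n", raw * k / (k + m)
    printf "usable at 3x:    ~%.0f TB\n", raw / 3
}'
```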

For CloudStack integration, I went with RBD (RADOS Block Device) since it provides block storage that CloudStack can use for VM disks. The integration was fairly straightforward:

  1. Install the RBD client on all CloudStack hosts
  2. Configure RBD credentials and pool information in CloudStack
  3. Create a primary storage in CloudStack pointing to the Ceph cluster
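For step 1, the host-side prep is small. A dry-run sketch, where "mon1" is a placeholder for one of the monitor hosts and paths assume the default /etc/ceph layout:

```shell
# Print the per-KVM-host prep: install the RBD client bits and copy
# the cluster config over. Drop the echos to run for real; "mon1" is
# a placeholder hostname.
echo apt install -y ceph-common
echo scp mon1:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
```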

The most annoying part was getting the permissions right. I had to create a specific Ceph user for CloudStack:

ceph auth get-or-create client.cloudstack mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=images'

Performance Testing

I ran some performance tests using fio on an actual VM with an RBD-backed disk:

fio --name=test --ioengine=libaio --direct=1 --bs=4m --iodepth=16 --size=10g --rw=write

Here are the actual results:

test: (groupid=0, jobs=1): err= 0: pid=1071090: Tue Oct 28 17:50:01 2025
write: IOPS=57, BW=229MiB/s (240MB/s)(10.0GiB/44706msec); 0 zone resets
slat (usec): min=260, max=4469, avg=726.65, stdev=190.97
clat (msec): min=72, max=1033, avg=278.51, stdev=133.45
lat (msec): min=72, max=1034, avg=279.24, stdev=133.45

Honestly, 229MB/s sustained write isn't terrible for a 3x replicated pool over 10G with consumer SATA drives. I was initially hoping for more like 300-350MB/s, but considering each write is actually writing 3x the data across the network, it's not bad. The latency numbers (avg 279ms) aren't great, but that's expected for replicated distributed storage.

What's most interesting to me is the latency distribution:

clat percentiles (msec):
| 1.00th=[ 94], 5.00th=[ 125], 10.00th=[ 142], 20.00th=[ 171],
| 30.00th=[ 197], 40.00th=[ 220], 50.00th=[ 249], 60.00th=[ 279],
| 70.00th=[ 321], 80.00th=[ 372], 90.00th=[ 451], 95.00th=[ 527],
| 99.00th=[ 718], 99.50th=[ 860], 99.90th=[ 961], 99.95th=[ 995],
| 99.99th=[ 1036]

The 99th percentile at 718ms means there are some occasional slow operations, likely when the cluster is doing background work or when writes hit certain OSDs harder than others. For most workloads this won't matter, but it might impact latency-sensitive applications.

The IOPS are low at just 57, but that's expected with 4MB block size. This configuration is optimized for throughput, not IOPS. For most VM workloads, I'll be using much smaller block sizes which should yield higher IOPS at the cost of throughput.

Current Status

The cluster has been running for about a week now, and it's been rock solid. I've migrated most of my VMs to use Ceph-backed storage, and I'm in the process of setting up automated backups using RBD snapshots.
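The snapshot-based backup flow I'm building amounts to: snapshot each image, then export the delta. A dry-run sketch of the loop; the pool, image names, and destination paths are illustrative, and the echos get dropped to execute for real:

```shell
# Print the backup commands: one snapshot per RBD image, then an
# export-diff of that snapshot to a backup directory. Image names
# and paths are illustrative.
pool=volumes
snap="backup-$(date +%Y%m%d)"
for image in vm-101-disk-0 vm-102-disk-0; do
    echo rbd snap create "$pool/$image@$snap"
    echo rbd export-diff "$pool/$image@$snap" "/backups/$image-$snap.diff"
done
```

A follow-up pass could feed `--from-snap` to export-diff so only changes since the previous day's snapshot get shipped, but I haven't settled on retention yet.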

One thing that caught me completely off guard was Ceph's insatiable appetite for RAM. Ceph is consistently using around 20GB of RAM on each of the cluster nodes during normal operation, but during large operations like rebalancing or recovery, that number spikes to 32GB or more. Each OSD seems to want 2-4GB for itself, and when you multiply that by 32 OSDs, it adds up fast. The BlueStore backend that Ceph now uses by default is significantly more RAM-hungry than the old FileStore, which is something I didn't fully appreciate during planning. Each of the servers has 64GB of RAM, and Ceph is probably going to warrant an upgrade, but RAM for these older machines is super fucking expensive. Despite the resource consumption, the memory is clearly being put to good use for caching and metadata handling, which helps explain those performance numbers.
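If the RAM pressure gets worse before I can afford the upgrade, the knob I'd reach for is `osd_memory_target`, which caps how much each BlueStore OSD tries to use for its cache (the default is 4GB). A possible override, as a config sketch rather than something I've committed to:

```ini
# Cap each OSD's cache target at 3 GiB (value in bytes; default is
# 4 GiB). Trades some cache hit rate for headroom on 64GB nodes.
[osd]
osd_memory_target = 3221225472
```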

Current stats:

  • ~11TB of data stored
  • 3.03M objects
  • 17TB used, 127TB available

I'm still turning knobs and dicking with parameters, particularly around PG autoscaling and the CRUSH map, but overall I'm satisfied with how it's working. Having distributed storage is worth the marginal performance cost; I haven't benchmarked it head-to-head, but honestly it feels faster than the old RAID-plus-NFS setup.

In a future post, I'll dive deeper into how CloudStack is using this storage and the automation I've set up around it. For now, I'm just happy to have a storage solution that doesn't make me nervous every time I hear a drive click.
