Ceph @ BIT

Stefan Kooman, stefan@bit.nl, @basseroet BIT

Agenda

From Storage vendor to Open Source Storage
Requirements for new storage solution
Ceph to the rescue!
What will we use Ceph for?
Timeline Ceph @ BIT
Design production cluster
Build the cluster
Development challenges

From Storage vendor to Open Source Storage

Reasons to migrate

Performance (lack thereof)
Does not scale horizontally
Lots of critical bugs
Lots of maintenance
Not happy with support

Looking for alternatives

Laws of the Widos 1/2

Wido Potters (Manager CCS @ BIT):

1) The first 6 months you will be in love with your new storage from vendor X
2) After 6 months cracks arise in the relationship
3) After one year you are totally fed up with your storage solution

Functional Requirements

Redundant Block Storage (iSCSI)
Redundant File Storage (NFS)
Redundant Object Storage (S3)
Cluster remains functional in case of a data center failure
Low maintenance
Stable platform
Horizontally scalable (makes sense for economics too)

Considerations

Eat your own dog food
No difference between BIT / Customers

Ceph to the rescue!

Redundant Block Storage: RBD
Redundant File Storage: CephFS
Redundant Object Storage: RGW

Hey, we need support for legacy applications

NFS Ganesha on top of CephFS
SUSE iSCSI Gateway (lrbd) on top of Ceph RBD
Ceph iSCSI Gateway on top of Ceph RBD

Ceph @ BIT Timeline

Ceph Test cluster

(I made this mess, don't blame DC management)

Ceph Test Cluster

Useful to test switches
Get familiar with Ceph
Test Ceph upgrades
Test failures
Plan for production

Design Production Cluster

IPv6 only (Ceph is not dual stack)
Access to Ceph storage through BIT Access Network
Triple replicas: Datacenter BIT-1, BIT-2A, BIT-2C
BGP Routed
BGP ECMP across links (easy scaling of bandwidth)
BGP EVPN (VXLAN)
Redundant paths between DCs (two separate fiber rings)
No better kill than overkill! :p

Ceph Production Cluster

Build Production Cluster

Laws of the Widos 2/2

Ceph IPv6
one (public) network
one interface
one IPv6 address
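The three points above map onto a very small piece of Ceph configuration. The sketch below prints a `[global]` fragment for an IPv6-only cluster with a single public network and no separate cluster network. The helper name `ceph_ipv6_fragment` is hypothetical, and the prefixes come from the 2001:db8::/32 documentation range; they are not BIT's real addressing.

```shell
#!/bin/sh
# Print a ceph.conf [global] fragment for an IPv6-only cluster.
# $1: public network prefix (default is a documentation prefix, an assumption).
# No cluster_network is set, so replication traffic shares the public network.
ceph_ipv6_fragment() {
    public_network="${1:-2001:db8::/64}"
    cat <<EOF
[global]
ms_bind_ipv6 = true
public_network = ${public_network}
EOF
}

# Example call with an illustrative prefix
ceph_ipv6_fragment "2001:db8:1::/64"
```

Each node then needs only one address inside that prefix on its single storage interface.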

OOB Management

Linux network namespaces
Complete isolation between network stacks

mgmt namespace


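# Create the management namespace and move the OOB interface into it
/sbin/ip netns add mgmt
/sbin/ip link set $INTERFACE netns mgmt
# Bring up loopback and the OOB interface inside the namespace
/sbin/ip netns exec mgmt /sbin/ip link set lo up
/sbin/ip netns exec mgmt /sbin/ip link set $INTERFACE up
# Placeholder address from the 2001:db8::/32 documentation prefix
/sbin/ip netns exec mgmt /sbin/ip -6 addr add 2001:db8::deadbeef/64 dev $INTERFACE
# Replace any existing default route (for example one learned via router advertisements)
/sbin/ip netns exec mgmt /sbin/ip -6 route del default
/sbin/ip netns exec mgmt /sbin/ip -6 route add default via $GATEWAY6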

Why a separate management network?

Troubleshoot Ceph / network issues through ssh

Always Collect metrics

... to ease future changes

DPDK in Ceph

Make sure daemons live in the right namespace

sshd

rsyslogd

bit-monitoring

telegraf

snappy

more to come?
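A quick way to verify that a daemon ended up in the intended namespace is to ask iproute2 which named namespace its PID belongs to. This is a minimal sketch, not BIT's actual tooling: the function name `check_daemon_ns` is hypothetical, it inspects only the oldest matching process per daemon, and `ip netns identify` normally needs root.

```shell
#!/bin/sh
# Report whether the oldest process of each named daemon runs in the
# expected network namespace. Returns non-zero if any check fails.
# Usage: check_daemon_ns <namespace> <daemon>...
check_daemon_ns() {
    expected="$1"; shift
    rc=0
    for daemon in "$@"; do
        # -o: oldest matching process, -x: exact process-name match
        pid=$(pgrep -o -x "$daemon") || { echo "$daemon: not running"; rc=1; continue; }
        # Prints the named namespace of the PID, or nothing for the default one
        actual=$(ip netns identify "$pid")
        if [ "$actual" = "$expected" ]; then
            echo "$daemon: ok ($expected)"
        else
            echo "$daemon: in '${actual:-default}', expected '$expected'"
            rc=1
        fi
    done
    return "$rc"
}

# Example (process names are assumptions based on the daemons listed above):
#   check_daemon_ns mgmt sshd rsyslogd telegraf
```

`ip netns pids mgmt` gives the reverse view: every PID currently inside the namespace.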

Infra as code

Git (instead of SVN)
Ansible (instead of cfengine2)
GitLab (workflow)

Switch infra

Storage network similar to access network
Provisioned in the same way (Git, Ansible, GitLab)
OOB for switches (Arista) implemented in same way (namespaces)
IPv6 only OOB

Monitoring / Metrics

Near real-time info
Alerting

Monitoring / Metrics

Icinga (alerting)
InfluxDB (storing metrics)
Grafana (viewing metrics)
Telegraf (collecting metrics)
Snappy (collecting Ceph metrics)
AMP (continuously perform active network measurements between a mesh of Ceph nodes)

Ceph tools

ceph -s
ceph osd tree

ceph -s


  cluster:
    id:     7d44bbde-2442-59dc-8c71-e7dad785c99b
    health: HEALTH_WARN
            1 datacenter (21 osds) down
            21 osds down
            3 hosts (21 osds) down
            1 rack (21 osds) down
            Degraded data redundancy: 19469/56142 objects degraded (34.678%), 2048 pgs unclean, 2048 pgs degraded, 2048 pgs undersized
            2/5 mons down, quorum mon5,mon3,mon4

  services:
    mon: 5 daemons, quorum mon5,mon3,mon4, out of quorum: mon1, mon2
    mgr: mon4(active), standbys: mon3, mon5
    osd: 63 osds: 41 up, 62 in
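For alerting it can help to turn the `osd:` line of this output into numbers. The sketch below parses the plain-text format shown above; the function names are hypothetical, and the pattern is an assumption tied to this exact layout. For anything long-lived, `ceph -s --format json` is the more robust input.

```shell
#!/bin/sh
# Read `ceph -s` plain-text output on stdin and print "<total> <up> <in>".
osd_counts() {
    sed -n 's/^ *osd: *\([0-9][0-9]*\) osds: *\([0-9][0-9]*\) up, *\([0-9][0-9]*\) in.*$/\1 \2 \3/p'
}

# Read the same input and print how many OSDs are not up.
# Prints nothing if no osd line was found.
osds_not_up() {
    osd_counts | {
        if read -r total up _in; then
            echo $((total - up))
        fi
    }
}

# Example with the line from the output above
printf '    osd: 63 osds: 41 up, 62 in\n' | osds_not_up
```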

ceph osd tree


ceph osd tree
ID  CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF 
-21       69.29997 region BIT-Ede                                 
-23       23.09999     datacenter BIT-1                           
-26       23.09999         rack rack1                             
 -3        7.70000             host osd1                          
  0   ssd  1.09999                 osd.0    down  1.00000 1.00000 
  1   ssd  1.09999                 osd.1    down  1.00000 1.00000 
  2   ssd  1.09999                 osd.2    down  1.00000 1.00000 
  3   ssd  1.09999                 osd.3    down  1.00000 1.00000 
  4   ssd  1.09999                 osd.4    down  1.00000 1.00000 
  5   ssd  1.09999                 osd.5    down  1.00000 1.00000 
  6   ssd  1.09999                 osd.6    down  1.00000 1.00000 
 -2        7.70000             host osd2                          
  7   ssd  1.09999                 osd.7    down  1.00000 1.00000 
  8   ssd  1.09999                 osd.8    down  1.00000 1.00000 
  9   ssd  1.09999                 osd.9    down  1.00000 1.00000 
 10   ssd  1.09999                 osd.10   down  1.00000 1.00000 
 11   ssd  1.09999                 osd.11   down  1.00000 1.00000 
 12   ssd  1.09999                 osd.12   down  1.00000 1.00000 
 13   ssd  1.09999                 osd.13   down  1.00000 1.00000 
 -5        7.70000             host osd3                          
 14   ssd  1.09999                 osd.14   down  1.00000 1.00000 
 15   ssd  1.09999                 osd.15   down  1.00000 1.00000 
 16   ssd  1.09999                 osd.16   down  1.00000 1.00000 
 17   ssd  1.09999                 osd.17   down  1.00000 1.00000 
 18   ssd  1.09999                 osd.18   down  1.00000 1.00000 
 19   ssd  1.09999                 osd.19   down  1.00000 1.00000 
 20   ssd  1.09999                 osd.20   down  1.00000 1.00000 
-24       23.09999     datacenter BIT-2A                          
-27       23.09999         rack rack2                             
 -6        7.70000             host osd4                          
 21   ssd  1.09999                 osd.21     up  1.00000 1.00000 
 22   ssd  1.09999                 osd.22     up  1.00000 1.00000 
 23   ssd  1.09999                 osd.23     up  1.00000 1.00000 
 24   ssd  1.09999                 osd.24     up  1.00000 1.00000 
 25   ssd  1.09999                 osd.25     up  1.00000 1.00000 
 26   ssd  1.09999                 osd.26     up  1.00000 1.00000 
 27   ssd  1.09999                 osd.27     up  1.00000 1.00000 
 -4        7.70000             host osd5                          
 28   ssd  1.09999                 osd.28     up  1.00000 1.00000 
 29   ssd  1.09999                 osd.29     up  1.00000 1.00000 
 30   ssd  1.09999                 osd.30     up  1.00000 1.00000 
 31   ssd  1.09999                 osd.31     up  1.00000 1.00000 
 32   ssd  1.09999                 osd.32     up  1.00000 1.00000 
 33   ssd  1.09999                 osd.33     up  1.00000 1.00000 
 34   ssd  1.09999                 osd.34     up  1.00000 1.00000 
 -7        7.70000             host osd6                          
 35   ssd  1.09999                 osd.35     up  1.00000 1.00000 
 36   ssd  1.09999                 osd.36     up  1.00000 1.00000 
 37   ssd  1.09999                 osd.37     up  1.00000 1.00000 
 38   ssd  1.09999                 osd.38     up  1.00000 1.00000 
 39   ssd  1.09999                 osd.39     up  1.00000 1.00000 
 40   ssd  1.09999                 osd.40     up  1.00000 1.00000 
 41   ssd  1.09999                 osd.41     up  1.00000 1.00000 
-22       23.09999     datacenter BIT-2C                          
-25       23.09999         rack rack3                             
 -8        7.70000             host osd7                          
 42   ssd  1.09999                 osd.42     up  1.00000 1.00000 
 43   ssd  1.09999                 osd.43     up  1.00000 1.00000 
 44   ssd  1.09999                 osd.44     up  1.00000 1.00000 
 45   ssd  1.09999                 osd.45     up  1.00000 1.00000 
 46   ssd  1.09999                 osd.46     up  1.00000 1.00000 
 47   ssd  1.09999                 osd.47     up  1.00000 1.00000 
 48   ssd  1.09999                 osd.48     up  1.00000 1.00000 
 -9        7.70000             host osd8                          
 49   ssd  1.09999                 osd.49     up  1.00000 1.00000 
 50   ssd  1.09999                 osd.50     up  1.00000 1.00000 
 51   ssd  1.09999                 osd.51     up  1.00000 1.00000 
 52   ssd  1.09999                 osd.52     up  1.00000 1.00000 
 53   ssd  1.09999                 osd.53     up  1.00000 1.00000 
 54   ssd  1.09999                 osd.54     up  1.00000 1.00000 
 55   ssd  1.09999                 osd.55     up  1.00000 1.00000 
-10        7.70000             host osd9                          
 56   ssd  1.09999                 osd.56     up  1.00000 1.00000 
 57   ssd  1.09999                 osd.57     up  1.00000 1.00000 
 58   ssd  1.09999                 osd.58     up  1.00000 1.00000 
 59   ssd  1.09999                 osd.59     up  1.00000 1.00000 
 60   ssd  1.09999                 osd.60     up  1.00000 1.00000 
 61   ssd  1.09999                 osd.61     up  1.00000 1.00000 
 62   ssd  1.09999                 osd.62     up  1.00000 1.00000 
 -1              0 root default
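With a tree this long, a per-host summary of down OSDs is easier to read than the full listing. A minimal sketch, assuming the plain-text column layout shown above (bucket rows have no CLASS value, OSD rows do); the function name `down_per_host` is hypothetical.

```shell
#!/bin/sh
# Read `ceph osd tree` plain output on stdin and print "<host> <down-count>"
# for every host that has at least one OSD marked down.
down_per_host() {
    awk '
        # Bucket rows: ID WEIGHT TYPE NAME (the CLASS column is empty)
        $3 == "host" { host = $4 }
        # OSD rows: ID CLASS WEIGHT NAME STATUS REWEIGHT PRI-AFF
        $4 ~ /^osd\./ && $5 == "down" { down[host]++ }
        END { for (h in down) print h, down[h] }
    ' | sort
}

# Example with a shortened tree; prints "osd1 2"
down_per_host <<'EOF'
 -3        7.70000             host osd1
  0   ssd  1.09999                 osd.0    down  1.00000 1.00000
  1   ssd  1.09999                 osd.1    down  1.00000 1.00000
 -6        7.70000             host osd4
 21   ssd  1.09999                 osd.21     up  1.00000 1.00000
EOF
```

On a live cluster the input would come from `ceph osd tree | down_per_host`.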

Ceph Performance Tuning

CPU governor: Powersave

/mnt/perfdisk1# ioping -D .

--- . (xfs /dev/rbd0) ioping statistics ---
7.20 k requests completed in 2.00 hour, 1.51 k iops, 5.89 MiB/s
min/avg/max/mdev = 378 us / 663 us / 1.15 ms / 90 us
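The governor named on this slide can be inspected and changed through the Linux cpufreq sysfs files. The sketch below is one way to do that and not necessarily how BIT configured it; the function names and the `CPU_SYSFS` override are hypothetical, and writing the governor needs root on a real system.

```shell
#!/bin/sh
# CPU_SYSFS can be overridden to try the functions against a scratch directory.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Print each distinct governor together with the number of CPUs using it.
show_governors() {
    cat "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

# Write governor $1 (e.g. "performance" or "powersave") to every CPU.
set_governor() {
    for f in "$CPU_SYSFS"/cpu[0-9]*/cpufreq/scaling_governor; do
        if [ -e "$f" ]; then
            printf '%s\n' "$1" > "$f"
        fi
    done
}

# Example:
#   show_governors
#   set_governor performance
```

A setting written this way does not survive a reboot, so it would still need to be made persistent through configuration management.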

Ceph Performance Tuning

CPU governor: Performance

/mnt/perfdisk1# ioping -D .
4 KiB from . (xfs /dev/rbd0): request=1 time=337 us
4 KiB from . (xfs /dev/rbd0): request=2 time=566 us
4 KiB from . (xfs /dev/rbd0): request=3 time=602 us
...
--- . (xfs /dev/rbd0) ioping statistics ---
7.20 k requests completed in 2.00 hour, 2.08 k iops, 8.12 MiB/s
min/avg/max/mdev = 214 us / 481 us / 937 us / 128 us

182 us difference in average latency (~27% decrease in latency, ~38% increase in throughput), at the cost of a 15% increase in power usage (performance mode)
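The relative changes can be recomputed directly from the two ioping summaries (average latency 663 us vs 481 us, 1.51 k vs 2.08 k iops). A small sketch with a hypothetical helper `pct_change`:

```shell
#!/bin/sh
# Print the relative change from $1 (before) to $2 (after) as a percentage
# with one decimal; negative values mean a decrease.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 663 481     # average latency in us, powersave -> performance: -27.5
pct_change 1.51 2.08   # k iops, powersave -> performance: 37.7
```

The same 182 us is about 38% of the new latency but about 27% of the old one, so the baseline has to be stated when quoting a percentage.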

Cluster on the move

All infra in one rack (easy installation / configuration)

Cluster on the move

All infra in one rack (easy configuration)

Challenges during development

Bootstrap cluster with Ansible (mon <-> mgr)
Namespace challenge (Ceph commands in default namespace)
Need to clear IPv6 neighbour cache after port flap
IPv6 only bug in Ansible (#26740)
Need to reboot all mons after upgrade before Ceph started working again (Luminous RC issue)
Benchmarking is hard
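Two of these challenges have short command-line workarounds. The sketch below assumes that an ssh session lands in the mgmt namespace while Ceph is only reachable from the default one, and that PID 1 lives in the default network namespace; both are assumptions, not statements from the slides. The function names and the `DRY_RUN` switch are hypothetical, and both commands need root when really executed.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run a ceph command in the network namespace of PID 1 (assumed to be the
# default namespace), e.g. from a shell inside the mgmt namespace.
ceph_in_default_ns() {
    run nsenter --target 1 --net -- ceph "$@"
}

# Flush the IPv6 neighbour cache of interface $1, e.g. after a port flap.
flush_v6_neigh() {
    run ip -6 neigh flush dev "$1"
}

# Example:
#   DRY_RUN=1 ceph_in_default_ns -s    # prints the nsenter command only
#   flush_v6_neigh "$INTERFACE"
```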

Almost production ready

Change to BlueStore
Final benchmarking
Metrics/dashboard/alerting
Final failure domain tests under load
Final fiber ring infra / patching