Ceph @ BIT

Stefan Kooman, stefan@bit.nl, @basseroet BIT

Agenda

  • From Storage vendor to Open Source Storage
  • Requirements for new storage solution
  • Ceph to the rescue!
  • What will we use Ceph for?
  • Timeline Ceph @ BIT
  • Design production cluster
  • Build the cluster
  • Development challenges

From Storage vendor to Open Source Storage

Reasons to migrate

  • Performance (lack thereof)
  • Does not scale horizontally
  • Lots of critical bugs
  • Lots of maintenance
  • Not happy with support

Looking for alternatives

Laws of the Widos 1/2

Wido Potters (Manager CCS @ BIT):

  • 1) The first 6 months you will be in love with your new storage from vendor X
  • 2) After 6 months cracks arise in the relationship
  • 3) After one year you are totally fed up with your storage solution

Functional Requirements

  • Redundant Block Storage (iSCSI)
  • Redundant File Storage (NFS)
  • Redundant Object Storage (S3)
  • Cluster remains functional in case of a data center failure
  • Low maintenance
  • Stable platform
  • Horizontally scalable (makes sense economically too)

Considerations

  • Eat your own dog food
  • No difference between BIT / Customers

Ceph to the rescue!

    Provides us with (see the sketch below):
  • Redundant Block Storage: RBD
  • Redundant File Storage: CephFS
  • Redundant Object Storage: RGW
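
A minimal sketch of exercising all three interfaces from the command line; pool, image, filesystem and user names are placeholders, not BIT's:

# Block storage: a pool and an RBD image (placeholder names)
ceph osd pool create rbd 128
rbd create rbd/demo-disk --size 10G

# File storage: a CephFS filesystem on two existing pools
ceph fs new cephfs cephfs_metadata cephfs_data

# Object storage: an S3-capable user on the RADOS Gateway
radosgw-admin user create --uid=demo --display-name="Demo user"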

Hey, we need support for legacy applications

  • NFS Ganesha on top of CephFS (see the export sketch below)
  • SUSE iSCSI Gateway (lrbd) on top of Ceph RBD, or
  • Ceph iSCSI Gateway on top of Ceph RBD
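
For the NFS part, a minimal sketch assuming NFS Ganesha with the Ceph FSAL; the export id, pseudo path and service name are placeholders and may differ per distribution:

# Write a minimal CephFS export for NFS Ganesha (placeholder values)
cat > /etc/ganesha/ganesha.conf <<'EOF'
EXPORT {
    Export_Id = 1;
    Path = "/";            # path inside CephFS
    Pseudo = "/cephfs";    # NFSv4 pseudo filesystem path
    Access_Type = RW;
    FSAL {
        Name = CEPH;       # use the CephFS FSAL
    }
}
EOF
systemctl restart nfs-ganesha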

Ceph @ BIT Timeline

Ceph Test cluster

Ceph Test cluster

(I made this mess, don't blame DC management)

Ceph Test Cluster

  • Useful to test switches
  • Get familiar with Ceph
  • Test Ceph upgrades
  • Test failures
  • Plan for production

Design Production Cluster

  • IPv6 only (Ceph is not dual stack)
  • Access to Ceph storage through the BIT Access Network
  • Triple replicas: one in each datacenter (BIT-1, BIT-2A, BIT-2C); see the CRUSH sketch below
  • BGP routed
  • BGP ECMP across links (easy bandwidth scaling)
  • BGP EVPN (VXLAN)
  • Redundant paths between DCs (two separate fiber rings)
  • No better kill than overkill! :p
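
A sketch of how one-replica-per-datacenter placement can be expressed in CRUSH, reusing the bucket names that appear in the ceph osd tree output later on; rule name, pool name and PG count are illustrative:

# Replicated CRUSH rule with 'datacenter' as the failure domain (Luminous syntax)
ceph osd crush rule create-replicated dc-replicated BIT-Ede datacenter
# Example pool using that rule, with three replicas (one per datacenter)
ceph osd pool create rbd 2048 2048 replicated dc-replicated
ceph osd pool set rbd size 3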

Ceph Production Cluster

Build Production Cluster

Laws of the Widos 2/2

    Wido den Hollander (42on):

  • Ceph IPv6
  • one (public) network
  • one interface
  • one IPv6 address (see the ceph.conf sketch below)
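
Translated into configuration, that advice boils down to a couple of lines; the prefix below is a documentation placeholder, not BIT's network:

# Sketch: the relevant [global] settings, written out as an example fragment
cat <<'EOF' > /etc/ceph/ceph.conf.ipv6-example
[global]
ms bind ipv6 = true
# one public network, no separate cluster network
public network = 2001:db8:100::/64
EOF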

OOB Management

    Wish for out of band management
  • Linux network namespaces
  • Complete isolation between network stacks

mgmt namespace


# Create a dedicated "mgmt" namespace and move the OOB interface into it
/sbin/ip netns add mgmt
/sbin/ip link set $INTERFACE netns mgmt
/sbin/ip netns exec mgmt /sbin/ip link set lo up
/sbin/ip netns exec mgmt /sbin/ip link set $INTERFACE up
# Address is a documentation-prefix placeholder, not the real OOB prefix
/sbin/ip netns exec mgmt /sbin/ip -6 addr add 2001:db8:dead:beef::1/64 dev $INTERFACE
# Replace any existing default route with a static one via the OOB gateway
/sbin/ip netns exec mgmt /sbin/ip -6 route del default
/sbin/ip netns exec mgmt /sbin/ip -6 route add default via $GATEWAY6

Why a separate management network?

  • Troubleshoot Ceph / network issues through SSH
  • Always collect metrics
  • ... to ease future changes
  • DPDK in Ceph
  • Make sure daemons live in the right namespace (see the systemd sketch below)

  • sshd
  • rsyslogd
  • bit-monitoring
  • telegraf
  • snappy
  • more to come?
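
One way to pin a daemon to the mgmt namespace is a systemd drop-in that re-executes it under ip netns exec; the unit name and ExecStart line below follow Debian's ssh.service and are an assumption, not necessarily how BIT does it:

# Hypothetical drop-in: run sshd inside the mgmt namespace
mkdir -p /etc/systemd/system/ssh.service.d
cat <<'EOF' > /etc/systemd/system/ssh.service.d/netns.conf
[Service]
ExecStart=
ExecStart=/sbin/ip netns exec mgmt /usr/sbin/sshd -D $SSHD_OPTS
EOF
systemctl daemon-reload && systemctl restart ssh
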
Infra as code

  • Git (instead of SVN)
  • Ansible (instead of cfengine2); see the playbook sketch below
  • GitLab (workflow)
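
As an illustration of the workflow, a hypothetical minimal playbook kept in Git and rolled out with Ansible; the host group, template and paths are made up:

# Sketch only: deploy ceph.conf from a templated file in the repository
cat <<'EOF' > deploy-ceph-conf.yml
---
- hosts: ceph_nodes
  become: true
  tasks:
    - name: Deploy ceph.conf from a template kept in Git
      template:
        src: templates/ceph.conf.j2
        dest: /etc/ceph/ceph.conf
        owner: root
        group: root
        mode: '0644'
EOF
ansible-playbook -i inventory deploy-ceph-conf.yml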

Switch infra

  • Storage network similar to access network
  • Provisioned in the same way (Git, Ansible, GitLab)
  • OOB for switches (Arista) implemented in the same way (namespaces)
  • IPv6-only OOB

Monitoring / Metrics

  • Near real-time info
  • Alerting

Monitoring / Metrics

  • Icinga (alerting)
  • InfluxDB (storing metrics)
  • Grafana (viewing metrics)
  • Telegraf (collecting metrics); see the config sketch below
  • Snappy (collecting Ceph metrics)
  • AMP (continuously performs active network measurements between a mesh of Ceph nodes)
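
A hypothetical Telegraf fragment tying the pieces together: the ceph input plugin reads the admin sockets and the results are shipped to InfluxDB; paths, URL and database name are placeholders and the exact options depend on the Telegraf version:

# Sketch: collect Ceph admin-socket stats and send them to InfluxDB
cat <<'EOF' > /etc/telegraf/telegraf.d/ceph.conf
[[inputs.ceph]]
  socket_dir = "/var/run/ceph"
  mon_prefix = "ceph-mon"
  osd_prefix = "ceph-osd"
  gather_admin_socket_stats = true

[[outputs.influxdb]]
  urls = ["http://influxdb.example.net:8086"]
  database = "telegraf"
EOF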

Ceph tools

  • ceph -s
  • ceph osd tree

ceph -s

    
      cluster:
        id:     7d44bbde-2442-59dc-8c71-e7dad785c99b
        health: HEALTH_WARN
                1 datacenter (21 osds) down
                21 osds down
                3 hosts (21 osds) down
                1 rack (21 osds) down
                Degraded data redundancy: 19469/56142 objects degraded (34.678%), 2048 pgs unclean, 2048 pgs degraded, 2048 pgs undersized
                2/5 mons down, quorum mon5,mon3,mon4
    
      services:
        mon: 5 daemons, quorum mon5,mon3,mon4, out of quorum: mon1, mon2
        mgr: mon4(active), standbys: mon3, mon5
        osd: 63 osds: 41 up, 62 in
    

ceph osd tree

    
    ceph osd tree
    ID  CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF 
    -21       69.29997 region BIT-Ede                                 
    -23       23.09999     datacenter BIT-1                           
    -26       23.09999         rack rack1                             
     -3        7.70000             host osd1                          
      0   ssd  1.09999                 osd.0    down  1.00000 1.00000 
      1   ssd  1.09999                 osd.1    down  1.00000 1.00000 
      2   ssd  1.09999                 osd.2    down  1.00000 1.00000 
      3   ssd  1.09999                 osd.3    down  1.00000 1.00000 
      4   ssd  1.09999                 osd.4    down  1.00000 1.00000 
      5   ssd  1.09999                 osd.5    down  1.00000 1.00000 
      6   ssd  1.09999                 osd.6    down  1.00000 1.00000 
     -2        7.70000             host osd2                          
      7   ssd  1.09999                 osd.7    down  1.00000 1.00000 
      8   ssd  1.09999                 osd.8    down  1.00000 1.00000 
      9   ssd  1.09999                 osd.9    down  1.00000 1.00000 
     10   ssd  1.09999                 osd.10   down  1.00000 1.00000 
     11   ssd  1.09999                 osd.11   down  1.00000 1.00000 
     12   ssd  1.09999                 osd.12   down  1.00000 1.00000 
     13   ssd  1.09999                 osd.13   down  1.00000 1.00000 
     -5        7.70000             host osd3                          
     14   ssd  1.09999                 osd.14   down  1.00000 1.00000 
     15   ssd  1.09999                 osd.15   down  1.00000 1.00000 
     16   ssd  1.09999                 osd.16   down  1.00000 1.00000 
     17   ssd  1.09999                 osd.17   down  1.00000 1.00000 
     18   ssd  1.09999                 osd.18   down  1.00000 1.00000 
     19   ssd  1.09999                 osd.19   down  1.00000 1.00000 
     20   ssd  1.09999                 osd.20   down  1.00000 1.00000 
    -24       23.09999     datacenter BIT-2A                          
    -27       23.09999         rack rack2                             
     -6        7.70000             host osd4                          
     21   ssd  1.09999                 osd.21     up  1.00000 1.00000 
     22   ssd  1.09999                 osd.22     up  1.00000 1.00000 
     23   ssd  1.09999                 osd.23     up  1.00000 1.00000 
     24   ssd  1.09999                 osd.24     up  1.00000 1.00000 
     25   ssd  1.09999                 osd.25     up  1.00000 1.00000 
     26   ssd  1.09999                 osd.26     up  1.00000 1.00000 
     27   ssd  1.09999                 osd.27     up  1.00000 1.00000 
     -4        7.70000             host osd5                          
     28   ssd  1.09999                 osd.28     up  1.00000 1.00000 
     29   ssd  1.09999                 osd.29     up  1.00000 1.00000 
     30   ssd  1.09999                 osd.30     up  1.00000 1.00000 
     31   ssd  1.09999                 osd.31     up  1.00000 1.00000 
     32   ssd  1.09999                 osd.32     up  1.00000 1.00000 
     33   ssd  1.09999                 osd.33     up  1.00000 1.00000 
     34   ssd  1.09999                 osd.34     up  1.00000 1.00000 
     -7        7.70000             host osd6                          
     35   ssd  1.09999                 osd.35     up  1.00000 1.00000 
     36   ssd  1.09999                 osd.36     up  1.00000 1.00000 
     37   ssd  1.09999                 osd.37     up  1.00000 1.00000 
     38   ssd  1.09999                 osd.38     up  1.00000 1.00000 
     39   ssd  1.09999                 osd.39     up  1.00000 1.00000 
     40   ssd  1.09999                 osd.40     up  1.00000 1.00000 
     41   ssd  1.09999                 osd.41     up  1.00000 1.00000 
    -22       23.09999     datacenter BIT-2C                          
    -25       23.09999         rack rack3                             
     -8        7.70000             host osd7                          
     42   ssd  1.09999                 osd.42     up  1.00000 1.00000 
     43   ssd  1.09999                 osd.43     up  1.00000 1.00000 
     44   ssd  1.09999                 osd.44     up  1.00000 1.00000 
     45   ssd  1.09999                 osd.45     up  1.00000 1.00000 
     46   ssd  1.09999                 osd.46     up  1.00000 1.00000 
     47   ssd  1.09999                 osd.47     up  1.00000 1.00000 
     48   ssd  1.09999                 osd.48     up  1.00000 1.00000 
     -9        7.70000             host osd8                          
     49   ssd  1.09999                 osd.49     up  1.00000 1.00000 
     50   ssd  1.09999                 osd.50     up  1.00000 1.00000 
     51   ssd  1.09999                 osd.51     up  1.00000 1.00000 
     52   ssd  1.09999                 osd.52     up  1.00000 1.00000 
     53   ssd  1.09999                 osd.53     up  1.00000 1.00000 
     54   ssd  1.09999                 osd.54     up  1.00000 1.00000 
     55   ssd  1.09999                 osd.55     up  1.00000 1.00000 
    -10        7.70000             host osd9                          
     56   ssd  1.09999                 osd.56     up  1.00000 1.00000 
     57   ssd  1.09999                 osd.57     up  1.00000 1.00000 
     58   ssd  1.09999                 osd.58     up  1.00000 1.00000 
     59   ssd  1.09999                 osd.59     up  1.00000 1.00000 
     60   ssd  1.09999                 osd.60     up  1.00000 1.00000 
     61   ssd  1.09999                 osd.61     up  1.00000 1.00000 
     62   ssd  1.09999                 osd.62     up  1.00000 1.00000 
     -1              0 root default
    

Ceph Performance Tuning

  • CPU governor: Powersave
  • /mnt/perfdisk1# ioping -D .
    
    --- . (xfs /dev/rbd0) ioping statistics ---
    7.20 k requests completed in 2.00 hour, 1.51 k iops, 5.89 MiB/s
    min/avg/max/mdev = 378 us / 663 us / 1.15 ms / 90 us
    

Ceph Performance Tuning

  • CPU governor: Performance
  • /mnt/perfdisk1# ioping -D .
    4 KiB from . (xfs /dev/rbd0): request=1 time=337 us
    4 KiB from . (xfs /dev/rbd0): request=2 time=566 us
    4 KiB from . (xfs /dev/rbd0): request=3 time=602 us
    ...
    --- . (xfs /dev/rbd0) ioping statistics ---
    7.20 k requests completed in 2.00 hour, 2.08 k iops, 8.12 MiB/s
    min/avg/max/mdev = 214 us / 481 us / 937 us / 128 us
    
Average latency drops by 182 us (663 us to 481 us, roughly 27% lower) and throughput rises by about 38% (1.51 k to 2.08 k IOPS), at the cost of roughly 15% higher power usage in performance mode. A sketch for switching the governor follows.
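
Switching the governor itself is a one-liner; a sketch, assuming the cpufreq sysfs interface (or the cpupower utility) is available:

# Set every core to the performance governor via sysfs
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# or, with the cpupower utility installed:
cpupower frequency-set -g performance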

Cluster on the move

All infra in one rack (easy installation / configuration)

Cluster on the move

All infra in one rack (easy configuration)

Challenges during development

  • Bootstrap cluster with Ansible (mon <-> mgr)
  • Namespace challenge (Ceph commands in the default namespace)
  • Need to clear the IPv6 neighbour cache after a port flap (see the sketch below)
  • IPv6-only bug in Ansible (#26740)
  • Needed to reboot all mons after the upgrade before Ceph started working again (Luminous RC issue)
  • Benchmarking is hard
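
For the neighbour-cache issue, the workaround is a sketch along these lines; $INTERFACE is a placeholder as before:

# Flush the stale IPv6 neighbour entries on the flapped port
ip -6 neigh flush dev $INTERFACE
# inside the mgmt namespace: ip netns exec mgmt ip -6 neigh flush dev $INTERFACE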

Almost production ready

  • Change to BlueStore (see the sketch below)
  • Final benchmarking
  • Metrics / dashboard / alerting
  • Final failure-domain tests under load
  • Final fiber ring infra / patching
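
For the BlueStore change, a sketch of recreating an OSD with ceph-volume on Luminous; the device name is a placeholder and this is not necessarily the migration path BIT ended up using:

# Wipe the device and recreate the OSD as BlueStore (placeholder device)
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX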