This document discusses building a high-performance and highly durable block storage service using Ceph. It describes the architecture, including a minimum deployment of 12 OSD nodes and 3 monitor nodes. It outlines optimizations made to Ceph, Qemu, and the operating system configuration to achieve high performance, including 6000 IOPS and 170 MB/s throughput. It also discusses how the CRUSH map can be optimized to reduce recovery time and the number of copy sets, improving durability to 99.99999999% (ten nines).
Build a High-Performance and Highly Durable Block Storage Service Based on Ceph
1. Build a High-Performance and Highly Durable Block Storage Service Based on Ceph
Rongze & Haomai
[ rongze | haomai ]@unitedstack.com
2. CONTENTS
1 Block Storage Service
2 High Performance
3 High Durability
4 Operation Experience
6. Versions across OpenStack releases (Now = Juno):
   OpenStack: Essex → Folsom → Havana → Icehouse/Juno → Juno
   Ceph:      0.42 → 0.67.2 → based on 0.67.5 → based on 0.67.5 → based on 0.80.7
   CentOS:    6.4 → 6.5 → 6.5 → 6.6
   Qemu:      0.12 → 0.12 → based on 1.2 → based on 1.2 → 2.0
   Kernel:    2.6.32 → 3.12.21 → 3.12.21 → ?
13. Boot Storm
• 1 Gb network: 20 GB / 100 MB/s = 200 s ≈ 3.3 min
• 10 Gb network: 20 GB / 1000 MB/s = 20 s
[Diagram: nova VMs fetch 20 GB images over HTTP from the cinder backends (LVM, SAN, Ceph, LocalFS, NFS, GlusterFS) and the glance backends (LocalFS, Swift, Ceph, GlusterFS)]
14. Nova, Glance, and Cinder use the same Ceph pool
• All actions complete in seconds
• No boot storm (illustrated below)
[Diagram: nova VMs, cinder, and glance all backed by the one Ceph cluster]
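Booting in seconds is possible because, with all three services on the same Ceph pool, a new disk can be a copy-on-write clone of the Glance image instead of a 20 GB HTTP download. A rough illustration with the rbd CLI (pool and image names are made up; OpenStack drives these steps through librbd):

    rbd snap create openstack/golden-image@base                     # snapshot the Glance image
    rbd snap protect openstack/golden-image@base                    # clones require a protected snapshot
    rbd clone openstack/golden-image@base openstack/vm-0001-disk    # instant CoW clone, no data copied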
15. QoS
• Nova
• Libvirt
• Qemu (throttle)
Two Volume Types (CLI sketch after this list)
• Cinder multi-backend
• Ceph SSD Pool
• Ceph SATA Pool
Shared Volume
• Read Only
• Multi-attach
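A hedged sketch of how the two volume types and the front-end QoS could be wired up with standard Cinder commands. Backend names, type names, and the limits are illustrative; the deck does not show its exact settings:

    # assumes cinder.conf already defines two RBD backends (multi-backend) with
    # volume_backend_name set to ceph-ssd and ceph-sata
    cinder type-create ssd
    cinder type-key ssd set volume_backend_name=ceph-ssd
    cinder type-create sata
    cinder type-key sata set volume_backend_name=ceph-sata

    # front-end QoS specs are handed to Nova/Libvirt and enforced by Qemu's I/O throttle
    cinder qos-create ssd-iops consumer=front-end read_iops_sec=6000 write_iops_sec=6000
    cinder qos-associate <qos-spec-id> <ssd-type-id>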
18. • CPU:
• Take the CPU out of power-save mode:
• echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
• Cgroup:
• Bind ceph-osd processes to fixed cores (1-2 cores per OSD)
• Memory:
• Turn off NUMA (if the platform supports it) in /etc/grub.conf
• Set vm.swappiness = 0
• Block:
• echo deadline > /sys/block/sd[x]/queue/scheduler
• FileSystem:
• Mount with “noatime, nobarrier” (see the consolidated sketch below)
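A consolidated sketch of the host tuning above. Device names, the mount point, and the pinning command are illustrative; adapt them per OSD node:

    # CPU: leave power-save mode
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null

    # Memory: add numa=off to the kernel line in /etc/grub.conf, and stop swapping
    sysctl -w vm.swappiness=0

    # Block: deadline scheduler on every OSD data disk (sdb shown as an example)
    echo deadline > /sys/block/sdb/queue/scheduler

    # FileSystem: mount OSD data disks with noatime,nobarrier (example device and path)
    mount -o noatime,nobarrier /dev/sdb1 /var/lib/ceph/osd/ceph-0

    # Cgroup: pin each ceph-osd to 1-2 dedicated cores (cpuset cgroups or taskset)
    taskset -pc 2,3 $(pidof -s ceph-osd)    # illustrative; real deployments pin per OSD daemon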
24. Rule 1: Keep FD
• Facts:
• FileStore bottleneck: performance degrades remarkably when the FD cache misses
• SSD = 480 GB = 122880 objects (4 MB) = 30720 objects (16 MB) in theory
• Action:
• Increase FDCache/OMapHeader cache sizes so they are large enough to hold all objects
• Increase object size to 16 MB instead of the default 4 MB
• Raise the OS default fd limits
• Configuration (example below):
• “filestore_omap_header_cache_size”
• “filestore_fd_cache_size”
• “rbd_chunk_size” (OpenStack Cinder)
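A hedged ceph.conf sketch for Rule 1. The values are illustrative only, sized from the slide's own arithmetic (about 30720 objects of 16 MB per 480 GB SSD); the deck does not publish its exact numbers:

    [osd]
    # large enough to hold the FDs and omap headers of one OSD's whole object set
    filestore_fd_cache_size = 32768
    filestore_omap_header_cache_size = 32768
    # the 16 MB object size itself is configured on the OpenStack side
    # (the “rbd_chunk_size” option mentioned above), not in ceph.conf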
25. Rule 2: Sparse Read/Write
• Facts:
• Often only a few KB of an object are actually used by RBD
• Clone/recovery copies the full object, which hurts performance and capacity
• Action:
• Use sparse read/write
• Problem:
• XFS and other local filesystems have had bugs in their fiemap implementations
• Configuration (example below):
• “filestore_fiemap = true”
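In ceph.conf this is a single switch per OSD; the caveat is the fiemap reliability of the local filesystem noted above:

    [osd]
    filestore_fiemap = true    # enable sparse read/write; verify fiemap behaves correctly on your kernel/XFS first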
26. Rule 3: Drop default limits
• Facts:
• The default configuration values are tuned for HDD backends
• Action:
• Change all throttle-related configuration values
• Configuration (example below):
• “filestore_wbthrottle_*”
• “filestore_queue_*”
• “journal_queue_*”
• plus more related configs (recovery, scrub)
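A hedged sketch of what dropping the HDD-oriented defaults might look like. The option families come from the slide, but every value below is a placeholder rather than the deck's production setting:

    [osd]
    filestore_queue_max_ops = 5000
    filestore_queue_max_bytes = 1048576000
    journal_queue_max_ops = 5000
    journal_queue_max_bytes = 1048576000
    filestore_wbthrottle_xfs_ios_start_flusher = 10000
    # plus the recovery and scrub throttles mentioned above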
27. Rule 4: Use RBD cache
• Facts:
• RBD cache gives a remarkable performance improvement for sequential read/write
• Action:
• Enable RBD cache
• Configuration (example below):
• “rbd_cache = true”
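RBD cache is a librbd (client-side) setting, so it belongs in the [client] section. A minimal sketch; the second option is a commonly paired safety setting, not something the slide mentions:

    [client]
    rbd_cache = true
    rbd_cache_writethrough_until_flush = true   # stay in write-through mode until the guest issues its first flush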
32. Result: Latency
• 4K random write for 1TB rbd image: 1.05 ms per IO
• 4K random read for 1TB rbd image: 0.5 ms per IO
• 1.5x latency performance improvement
• Outstanding large dataset performance
34. Ceph Reliability Model
• https://wiki.ceph.com/Development/Reliability_model
• “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data”
• “Copysets: Reducing the Frequency of Data Loss in Cloud Storage”
• Ceph CRUSH code
Durability Formula
• P = func(N, R, S, AFR)
• P = Pr * M / C(R, N)
38. • Durability depends on the OSD recovery time
• Durability depends on the number of copy sets in the Ceph pool
A possible PG's OSD set is a copy set. Data loss in Ceph means the loss of any PG, which is really the loss of some copy set.
If the replication number is 3 and we lose 3 OSDs, the probability of data loss depends on the number of copy sets, because those 3 OSDs may not form a copy set.
39. • The shorter the recovery time, the higher the durability
• The fewer the copy sets, the higher the durability
47. Default CRUSH map setting
[Diagram: server-08 under the default CRUSH placement]
If one OSD goes out, only two OSDs can take part in the data recovery, so the recovery time is too long. We need more OSDs participating in the recovery to reduce the recovery time. (The commands below show how the map is pulled and edited.)
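For reference, this is the standard way to pull, edit, and re-inject a CRUSH map when making this kind of change; the customized hierarchy itself is specific to the deployment and is not reproduced here:

    ceph osd getcrushmap -o crushmap.bin        # export the current (default) map
    crushtool -d crushmap.bin -o crushmap.txt   # decompile it to editable text
    # edit crushmap.txt: add bucket layers / rules so more OSDs share the recovery work
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new        # inject the modified map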
58. Deploy
• eNovance: puppet-ceph
• Stackforge: puppet-ceph
• UnitedStack: puppet-ceph
  • shorter deploy time
  • supports all Ceph options
  • supports multiple disk types
  • wwn-id instead of disk label (see below)
  • hieradata
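Why wwn-id instead of a disk label: /dev/sd[x] names can change across reboots, while the by-id path stays bound to the physical drive, so hieradata keeps pointing at the right disk. For example:

    # stable device paths referenced from hieradata instead of /dev/sd[x]
    ls -l /dev/disk/by-id/wwn-*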
75. UnitedStack - The Leading OpenStack Cloud Service
Solution Provider in China
76. VM deployment in seconds
WYSIWYG network topology
High performance cloud storage
Billing by seconds
78. [Diagram: U Center, the Unified Cloud Service Platform: Public Cloud (Beijing1, Guangdong1, devel, test, ...), Managed Cloud (IDC, ...), Cloud Storage, Mobile Social, Financial; Unified Ops, Unified SLA]
86. P = func(N, R, S, AFR)
• P: the probability of losing all copies
• N: the number of OSDs in the Ceph pool
• R: the number of copies
• S: the number of OSDs in a bucket (it determines the recovery time)
• AFR: disk annualized failure rate
87. Failure events are considered to be Poisson
• Failure rates are characterized in units of failures per billion hours (FITs), so all periodicities are represented in FITs and all times in hours:
  fit = failures in time = 1/MTTF ≈ 1/MTBF = AFR / (24 * 365)
• Event probabilities: with λ the failure rate, the probability of n failure events during time t is
  P_n(λ, t) = (λt)^n * e^(-λt) / n!
88. The probability of data loss
• OSD set: a copy set, the set of OSDs that some PG resides on
• Data loss: the loss of any OSD set
• Non-recoverable errors are ignored; assuming NREs never happen may be reasonable on scrubbed OSDs
89. Non-Recoverable Errors
NREs are read errors that cannot be corrected by retries or ECC:
• media noise
• high-fly writes
• off-track writes
90. The probability of losing R OSDs
1. The probability of an initial OSD loss incident.
2. Having suffered this loss, the probability of losing R-1 more OSDs within the recovery time.
3. Multiply the two; the result is Pr.
91. The probability of copy set loss
1. M = the number of copy sets in the Ceph pool
2. The number of ways to choose any R OSDs out of N is C(R, N)
3. The probability of copy set loss is Pr * M / C(R, N)
92. P = Pr * M / C(R, N)
If R = 3 and one PG maps to (osd.x, osd.y, osd.z),
then (osd.x, osd.y, osd.z) is a copy set.
All copy sets are in line with the rules of the CRUSH map.
M = the number of copy sets in the Ceph pool
Pr = the probability of losing R OSDs
C(R, N) = the number of ways to choose any R OSDs out of N
P = the probability of losing any copy set (data loss)
(One way to assemble these pieces is sketched below.)
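Putting slides 87 to 92 together, one plausible way to assemble the formula. The deck lists the steps but not a closed-form expression, and treating the S OSDs of a bucket as the recovery domain in step 2 is an assumption made here:

    \lambda = \frac{AFR}{24 \times 365} \quad \text{(per-disk failure rate, failures per hour)}

    P_n(\lambda, t) = \frac{(\lambda t)^n e^{-\lambda t}}{n!}

    P_{\text{first}} = 1 - P_0(N\lambda,\, T) \quad \text{(step 1: at least one OSD fails during the period } T\text{)}

    P_{\text{rest}} \approx \bigl(1 - P_0(S\lambda,\, t_r)\bigr)^{R-1} \quad \text{(step 2: } R-1 \text{ further failures in the same bucket during recovery time } t_r\text{)}

    P_r \approx P_{\text{first}} \cdot P_{\text{rest}}, \qquad P = P_r \cdot \frac{M}{\binom{N}{R}} \quad \text{(with } \binom{N}{R} = C(R, N)\text{)}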
98. Trade-off between durability and availability (new CRUSH map)

    Pool   N    R    S    Nines   Recovery time
    Ceph   72   3    3    11      31 min
    Ceph   72   3    6    10      13 min
    Ceph   72   3    12   10      6 min
    Ceph   72   3    24   9       3 min