This topic created in 1199 days ago, the information mentioned may be changed or developed.

Ceph 读写性能问题，读入快，写出慢

大家好，我们需要将所有云环境迁移到 Proxmox 。目前我正在评估测试 Proxmox+Ceph+OpenStack 。

但是现在遇到以下困难：

VMware vSAN 迁移到 ceph 时，我发现 hdd+ssd 在 ceph 中的表现非常糟糕，并且写性能非常差。性能远不及 vSAN
全闪存结构中的 ceph 顺序写入性能还不如单块硬盘，甚至不如单块机械硬盘
在使用 bcache 中的 hdd+ssd 结构中，ceph 的顺序写入性能远低于单块硬盘写入

测试服务器参数（这不重要）：

CPU：Dual Intel® Xeon® E5-2698Bv3

Memory：8 x 16G DDR3

Dual 1 Gbit NIC：Realtek Semiconductor Co., Ltd. RTL8111/8168/8411

Disk：

1 x 500G NVME SAMSUNG MZALQ512HALU-000L1 (同时也是 PVE 中的 ssd-data Thinpool)

1 x 500G SATA WDC_WD5000AZLX-60K2TA0 (物理机系统盘)

2 x 500G SATA WDC_WD5000AZLX-60K2TA0

1 x 1T SATA ST1000LM035-1RK172

PVE：pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.74-1-pve)

Network Configure：

enp4s0 (OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (192.168.1.3/24,192.168.1.1)

enp5s0 (OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000)

vmbr2 (OVS Bridge,MTU=9000)

测试虚拟机参数 x 3 (三台虚拟机是一样的参数)：

CPU：32 (1 sockets, 32 cores) [host]

Memory：32G

Disk：

1 x local-lvm:vm-101-disk-0,iothread=1,size=32G

2 x ssd-data:vm-101-disk-0,iothread=1,size=120G

Network Device：

net0: bridge=vmbr0,firewall=1

net1: bridge=vmbr2,firewall=1,mtu=1 (Ceph Cluster/Public Network)

net2: bridge=vmbr0,firewall=1

net3: bridge=vmbr0,firewall=1

Network Configure：

ens18 (net0,OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (10.10.1.11/24,10.10.1.1)

ens19 (net1,OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000) -> br1ceph (192.168.10.1/24,MTU=9000)

ens20 (net2,Network Device,Active=No)

ens21 (net3,Network Device,Active=No)

基准测试工具:

fio
fio-cdm ( https://github.com/xlucn/fio-cdm)

对于 fio-cdm ，如果不填写任何参数，那么对应于 fio 的配置文件如下

使用 python fio-cdm -f - 可以得到

[global]
ioengine=libaio
filename=.fio_testmark
directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

环境构建步骤:

# prepare tools
root@pve01:~# apt update -y && apt upgrade -y
root@pve01:~# apt install fio git -y
root@pve01:~# git clone https://github.com/xlucn/fio-cdm.git

# create test block
root@pve01:~# rbd create test -s 20G
root@pve01:~# rbd map test
root@pve01:~# mkfs.xfs /dev/rbd0
root@pve01:~# mkdir /mnt/test
root@pve01:/mnt# mount /dev/rbd0 /mnt/test

# start test
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm

环境测试:

Network Bandwidth

root@pve01:~# apt install iperf3 -y
root@pve01:~# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.10.1.12, port 52968
[  5] local 10.10.1.11 port 5201 connected to 10.10.1.12 port 52972
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.87 GBytes  16.0 Gbits/sec                  
[  5]   1.00-2.00   sec  1.92 GBytes  16.5 Gbits/sec                  
[  5]   2.00-3.00   sec  1.90 GBytes  16.4 Gbits/sec                  
[  5]   3.00-4.00   sec  1.90 GBytes  16.3 Gbits/sec                  
[  5]   4.00-5.00   sec  1.85 GBytes  15.9 Gbits/sec                  
[  5]   5.00-6.00   sec  1.85 GBytes  15.9 Gbits/sec                  
[  5]   6.00-7.00   sec  1.70 GBytes  14.6 Gbits/sec                  
[  5]   7.00-8.00   sec  1.75 GBytes  15.0 Gbits/sec                  
[  5]   8.00-9.00   sec  1.89 GBytes  16.2 Gbits/sec                  
[  5]   9.00-10.00  sec  1.87 GBytes  16.0 Gbits/sec                  
[  5]  10.00-10.04  sec  79.9 MBytes  15.9 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  18.6 GBytes  15.9 Gbits/sec                  receiver

Jumbo Frames

root@pve01:~# ping -M do -s 8000 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 8000(8028) bytes of data.
8008 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=1.51 ms
8008 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.500 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.500/1.007/1.514/0.507 ms
root@pve01:~#

基准测试分类:

Physical Disk Benchmark ，物理磁盘基准测试
Single osd, single server benchmark ，单个 OSD 单个服务器基准测试
Multiple OSDs, single server benchmarks ，多个 OSD 单个服务器基准测试
Multiple OSDs, multiple server benchmarks ，多个 OSD 多个服务器基准测试

基准测试结果（ Ceph 和系统没有进行过任何调优，没有使用 bcache 加速）

Benchmark Result (Ceph and the system have not been tuned or bcache accelerated. ):

1. Physical Disk Benchmark (Test sequence is 4)（测试顺序是 4 ）

step.

root@pve1:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 465.8G  0 disk 
├─sda1                         8:1    0  1007K  0 part 
├─sda2                         8:2    0   512M  0 part /boot/efi
└─sda3                         8:3    0 465.3G  0 part 
  ├─pve-root                 253:0    0    96G  0 lvm  /
  ├─pve-data_tmeta           253:1    0   3.5G  0 lvm  
  │ └─pve-data-tpool         253:3    0 346.2G  0 lvm  
  │   ├─pve-data             253:4    0 346.2G  1 lvm  
  │   └─pve-vm--100--disk--0 253:5    0    16G  0 lvm  
  └─pve-data_tdata           253:2    0 346.2G  0 lvm  
    └─pve-data-tpool         253:3    0 346.2G  0 lvm  
      ├─pve-data             253:4    0 346.2G  1 lvm  
      └─pve-vm--100--disk--0 253:5    0    16G  0 lvm  
sdb                            8:16   0 931.5G  0 disk 
sdc                            8:32   0 465.8G  0 disk 
sdd                            8:48   0 465.8G  0 disk 
nvme0n1                      259:0    0 476.9G  0 disk 
root@pve1:~# mkfs.xfs /dev/nvme0n1 -f
root@pve1:~# mkdir /mnt/nvme
root@pve1:~# mount /dev/nvme0n1 /mnt/nvme
root@pve1:~# cd /mnt/nvme/

result.

root@pve1:/mnt/nvme# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/nvme 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2361.95|     1435.48|
|SEQ1M Q1 T1 |     1629.84|     1262.63|
|RND4K Q32T16|      954.86|     1078.88|
|. IOPS      |   233119.53|   263398.08|
|. latency us|     2194.84|     1941.78|
|RND4K Q1 T1 |       55.56|      225.06|
|. IOPS      |    13565.49|    54946.21|
|. latency us|       72.76|       16.97|

2. Single osd, single server benchmark (Test sequence is 3)（测试顺序是 3 ）

修改 ceph.conf 中osd_pool_default_min_size和osd_pool_default_size为 1 ，然后systemctl restart ceph.target并修复所有报错

step.

root@pve01:/mnt/test# ceph osd pool get rbd size
size: 2
root@pve01:/mnt/test# ceph config set global  mon_allow_pool_size_one true
root@pve01:/mnt/test# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1
root@pve01:/mnt/test# ceph osd pool set rbd size 1 --yes-i-really-mean-it
set pool 2 size to 1

result

root@pve01:/mnt/test# ceph -s
  cluster:
    id:     1f3eacc8-2488-4e1a-94bf-7181ee7db522
    health: HEALTH_WARN
            2 pool(s) have no replicas configured
 
  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 17m)
    mgr: pve01(active, since 17m), standbys: pve02, pve03
    osd: 6 osds: 1 up (since 19s), 1 in (since 96s)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 281 objects, 1.0 GiB
    usage:   1.1 GiB used, 119 GiB / 120 GiB avail
    pgs:     33 active+clean
 
root@pve01:/mnt/test# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1     down         0  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1153.07|      515.29|
|SEQ1M Q1 T1 |      447.35|      142.98|
|RND4K Q32T16|       99.07|       32.19|
|. IOPS      |    24186.26|     7859.91|
|. latency us|    21148.94|    65076.23|
|RND4K Q1 T1 |        7.47|        1.48|
|. IOPS      |     1823.24|      360.98|
|. latency us|      545.98|     2765.23|
root@pve01:/mnt/test#

3. Multiple OSDs, single server benchmarks (Test sequence is 2)（测试顺序是 2 ）

修改 crushmap 中step chooseleaf firstn 0 type host，将host修改为osd

OSD tree

root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000

result

root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1376.59|      397.29|
|SEQ1M Q1 T1 |      442.74|      111.41|
|RND4K Q32T16|      114.97|       29.08|
|. IOPS      |    28068.12|     7099.90|
|. latency us|    18219.04|    72038.06|
|RND4K Q1 T1 |        6.82|        1.04|
|. IOPS      |     1665.27|      254.40|
|. latency us|      598.00|     3926.30|

4. Multiple OSDs, multiple server benchmarks (Test sequence is 1)（测试顺序是 1 ）

OSD tree

root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2       up   1.00000  1.00000
 3    ssd  0.11719          osd.3       up   1.00000  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4       up   1.00000  1.00000
 5    ssd  0.11719          osd.5       up   1.00000  1.00000

result

tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1527.37|      296.25|
|SEQ1M Q1 T1 |      408.86|      106.43|
|RND4K Q32T16|      189.20|       43.00|
|. IOPS      |    46191.94|    10499.01|
|. latency us|    11068.93|    48709.85|
|RND4K Q1 T1 |        4.99|        0.95|
|. IOPS      |     1219.16|      232.37|
|. latency us|      817.51|     4299.14|

结论

可以看到 ceph 的写入性能（ 106.43MB/s ）与物理磁盘的写入性能（ 1262.63MB/s ）之间的差距是巨大的，甚至 RND4K Q1 T1 的情况下直接变成了机械硬盘
一个或者多个 osd 以及一个或者多个机器对 ceph 的影响并不大（可能是我的集群数量不够）
三个节点构建的 ceph 集群，会导致磁盘 read 性能会下降到原来的一半，write 性能下降到原来的四分之一甚至更多

附录

一些 ssd 基准测试结果

Micron_1 100_MTFDDAK1T0TB SCSI Disk Device

G:\fio>python "E:\Programing\PycharmProjects\fio-cdm\fio-cdm"
tests: 5, size: 1.0GiB, target: G:\fio 228.2GiB/953.8GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |      363.45|      453.54|
|SEQ1M Q1 T1 |      329.47|      404.09|
|RND4K Q32T16|      196.16|      212.42|
|. IOPS      |    47890.44|    51861.48|
|. latency us|    10677.71|     9862.74|
|RND4K Q1 T1 |       20.66|       65.44|
|. IOPS      |     5044.79|    15976.40|
|. latency us|      197.04|       61.07|

SAMSUNG MZALQ512HALU-000L1

root@pve1:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2358.84|     1476.54|
|SEQ1M Q1 T1 |     1702.19|     1291.18|
|RND4K Q32T16|      955.34|     1070.17|
|. IOPS      |   233238.46|   261273.09|
|. latency us|     2193.90|     1957.79|
|RND4K Q1 T1 |       55.04|      229.99|
|. IOPS      |    13437.11|    56149.97|
|. latency us|       73.17|       16.65|

bcache

使用 bcache 加速后的 hdd+ssd 混合磁盘 ceph 架构的测试结果

可以看到 read 有明显提升，但是 write 仍然非常差劲

tests: 5, size: 1.0GiB, target: /mnt/test 104.3MiB/10.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1652.93|      242.41|
|SEQ1M Q1 T1 |      552.91|       81.16|
|RND4K Q32T16|      429.52|       31.95|
|. IOPS      |   104862.76|     7799.72|
|. latency us|     4879.87|    65618.50|
|RND4K Q1 T1 |       13.10|        0.45|
|. IOPS      |     3198.16|      110.09|
|. latency us|      310.07|     9077.11|

即便是一块磁盘上多个 osd 也无法解决 write 问题

详细测试数据： https://www.reddit.com/r/ceph/comments/xnse2j/comment/j6qs57g/?context=3

如果使用 VMware vSAN ，可以很轻松的让 hdd 加速到 ssd 的速度，而且几乎感知不到 hdd 的存在（并未详细对比，我只是凭感觉的）

其他专业的测试报告分析

我分析比较了几个报告，摘要如下

Proxmox-VE_Ceph-Benchmark-201802.pdf

Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf

Dell_R730xd_RedHat_Ceph_Performance_SizingGuide_WhitePaper.pdf

micron_9300_and_red_hat_ceph_reference_architecture.pdf

pve 201802

从报告中得知，测试规模为 6 x Server ，Each server 4 x Samsung SM863 Series, 2.5", 240 GB SSD, SATA-3 (6 Gb/s) MLC.

# Samsung SM863 Series, 2.5", 240 GB SSD
# from https://www.samsung.com/us/business/support/owners/product/sm863-series-240gb/
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ?M Q? T? |      520.00|      485.00|
|RND4K Q? T? |           ?|           ?|
|. IOPS      |    97000.00|    20000.00|

报告结果显示

# 3 Node Cluster/ 4 x Samsung SM863 as OSD per Node
# rados bench 60 write -b 4M -t 16
# rados bench 60 read -t 16 (uses 4M from write)
|Name        |  Read(MB/s)| Write(MB/s)|
# 10 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16|     1064.42|      789.12|
# 100 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16|     3087.82|     1011.63|

可以看到网络带宽对性能的影响是巨大的。虽然 10 Gbit Network 下的性能不足，但是至少读写性能都逼近了带宽极限。然而看看我的测试结果，WRITE 非常糟糕(296.25MB/s)

pve 202009

从报告中得知，测试规模为 3 x Server; Each server 4 x Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR); 1 x 100 GbE DACs, in a full-mesh topology

# Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------| 
|SEQ128KQ32T?|     3500.00|     3100.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3340.00|      840.00| (根据公式估算,throughput ~= iops * 4k / 1000)
|. IOPS      |   835000.00|   210000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------| 
|RND4K Q1 T1 |            |      205.82| (从报告中得知)
|. IOPS      |            |    51000.00| (从报告中得知)
|. latency ms|            |        0.02| (从报告中得知)

报告结果显示

# MULTI-VM WORKLOAD (LINUX)
# 我不理解 Thread 和 Job 有什么区别，文档中也没有标识队列深度
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q? T1 |     7176.00|     2581.00| (SEQUENTIAL BANDWIDTH BY NUMBER OF JOBS)
|RND4K Q1 T1 |       86.00|       28.99| (根据公式估算)
|. IOPS      |    21502.00|     7248.00| (RANDOM IO/S BY NUMBER OF JOBS)

同样的，RND4K Q1 T1 WRITE 测试结果非常糟糕，只有 7k iops，而物理磁盘拥有 51k iops ，这样的差距我感觉是无法接受的。

Dell R730xd report

从报告中得知，测试规模为 5 x Storage Server; Each Server 12HDD+3SSD, 3 x replication 2 x 10GbE NIC

# 从报告中摘抄的测试结果
# Figure 8  Throughput/server comparison by using different configurations
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q64T1 |     1150.00|      300.00|

这样的情况下 SEQ4M Q64T1 测试结果中 write 只有大约 300MB/s ，这大概只是单块 SAS 的两倍，也就是 2 x 158.16 MB/s (4M blocks)。这让我难以置信，它甚至快过于我的 nvme 磁盘。不过另一个重要事实是 12*5=60 块 hdd 只有 300MB/s 的顺序写入速度，这样的性能损耗是不是太大了？

Micron report

从报告中得知，测试规模为 3 x Storage Server ； Each Server 10 x micron 9300MAX 12.8T ，2 x replication ，2 x 100GbE NIC

# micron 9300MAX 12.8T (MTFDHAL12T8TDR-1AT1ZABYY) 物理磁盘测试 
|Name        |  Read(MB/s)| Write(MB/s)| (? 是未给出参数)
|------------|------------|------------|
|SEQ?M Q? T? |    48360.00|           ?| (从报告中摘抄)
|SEQ128KQ32T?|     3500.00|     3500.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3400.00|     1240.00| (根据公式估算)
|. IOPS      |   850000.00|   310000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|. latency us|       86.00|       11.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q? T? |     8397.77|     1908.11| (根据公式估算)
|. IOPS      |  2099444.00|   477029.00| (从报告中摘抄，Executive Summary)
|. latency ms|        1.50|        6.70| (从报告中摘抄，Executive Summary)

在 WRITE 测试结果如下，

# (从报告中摘抄)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|RND4KQ32T100|           ?|           ?|
|. IOPS      |  2099444.00|   477029.00| (不知道是不是官网报告存在问题，这里居然没有任何性能损耗)
|. latency ms|        1.52|        6.71|

不得不说 Micron 官方的测试平台过于高端，不是我们中小型企业负担得起。

从结果中得知，WRITE 接近于单块物理磁盘性能。那么是否说明，如果只使用单个节点单个磁盘，那么 WRITE 性能将会下降到 477k / 30 = 15.9k iops ? 如果是的话，那这将是 sata ssd 的性能。

最后的最后，我想了解的问题是：

如何修复 ceph 中的 write 性能问题？ ceph 能不能做到和 VMware vSAN 同样的性能。
结果中看到全闪存磁盘的性能还不如 hdd+ssd ，那么如果不使用 bcache 的话，要怎么做才修复 ceph 全闪存盘下的性能问题？
对于 hdd+ssd 架构是否还有更好的方案？

14 replies • 2025-12-09 07:13:58 +08:00

litguy

Feb 2, 2023

分布式系统写多副本，天然增加延迟
写入性能不高是原理决定的
你对性能预期太高了
ceph 如果性能要上去，需要拼命堆硬件
即使如此，也只是勉强可以接受

你都说你们是中小企业了
还不如别用分布式存储架构
直接集中式存储解决
HDD + SSD ，弄个 ZFS 的 DIY 系统
不说 iops 更好
至少延迟比 ceph 好看

另外，你先把未来业务说说，看看到底是什么 workload
然后才好评估存储

c1462066778

Feb 2, 2023

@litguy 感谢您这么详细的回答。
workload 就是 k8s 集群，跑日常的 springcloud 项目。
我试用过 VMware vSAN ，他们家的似乎性能挺好，只不过内存开销非常大，就算是 1T 缓存加 4T 存储，就需要消耗 20G 内存，再加上 vCenter 和 NSX 集群 12G(vCenter) + 24G(NSX Manager) * 3 + 12G(NSX Edge) * 3 = 120G ，这个开销太大了

litguy

Feb 3, 2023

@c1462066778 我们平常开发用的机器，都是 512GB RAM 64 core 的，最多带 80TB 后端存储空间，你们是存储和虚拟化软件部署在一台机器上了？ HCI 的模式？

SgtPepper

Feb 3, 2023

等一下，商用 Proxmox 真的大丈夫？出问题咋办 proxmox 有支持能开 case 吗
纯好奇

c1462066778

Feb 3, 2023

@litguy 嗯嗯是的 HCI 。小公司被要求做超融合，VMware 买不起许可证了所以要换方案。我现在是用自己电脑测试环境，如果评估通过的话再去生产环境操作

c1462066778

Feb 3, 2023

@SgtPepper [衰] 这也是没办法的办法了

SgtPepper

Feb 3, 2023

@c1462066778 我靠这坑也太大了。。。

luquan

Aug 8, 2023

遇到相同困扰，请问 ceph 的性能问题有什么方向了吗

luquan

Aug 8, 2023

@c1462066778

c1462066778

Aug 11, 2023

@luquan 你好，我已经换 VMware vSAN 了

mingtdlb

Sep 12, 2023

师傅 ceph 使用 bcache 做缓存，有搭建文章么想看看

daxuxu

Dec 19, 2024

存储机器至少采用万兆互联或者万兆聚合，甚至更高的网卡，这样性能才能提高，我这边集群回填时候 cluster network 会达到 18Gbps+。

keen88

Mar 14, 2025

@daxuxu 您也是用的 CEPH 吗，块还是对象

daxuxu

Dec 9, 2025 via Android

@keen88 块对象，文件都有