[Discussion] Why does epoll_wait hit a performance bottleneck?


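I ran the tests below with perf, the benchmark tool that ships with Linux.

The machine has 28 cores.
-m: each reader thread uses its own epfd
-t: number of reader threads
-r: how long to run the benchmark (2 seconds here to finish faster; the default is 8 seconds)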

```
./perf bench epoll wait -m -t 1 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23298]: 1 threads monitoring on 64 file-descriptors for 2 secs.

[thread  0] fdmap: 0x27e0260 ... 0x27e035c [ 525069 ops/sec ]

Averaged 525069 operations/sec (+- 0.00%), total secs = 2
```

```
./perf bench epoll wait -m -t 2 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23312]: 2 threads monitoring on 64 file-descriptors for 2 secs.

[thread  0] fdmap: 0x1bec260 ... 0x1bec35c [ 463399 ops/sec ]
[thread  1] fdmap: 0x1bec5f0 ... 0x1bec6ec [ 463392 ops/sec ]

Averaged 463395 operations/sec (+- 0.00%), total secs = 2
```

```
./perf bench epoll wait -m -t 4 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23319]: 4 threads monitoring on 64 file-descriptors for 2 secs.

[thread  0] fdmap: 0x22ea260 ... 0x22ea35c [ 208576 ops/sec ]
[thread  1] fdmap: 0x22ea5f0 ... 0x22ea6ec [ 208576 ops/sec ]
[thread  2] fdmap: 0x22ea980 ... 0x22eaa7c [ 208576 ops/sec ]
[thread  3] fdmap: 0x22ead10 ... 0x22eae0c [ 208565 ops/sec ]

Averaged 208573 operations/sec (+- 0.00%), total secs = 2
```

```
./perf bench epoll wait -m -t 8 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23328]: 8 threads monitoring on 64 file-descriptors for 2 secs.

[thread  0] fdmap: 0x1832370 ... 0x183246c [ 150848 ops/sec ]
[thread  1] fdmap: 0x1832700 ... 0x18327fc [ 150848 ops/sec ]
[thread  2] fdmap: 0x1832a90 ... 0x1832b8c [ 150848 ops/sec ]
[thread  3] fdmap: 0x1832e20 ... 0x1832f1c [ 150848 ops/sec ]
[thread  4] fdmap: 0x18331b0 ... 0x18332ac [ 150848 ops/sec ]
[thread  5] fdmap: 0x1833540 ... 0x183363c [ 150844 ops/sec ]
[thread  6] fdmap: 0x18338d0 ... 0x18339cc [ 150824 ops/sec ]
[thread  7] fdmap: 0x1833c60 ... 0x1833d5c [ 150816 ops/sec ]

Averaged 150840 operations/sec (+- 0.00%), total secs = 2
```

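As you can see, the number of epoll_wait calls completed per second keeps dropping. The only parameter that changes is the thread count. Each reader thread calls epoll_wait on 64 fds, and a single writer thread writes to all fds of all reader threads. The printed result is the number of epoll_wait calls each reader thread completes per second.

In principle, each reader thread has its own epfd and the threads share no resources, so why does throughput (epoll calls per second) drop?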

16 replies • 2022-05-17 00:39:50 +08:00

wslzy007

May 16, 2022

What is the point of this? epoll_wait is a system call, so in principle the fewer calls the better; otherwise sys time will be very high.

pwrliang

May 16, 2022

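@wslzy007 I'm doing research on networking and testing the peak performance of kernel-space communication methods. I've found a bottleneck but don't know what causes it.
A system call costs roughly 200 nanoseconds (measured with `./perf bench syscall basic`), and that cost does not grow as the number of calling processes increases.
But epoll_wait seems to scale very poorly: performance drops sharply well before the thread count reaches the number of CPU cores.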

LeeReamond

May 16, 2022 via Android

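Under this kind of load, sys time should be above 50%, so the test is practically meaningless; in production, an unoptimized service would never face this kind of pressure head-on. As for synchronization, your claim that there are no shared resources is just your own assumption. The system does not allocate separate epoll management resources for each process, which means the kernel has to work out which interrupt it is dealing with and which fd that interrupt is bound to, and data structures such as the red-black tree inevitably need synchronization. In other words, however the software is implemented, at the hardware level the northbridge has to block for a moment, so it is understandable that the actual processing capacity falls far behind ipc.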

wslzy007

May 16, 2022

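In fact, if you are after an extremely high-performance Linux server architecture, epoll_wait per thread is only one approach. Combining lock-free algorithms to implement multi-epoll handles vs. thread-pool seems to work better in practice. You can also take full advantage of epoll's support for the leader pattern...
Years ago I implemented a high-performance server framework for Linux, and it is currently used in the SG tool. If you have good hardware, you can run a performance test directly with SG: ./proxy_server -i100000 -o10000 -w24 -x28080 (options: i: maximum inbound connections, o: maximum outbound connections, w: maximum threads, x: port on which to start a simple http-server)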

wslzy007

May 16, 2022

For SG, see: https://github.com/lazy-luo/smarGate

statumer

May 16, 2022

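epoll is positioned as high-performance network IO. If you need high-performance communication between kernel threads and user processes, you should use Linux's UIO framework.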

wslzy007

May 16, 2022

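In my humble opinion, the best way to pursue high performance is to reduce sys time. There are many spinlocks in the system, and under high concurrency they burn CPU for nothing.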

wslzy007

May 16, 2022

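Kernel threads are indeed optimal, but in practice it is also fine as long as you have a good algorithm that avoids excessive thread switching.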

pwrliang

May 16, 2022

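@LeeReamond Thanks for the reply.
1. I checked, and sys time on each core is around 20
2. epoll_create creates a separate eventpoll in the kernel. The epoll objects this benchmark creates are isolated, and threads do not share the same epfd
3. The number of epfds and reader threads has not reached the core count yet. Softirqs should be handled by a per-core process called ksoftirqd, so the more cores there are, the higher the irq handling capacity should be
4. The red-black tree is used to manage multiple fds and is only accessed in epoll_ctl
5. In this benchmark's implementation, epoll monitors eventfds. This kind of fd is a counter in the kernel, so nothing goes through the network. What do you mean by the northbridge?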

pwrliang

May 16, 2022

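@wslzy007 Hi, what I don't understand is why each thread's processing capacity decreases as the number of epoll threads increases.
For example, suppose 1 thread can complete 100,000 epoll calls per second; then 4 threads should reach 400,000 (as long as the core count is not exceeded), but that is not what the test shows. It feels to me as if Linux's epoll implementation has a global lock, which caps overall performance.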
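
The gap between that expectation and the measurements can be put into numbers. The sketch below (an illustration, not from the original post) takes the per-thread averages reported by perf at the top of the thread and divides each by the single-thread rate, so 1.0 would mean perfect linear scaling. The function name is hypothetical.

```python
# Per-thread averages (ops/sec) copied from the perf output in the first post.
measured = {1: 525069, 2: 463395, 4: 208573, 8: 150840}


def scaling_efficiency(per_thread_ops, baseline_threads=1):
    """Ratio of each per-thread rate to the baseline per-thread rate."""
    base = per_thread_ops[baseline_threads]
    return {n: ops / base for n, ops in per_thread_ops.items()}


efficiency = scaling_efficiency(measured)
for n, eff in efficiency.items():
    print(f"{n} threads: {measured[n]} ops/sec per thread, efficiency {eff:.2f}")
```

With linear scaling every line would show an efficiency close to 1.00; the measured per-thread rate instead falls well below the single-thread rate as threads are added.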

wslzy007

May 16, 2022

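@pwrliang You need to consider the overall system load. Interrupts, system calls, protocol stack processing, event callbacks and so on all compete for the CPU, so CPU switching overhead is unavoidable. Since most of the kernel objects behind fd resources are global, lock contention overhead rises as the count and the concurrency grow. When load testing, sampling with tools such as perf or OProfile will usually show where the bottleneck is.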

zizon

May 16, 2022

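"A single writer thread writes to all fds of all reader threads"

Could this be the problem?
I did a quick sum for a few of the runs, and the total ops does increase.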
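
That sum can be reproduced as follows (an illustration added here, using the per-thread figures from the perf output in the first post):

```python
# Per-thread ops/sec for each run, copied from the perf output in the first post.
per_thread = {
    1: [525069],
    2: [463399, 463392],
    4: [208576, 208576, 208576, 208565],
    8: [150848, 150848, 150848, 150848, 150848, 150844, 150824, 150816],
}

# Aggregate throughput across all reader threads in each run.
totals = {n: sum(rates) for n, rates in per_thread.items()}
for n, total in totals.items():
    print(f"{n} reader threads: {total} ops/sec in total")
```

The total roughly doubles from 1 to 2 threads and is highest at 8 threads, although the 4-thread run dips below the 2-thread one. So the aggregate does not collapse the way the per-thread figure does, which fits the idea that the single writer, rather than epoll_wait itself, may be the limiting factor.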

cnbatch

May 16, 2022

Want to squeeze out the maximum network performance? Then consider io_uring.

heiher

May 16, 2022

{ dpdk, raw socket packet mmap, xdp } + busyloop

LeeReamond

May 16, 2022 via Android

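@pwrliang You haven't understood what I meant by user resources and management resources. After the kernel partitions the fd resources through sys_epoll_create, it also has to create the corresponding eventpoll structure to manage them and register it in the red-black tree, and that process cannot run concurrently across threads without locks. In addition, once a softirq fires, the kernel has to call sproc to copy the events to user space and unbind them in the kernel, which likewise requires synchronization between threads. So however the synchronization code is implemented, the synchronization requirement will ultimately cause the northbridge to block at the hardware level.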

pwrliang

May 17, 2022

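@zizon Case closed, that was indeed the problem. The benchmark shipped with Linux is flawed: a single writer cannot keep this many readers fed, so the readers end up waiting, which causes frequent context switches. [I modified the benchmark](https://github.com/pwrliang/linux/blob/master/tools/perf/bench/epoll-wait.c), and now every thread reaches a throughput of around 600,000 ops/sec.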
```
./perf bench epoll wait -t 14 -r 2 -w 14 -m
# Running 'epoll/wait' benchmark:
Run summary [PID 5905]: 14 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x1aec590 ... 0x1aec68c [ 648960 ops/sec ]
[thread 1] fdmap: 0x1aec920 ... 0x1aeca1c [ 640540 ops/sec ]
[thread 2] fdmap: 0x1aeccb0 ... 0x1aecdac [ 635712 ops/sec ]
[thread 3] fdmap: 0x1aed040 ... 0x1aed13c [ 650944 ops/sec ]
[thread 4] fdmap: 0x1aed3d0 ... 0x1aed4cc [ 638048 ops/sec ]
[thread 5] fdmap: 0x1aed760 ... 0x1aed85c [ 648064 ops/sec ]
[thread 6] fdmap: 0x1aedaf0 ... 0x1aedbec [ 632416 ops/sec ]
[thread 7] fdmap: 0x1aede80 ... 0x1aedf7c [ 647144 ops/sec ]
[thread 8] fdmap: 0x1aee210 ... 0x1aee30c [ 628896 ops/sec ]
[thread 9] fdmap: 0x1aee5a0 ... 0x1aee69c [ 634521 ops/sec ]
[thread 10] fdmap: 0x1aee930 ... 0x1aeea2c [ 640032 ops/sec ]
[thread 11] fdmap: 0x1aeecc0 ... 0x1aeedbc [ 626093 ops/sec ]
[thread 12] fdmap: 0x1aef050 ... 0x1aef14c [ 645843 ops/sec ]
[thread 13] fdmap: 0x1aef3e0 ... 0x1aef4dc [ 643849 ops/sec ]

Averaged 640075 operations/sec (+- 0.33%), total secs = 2
```