https://github.com/conda-forge/miniforge
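First download the matching Miniforge3 build: OS X arm64 (Apple Silicon).
Once it is installed you have conda, and packages such as numpy and scipy installed through conda are native arm64 builds.
The performance gain is large, whether compared with Rosetta 2 or with an Intel i9.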
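A quick way to confirm that the interpreter from such an install is really a native build, rather than an x86_64 build running under Rosetta 2, is to ask Python which architecture it reports. This is a minimal sketch using only the standard library; the helper names `describe_interpreter` and `looks_native_arm64` are made up for illustration, and the check assumes the interpreter reports `arm64` (macOS) or `aarch64` (Linux) when native.

```python
import platform


def describe_interpreter():
    """Return the architecture and Python version this interpreter reports."""
    return {
        "machine": platform.machine(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
    }


def looks_native_arm64(machine):
    """True if the reported architecture is 64-bit ARM.

    Assumption: 'arm64' is what macOS reports and 'aarch64' is what Linux reports.
    """
    return machine.lower() in ("arm64", "aarch64")


if __name__ == "__main__":
    info = describe_interpreter()
    print(info)
    print("native arm64:", looks_native_arm64(info["machine"]))
```

Run it with the Python from the conda environment you intend to benchmark; a value of `x86_64` on an Apple Silicon machine would suggest a translated interpreter.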
1
pb941129 2020-12-09 15:39:45 +08:00 via iPhone
I'd like to know how much faster it is than the MKL build of numpy on an Intel i9...
|
2
NoobX 2020-12-09 16:42:16 +08:00 via iPhone
But it tops out at 16 GB...
|
3
Goldilocks 2020-12-09 16:45:04 +08:00 via Android
Looking forward to a benchmark; my guess is it gets crushed by AVX-512.
|
4
felixcode 2020-12-09 19:43:51 +08:00 via Android
My VRAM is bigger than your RAM.
|
5
YUX OP @pb941129 @NoobX @Goldilocks @felixcode I found a numpy benchmark script and ran it: https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
```
Dotted two 4096x4096 matrices in 0.53 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.59 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.74 s.

This was obtained using the following Numpy configuration:
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
```
P.S. Python version 3.9.1 (arm64). All background apps were closed for the run.
|
6
pb941129 2020-12-09 19:58:15 +08:00 1
@YUX Thx. These are the results from my 16-inch MBP (i9 model). Background apps were not closed. Environment: anaconda 3.8. It still looks a bit faster than the M1. (Otherwise Intel would really be in tears.)
```
Dotted two 4096x4096 matrices in 0.45 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.53 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
```
|
7
changepc90 2020-12-09 20:12:20 +08:00
M1: Dotted two vectors of length 524288 in 0.25 ms.
MBP16: Dotted two vectors of length 524288 in 0.05 ms.
That one item differs by a lot.
|
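To put numbers on that comparison for every item rather than just the vector dot product, the script's output can be parsed and divided item by item. The following is a rough sketch, not part of the benchmark gist: `parse_results` and `slowdown` are hypothetical helpers, and the regular expression assumes each result has the form "<label> in <number> s" or "<label> in <number> ms", as in the outputs posted above.

```python
import re

# Matches one result such as "SVD of a 2048x1024 matrix in 0.59 s".
RESULT_RE = re.compile(
    r"(?P<label>[A-Z][A-Za-z0-9 ]+?) in (?P<value>\d+(?:\.\d+)?) ?(?P<unit>ms|s)\b"
)


def parse_results(text):
    """Map each benchmark label to its reported time in seconds."""
    times = {}
    for match in RESULT_RE.finditer(text):
        scale = 1e-3 if match.group("unit") == "ms" else 1.0
        times[match.group("label")] = float(match.group("value")) * scale
    return times


def slowdown(run_a, run_b):
    """Per-item ratio time_a / time_b for labels present in both runs."""
    return {label: run_a[label] / run_b[label] for label in run_a if label in run_b}


# Results copied from the M1 and 16-inch MBP i9 posts above.
M1 = (
    "Dotted two 4096x4096 matrices in 0.53 s. "
    "Dotted two vectors of length 524288 in 0.25 ms. "
    "SVD of a 2048x1024 matrix in 0.59 s. "
    "Cholesky decomposition of a 2048x2048 matrix in 0.08 s. "
    "Eigendecomposition of a 2048x2048 matrix in 4.74 s."
)
I9 = (
    "Dotted two 4096x4096 matrices in 0.45 s. "
    "Dotted two vectors of length 524288 in 0.05 ms. "
    "SVD of a 2048x1024 matrix in 0.32 s. "
    "Cholesky decomposition of a 2048x2048 matrix in 0.08 s. "
    "Eigendecomposition of a 2048x2048 matrix in 3.53 s."
)

if __name__ == "__main__":
    for label, ratio in slowdown(parse_results(M1), parse_results(I9)).items():
        print(f"{label}: M1 takes {ratio:.2f}x the i9 time")
```

On these two posts the vector dot product is the outlier, while the Cholesky times are identical, which is worth keeping in mind before reading too much into a single line.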
9
YUX OP @changepc90 That is probably caused by the difference in instruction sets.
|
10
Aspector 2020-12-09 20:19:41 +08:00 1
i7-8550U in a T480s; the library is mkl_rt
Dotted two 4096x4096 matrices in 1.07 s.
Dotted two vectors of length 524288 in 0.13 ms.
SVD of a 2048x1024 matrix in 0.53 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 5.07 s.
HWMonitor shows the 8550U drawing roughly 40-45 W in real time; the M1 is probably only around 20 W (sad).
|
11
YUX OP Sharing results from a friend's 16-inch, 2.6 GHz 6-Core Intel Core i7
Dotted two 4096x4096 matrices in 0.49 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
Eigendecomposition of a 2048x2048 matrix in 3.16 s.
|
13
pb941129 2020-12-09 20:25:33 +08:00 via iPhone
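@YUX I haven't looked at what the task does, but from what I know of numpy's habits it surely can't be that bad. Once lightgbm is ported we can all run the CPU version together (a hyperparameter search for one small project once kept an entire 8700K fully loaded for three hours).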
|
14
rock_cloud 2020-12-09 20:25:53 +08:00 1
2017 iMac 3.4 GHz Intel i5
Dotted two 4096x4096 matrices in 1.04 s.
Dotted two vectors of length 524288 in 0.17 ms.
SVD of a 2048x1024 matrix in 0.58 s.
Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
Eigendecomposition of a 2048x2048 matrix in 5.37 s.
No background apps were closed.
|
16
sxd96 2020-12-09 20:31:25 +08:00 1
2018 13-inch MBP, i5-8259U
Dotted two 4096x4096 matrices in 0.80 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.35 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.
|
17
sxd96 2020-12-09 20:35:06 +08:00
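@sxd96 I feel a little better now. Background apps were not closed here either, and this is with the MKL library. I did notice that the MBP gives off a slight coil whine when the cores are fully loaded. ARM may be slightly behind on this benchmark for now, but measured by performance per watt it is probably not behind at all. For mobile devices I think performance per watt is what matters.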
|
18
Gandum 2020-12-09 20:35:15 +08:00 via iPhone
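It's still a first-generation product. It's winter now, though, so there's no rush; the fans aren't too noisy. I'll buy one next summer.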
|
19
IgniteWhite 2020-12-09 20:35:29 +08:00 via iPhone 1
Haha, I posted about this five months ago: /t/688402
|
20
rock_cloud 2020-12-09 20:36:02 +08:00 1
Intel Xeon Silver 4114 2.2 GHz
Dotted two 4096x4096 matrices in 0.60 s.
Dotted two vectors of length 524288 in 0.04 ms.
SVD of a 2048x1024 matrix in 0.66 s.
Cholesky decomposition of a 2048x2048 matrix in 0.26 s.
Eigendecomposition of a 2048x2048 matrix in 6.67 s.
|
21
YUX OP @IgniteWhite You were way ahead of the curve 😂 It really is a great tool.
|
22
Tilie 2020-12-09 20:54:48 +08:00 1
8th-gen i7 Mac mini
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.
|
23
YUX OP Google Colab - 2 Intel(R) Xeon(R) CPU @ 2.20GHz
Dotted two 4096x4096 matrices in 4.16 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 1.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
Eigendecomposition of a 2048x2048 matrix in 13.11 s.
|
24
zr86 2020-12-09 21:14:01 +08:00
M1 Mac mini
Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.
|
25
wydinhk 2020-12-09 22:21:48 +08:00
M1 MacBook Pro
Dotted two 4096x4096 matrices in 0.68 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.71 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 5.03 s.
Power draw was measured with powermetrics at the same time: about 26 W for the first two items and about 16 W for the last three.
|
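Since several posts raise performance per watt, the power and timing figures quoted in the thread can be combined into a very rough energy estimate (energy = power × time). This is only a back-of-the-envelope sketch: `energy_joules` is a hypothetical helper, the 8550U is assumed to draw a constant 40 W (the low end of the 40-45 W HWMonitor reading posted earlier), and powermetrics and HWMonitor may not measure the same thing, so the two totals are not strictly comparable.

```python
def energy_joules(times_s, watts):
    """Sum time * power over benchmark items (1 W for 1 s is 1 J)."""
    if len(times_s) != len(watts):
        raise ValueError("need one power figure per timing")
    return sum(t * w for t, w in zip(times_s, watts))


# M1 MacBook Pro run from this post; 0.25 ms is written as 0.00025 s.
m1_times = [0.68, 0.00025, 0.71, 0.08, 5.03]
# powermetrics: about 26 W for the first two items, about 16 W for the last three.
m1_watts = [26, 26, 16, 16, 16]

# i7-8550U run from the T480s post; 0.13 ms is written as 0.00013 s.
i7_times = [1.07, 0.00013, 0.53, 0.15, 5.07]
# Assumption: a constant 40 W for the whole run.
i7_watts = [40] * 5

if __name__ == "__main__":
    print(f"M1 MacBook Pro: {energy_joules(m1_times, m1_watts):.1f} J")
    print(f"i7-8550U:       {energy_joules(i7_times, i7_watts):.1f} J")
```

Under those assumptions the M1 run comes out at roughly 111 J against roughly 273 J for the 8550U, which is in line with the performance-per-watt point made above.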
26
lovestudykid 2020-12-10 03:17:17 +08:00
This test doesn't separate machines very well.
MF839: only about twice as slow as the OP's M1
Dotted two 4096x4096 matrices in 2.33 s.
Dotted two vectors of length 524288 in 0.54 ms.
SVD of a 2048x1024 matrix in 1.05 s.
Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
Eigendecomposition of a 2048x2048 matrix in 8.38 s.
Intel(R) Xeon(R) Gold 6134
Dotted two 4096x4096 matrices in 0.32 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.89 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 8.19 s.
The numpy that Anaconda installed by default is not using MKL and does not have AVX-512 enabled, so this CPU is being wasted.
|
27
pubby 2020-12-10 10:01:09 +08:00
3700X Hackintosh
Dotted two 4096x4096 matrices in 0.46 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 7.37 s.
Cholesky decomposition of a 2048x2048 matrix in 0.82 s.
Eigendecomposition of a 2048x2048 matrix in 49.05 s.

This was obtained using the following Numpy configuration:
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.Internal.sdk/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
    NOT AVAILABLE
openblas_info:
    NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
atlas_blas_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE

I'm probably not using it quite right....
|
28
bnuliujing 2020-12-10 10:18:09 +08:00
Results from an i7-6950X
Dotted two 4096x4096 matrices in 0.35 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.27 s.
Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.
|
29
NoobX 2020-12-10 11:05:02 +08:00
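Results from the i5 Mac Mini
Dotted two 4096x4096 matrices in 0.58 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.30 s.
The M1 results aren't that impressive either...
The 16 GB memory cap is still a big problem, though. The system usually eats about 4 GB by itself, which leaves only 12 GB of the 16 GB for the dataset, and honestly that isn't enough for me.
A somewhat slower processor is no big deal, but once swap fills up the speed is a real nightmare.
|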
30
MisakaTian 2020-12-10 11:58:25 +08:00
Speaking as a data grunt: I'll jump in as soon as Anaconda is sorted out.
|
31
Goldilocks 2020-12-10 12:06:11 +08:00
Processor Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz, 3600 Mhz, 4 Core
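Dotted two 4096x4096 matrices in 0.33 s, twice as fast as the M1. But the M1 has 8 cores. So at the same clock speed and the same core count, Intel is still roughly 3-4x faster than the M1, and this is a three-year-old product.
|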
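The "3-4x" figure can be reproduced by comparing core-seconds (wall time multiplied by core count) for the matrix product. This is a sketch of that arithmetic only, with hypothetical helper names. It ignores clock speed, because the thread does not give the M1's clock, and it treats all eight M1 cores as equal even though four of them are efficiency cores, so it says little about what each core can actually do.

```python
def core_seconds(wall_time_s, cores):
    """Wall time scaled by the number of cores assumed to be working."""
    return wall_time_s * cores


def per_core_advantage(time_a, cores_a, time_b, cores_b):
    """How many times more core-seconds machine b spends than machine a."""
    return core_seconds(time_b, cores_b) / core_seconds(time_a, cores_a)


# Xeon W-2123 from this post: 0.33 s on 4 cores.
XEON_TIME, XEON_CORES = 0.33, 4
# M1 matrix-product times posted earlier in the thread, counted as 8 cores.
M1_CORES = 8
m1_times = {"OP's M1": 0.53, "M1 Mac mini": 0.69}

if __name__ == "__main__":
    for name, seconds in m1_times.items():
        ratio = per_core_advantage(XEON_TIME, XEON_CORES, seconds, M1_CORES)
        print(f"{name}: {ratio:.1f}x the Xeon's core-seconds")
```

The two M1 timings give ratios of about 3.2 and 4.2, so the claim follows from the posted numbers only if the core-count scaling is accepted as fair.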
32
YUX OP @MisakaTian Just use mamba.
|
33
Goldilocks 2020-12-10 12:18:45 +08:00
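It's 2020. If Intel released a 2-core 3.6 GHz CPU, you certainly wouldn't think much of its performance. What you should be thinking about is Intel's 10-core and 20-core parts. AMD is about to release 64-core desktop CPUs, and Apple is still stuck at the 2-core level.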
|
34
meloyang05 2020-12-10 13:35:48 +08:00
@Goldilocks
“8th-gen i7 Mac mini
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.
M1 Mac mini
Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.”
Are you selectively ignoring the other results? Timings at the ms level can have large error to begin with, and it is also possible that numpy for M1 currently has a bug. What does singling out the vector result prove?
|
35
Goldilocks 2020-12-10 13:38:09 +08:00
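The error won't be large; it is usually within 1%. That is because matrix multiplication is bounded by just two things:
1. CPU FLOPS
2. Memory bandwidth
|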
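Whether a sub-millisecond timing is stable is easy to check empirically by repeating the measurement and looking at the spread. The sketch below does this for a pure-Python dot product using only the standard library. It is an illustration of the measurement method, not a reproduction of the gist: a BLAS-backed numpy call will behave differently, and `time_repeated` and `summarize` are hypothetical helpers.

```python
import statistics
import time


def dot(xs, ys):
    """Plain-Python dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(xs, ys))


def time_repeated(fn, repeats=20):
    """Call fn repeatedly and return the wall time of each call in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Min, median, max, and (max - min) / median as a relative spread."""
    lo, hi = min(samples), max(samples)
    med = statistics.median(samples)
    spread = (hi - lo) / med if med > 0 else float("inf")
    return {"min": lo, "median": med, "max": hi, "spread": spread}


if __name__ == "__main__":
    n = 50_000  # small enough that one call takes on the order of milliseconds
    xs = [float(i) for i in range(n)]
    ys = [1.0 / (i + 1) for i in range(n)]
    stats = summarize(time_repeated(lambda: dot(xs, ys)))
    print({key: round(value, 6) for key, value in stats.items()})
```

If the relative spread across repeats is far above 1% on a given machine, a single reported timing for a millisecond-scale item should be read with that in mind.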
36
Goldilocks 2020-12-10 13:45:33 +08:00
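Numerical computation such as matrix multiplication is a very mature field that has been studied thoroughly. See: https://en.wikichip.org/wiki/flops
Assuming memory bandwidth can keep up with the CPU, the only ways to run faster are:
1. Add more cores
2. Widen the SIMD units
For example, Skylake can reach 64 FLOPs/cycle, while AMD CPUs of the same generation manage only 16 FLOPs/cycle. Clock speeds are all about the same, so that 4x factor accounts for most of the gap. And a gap like that is very hard to close; you could say there is no hope of it in a lifetime.
|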
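The FLOPs/cycle argument can be turned into numbers: theoretical peak is cores × clock × FLOPs per cycle, and an n×n matrix product costs roughly 2n³ floating-point operations. The sketch below plugs in figures quoted in this thread (the 4-core 3.6 GHz Xeon W-2123 at 0.33 s, the OP's M1 at 0.53 s, and the 64 FLOPs/cycle figure). The function names are hypothetical, and it is an assumption that the 64 FLOPs/cycle figure applies to the precision the benchmark uses and that the CPU holds its rated clock under vector load.

```python
def matmul_flops(n):
    """Approximate operation count for an n x n by n x n product.

    Each of the n*n outputs needs about n multiplies and n adds.
    """
    return 2 * n ** 3


def achieved_gflops(n, seconds):
    """Throughput in GFLOP/s implied by a measured wall time."""
    return matmul_flops(n) / seconds / 1e9


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s: cores * GHz * FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


if __name__ == "__main__":
    n = 4096
    print(f"Xeon W-2123 achieved: {achieved_gflops(n, 0.33):.0f} GFLOP/s")
    print(f"Xeon W-2123 peak (thread's 64 FLOPs/cycle): {peak_gflops(4, 3.6, 64):.0f} GFLOP/s")
    print(f"OP's M1 achieved: {achieved_gflops(n, 0.53):.0f} GFLOP/s")
```

Comparing achieved throughput with the theoretical peak gives a sense of how much of the gap between machines comes from hardware limits and how much from the BLAS library in use.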
37
Harry1993 2020-12-10 14:08:58 +08:00
Tried it with Apple's numpy (https://github.com/apple/tensorflow_macos):
Dotted two 4096x4096 matrices in 0.84 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.54 s.
Cholesky decomposition of a 2048x2048 matrix in 0.06 s.
Eigendecomposition of a 2048x2048 matrix in 6.29 s.
|
38
IgniteWhite 2020-12-10 23:07:30 +08:00
@MisakaTian Isn't miniforge's package manager just conda? The only difference is that the default channel is conda-forge.
|
39
lly0514 2020-12-11 15:35:01 +08:00
@Goldilocks In practice the variation is very large: in my own tests the performance gap between MKL and OpenBLAS is more than a factor of two.
|
40
Richardyyz 2020-12-13 09:58:14 +08:00
@Goldilocks Zen 2 is already at 32 FLOPs/cycle. Is your lifetime really that short? AVX-512, with its heavy clock throttling, doesn't have much of an advantage over Zen 3.
|
41
YUX OP Adding one from a Raspberry Pi 😂
Dotted two 4096x4096 matrices in 10.18 s.
Dotted two vectors of length 524288 in 2.27 ms.
SVD of a 2048x1024 matrix in 6.67 s.
Cholesky decomposition of a 2048x2048 matrix in 0.85 s.
Eigendecomposition of a 2048x2048 matrix in 37.83 s.

This was obtained using the following Numpy configuration:
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/mambaforge/envs/maths/include']
|
42
YRInc 2021-04-23 04:02:49 +08:00
Here is a Chinese-made chip for reference: Kunpeng 920
12-core Kunpeng 920, 24 GB RAM:
-------------------
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:45:16)
Dotted two 4096x4096 matrices in 1.48 s.
Dotted two vectors of length 524288 in 0.49 ms.
SVD of a 2048x1024 matrix in 1.10 s.
Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
Eigendecomposition of a 2048x2048 matrix in 8.36 s.
-------------------
24-core Kunpeng 920, 48 GB RAM:
-------------------
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.48 ms.
SVD of a 2048x1024 matrix in 0.93 s.
Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
Eigendecomposition of a 2048x2048 matrix in 7.66 s.

Same environment as on the M1 Mac, Miniforge3. The relevant acceleration libraries are:
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/miniforge3/include']
|