Mac M1 abnormally slow

Posted on 2025-02-13 14:07:44


I have conducted a simple speed test for my numpy:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

%timeit A.dot(B)

The result is:

30.3 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This result seems abnormally slow compared with what others typically see (less than 10 ms on average). I'm wondering what could possibly be the cause of such behavior.

My system is macOS Big Sur on an M1 chip. The Python version is 3.8.13 and the NumPy version is 1.22.4. NumPy was installed via

pip install "numpy==1.22.4"

The output of np.show_config() is:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42
    not found = AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
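
One clue worth noting: the SIMD extensions listed above (SSE/AVX families) are all x86-only instruction sets, which suggests this NumPy build is an x86_64 wheel running under Rosetta 2 emulation rather than a native arm64 build. A quick diagnostic sketch (not from the original post) to check which architecture the interpreter itself reports:

```python
import platform
import sys

# On Apple Silicon: 'arm64' means a native interpreter;
# 'x86_64' means the process is running under Rosetta 2 emulation,
# which would explain a significant slowdown for BLAS-heavy code.
print(platform.machine())
print(sys.version)
```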

Edit:

I did another test with this code snippet (from 1):

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

The result of my test is:

mean of 10 runs: 6.17438s

whereas the reference results on the website 1 are: (the chip is M1 Max)

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

From the results, the timing of my code is slower compared with any of the numpy versions in the reference.

Comments (2)

楠木可依 2025-02-20 14:07:44


I've noticed similar slowdowns on M1, but I think the actual cause, at least on my computer, is not a fundamentally faulty Numpy installation, but some problem with the benchmarks themselves. Consider the following example:

In [25]: from scipy import linalg

In [26]: a = np.random.randn(1000,100)

In [27]: %timeit a.T @ a
226 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: x = a.T @ a

In [29]: %timeit linalg.eigh(x)
1.69 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit linalg.eigh(a.T @ a)
428 ms ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Computing x = a.T @ a; eigh(x) takes about 2 ms, while eigh(a.T @ a) takes about 400 ms. I think in the latter case there's some problem with %timeit. Maybe for some reason the computation gets routed to the "efficiency cores"?

My tentative answer is that your first benchmark with %timeit is not reliable.
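
A manual timing loop with `time.perf_counter` can cross-check whether the anomaly lies in `%timeit` itself. The sketch below uses `numpy.linalg.eigh` rather than SciPy's so it is self-contained, with the same shapes as in the session above:

```python
import time
import numpy as np

a = np.random.randn(1000, 100)
x = a.T @ a  # precomputed Gram matrix
reps = 20

# Time eigh on the precomputed matrix
start = time.perf_counter()
for _ in range(reps):
    np.linalg.eigh(x)
per_call_precomputed = (time.perf_counter() - start) / reps

# Time eigh with the matrix product done inline
start = time.perf_counter()
for _ in range(reps):
    np.linalg.eigh(a.T @ a)
per_call_inline = (time.perf_counter() - start) / reps

print(f"eigh(x):       {per_call_precomputed * 1e3:.2f} ms per call")
print(f"eigh(a.T @ a): {per_call_inline * 1e3:.2f} ms per call")
```

If the two per-call times are close here but wildly different under `%timeit`, that points at the benchmark harness rather than the NumPy build.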

美人骨 2025-02-20 14:07:44


If you suspect an issue with timeit, try using time instead:

import time

start = time.time()

# your numpy test here

took = time.time() - start
print(f"Test took {took} seconds.")

For more information on numpy on Apple silicon, please read the first answer in the link below. For optimal performance, it is advised to use Apple's accelerated vecLib. If you install using conda, then also check out @AndrejHribernik's comment:
Why Python native on M1 Max is greatly slower than Python on old Intel i5?
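
For reference, one commonly suggested route to a vecLib/Accelerate-backed NumPy is conda-forge's BLAS variant mechanism. The package selector below is an assumption based on that mechanism; verify it against the current conda-forge documentation before relying on it:

```shell
# Install NumPy with the Accelerate BLAS variant from conda-forge
conda install -c conda-forge numpy "libblas=*=*accelerate"

# Afterwards, np.show_config() should report an 'accelerate'-flavoured
# BLAS rather than 'openblas'.
```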
