Implementing a divide-and-conquer strategy to apply a transformation to a large matrix
I want to apply Arnold's cat map to my matrix. Here is my implementation:
```python
import numpy as np

def cat_mapping(matrix, MAX):
    width, height = matrix.shape
    transformed_matrix = np.empty([width, height]).astype(np.uint8)
    counter = 1
    x = np.repeat(np.arange(width).reshape(-1, 1), height, axis=-1).T
    y = np.repeat(np.arange(height).reshape(-1, 1), width, axis=-1)
    nx = (2 * x + y) % width
    ny = (x + y) % height
    while counter <= MAX:
        transformed_matrix[ny, nx] = matrix[y, x]
        matrix = transformed_matrix
        if counter != MAX:
            transformed_matrix = np.empty([width, height])
        counter = counter + 1
    return transformed_matrix
```
This works perfectly. But when the size of the array increases beyond 10000, with a bigger iteration value MAX, the implementation becomes really slow. Even when I use numba, the result is not satisfactory.
I was wondering: could the transformation be broken into smaller parts and the results combined, the way divide and conquer does?
Update
@JeromeRichard helped make it faster using numba, which is nice. But would it become even faster if we somehow managed to implement a divide-and-conquer paradigm? I tried it with some demo data like this:
```python
def split(matrix):
    row, col = matrix.shape
    row2, col2 = row // 2, col // 2
    return (matrix[:row2, :col2], matrix[:row2, col2:],
            matrix[row2:, :col2], matrix[row2:, col2:])

main = np.arange(1000 * 1000).reshape(1000, 1000)
a, b, c, d = split(main)
a = cat_mapping_fast(a, 100)
b = cat_mapping_fast(b, 100)
c = cat_mapping_fast(c, 100)
d = cat_mapping_fast(d, 100)
np.vstack((np.hstack((a, b)), np.hstack((c, d))))
```
But I couldn't take the recursion any deeper, because I don't know how to merge the quadrants back together. Any solution or hint will be appreciated.
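The merge problem can be made concrete on a tiny example: the map reduces indices modulo the full width and height, so applying it independently to each quadrant uses the wrong moduli, and the stitched result disagrees with the global transform. Here is a small illustration (`cat_once` is a hypothetical single-iteration helper mirroring the question's vectorized indexing, not part of the original post):

```python
import numpy as np

def cat_once(m):
    # one iteration of the question's vectorized mapping
    width, height = m.shape
    out = np.empty_like(m)
    x = np.repeat(np.arange(width).reshape(-1, 1), height, axis=-1).T
    y = np.repeat(np.arange(height).reshape(-1, 1), width, axis=-1)
    out[(x + y) % height, (2 * x + y) % width] = m[y, x]
    return out

m = np.arange(16).reshape(4, 4)
full = cat_once(m)                      # map applied to the whole matrix
q = [cat_once(b) for b in (m[:2, :2], m[:2, 2:], m[2:, :2], m[2:, 2:])]
stitched = np.vstack((np.hstack(q[:2]), np.hstack(q[2:])))
# Each quadrant reduces modulo 2 instead of 4, so the results disagree:
print(np.array_equal(full, stitched))  # False
```

Both arrays are permutations of the same entries, but they are different permutations, which is exactly why a simple quadrant-wise recursion has no obvious merge step.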
The current code is quite slow because matrix[y, x] creates a new temporary array, and transformed_matrix[ny, nx] = matrix[y, x] is pretty slow since it needs to read nx and ny from the memory hierarchy, and the memory access pattern is not efficient. When the matrix is big, the code should be memory bound, and the unneeded memory operations become quite expensive. Note that the np.empty([width, height]) array contains double-precision floating-point numbers that take 8 times more space than np.uint8, and so it is 8 times slower to fill in memory.

You can speed up the code a lot using Numba, basic loops and double buffering, so as to avoid creating many temporary arrays and reading big ones. The idea is to compute the indices (ny, nx) on the fly within the loops. Since the modulus is quite expensive and the memory access pattern causes the code to be more latency bound, multiple threads are used so as to better saturate the memory bandwidth. Here is the resulting code:

This code is significantly faster than the initial one, especially when the two buffers fit in the CPU cache. When the input matrix is so huge that it does not fit in the cache, the inefficient memory access pattern makes the code a bit slower, but there is not much to be done, since the computation behaves like a big shuffle which is not cache-friendly. Still, on a 4096x4096 matrix with MAX=20, the Numba code is 25 times faster on my 6-core machine (about 0.38 seconds versus 10 seconds for the initial code).
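The answer's code block did not survive in this copy. A minimal sketch of the approach it describes (basic loops, double buffering, indices computed on the fly, parallel threads via Numba) might look like the following; the name cat_mapping_fast matches the question's update, but the exact loop structure is an assumption, and a square matrix is assumed as in the demos:

```python
import numpy as np

# Fall back to plain Python if Numba is not installed; with Numba,
# the loops are JIT-compiled and the y loop runs across threads.
try:
    import numba as nb
    njit = nb.njit(parallel=True)
    prange = nb.prange
except ImportError:
    njit = lambda f: f
    prange = range

@njit
def cat_mapping_fast(matrix, MAX):
    height, width = matrix.shape
    src = matrix.copy()            # double buffering: read from src,
    dst = np.empty_like(matrix)    # write into dst, then swap
    for _ in range(MAX):
        for y in prange(height):
            for x in range(width):
                # indices computed on the fly: no temporary index arrays
                nx = (2 * x + y) % width
                ny = (x + y) % height
                dst[ny, nx] = src[y, x]
        src, dst = dst, src        # swap the buffers instead of reallocating
    return src
```

The map is a bijection (the matrix [[2, 1], [1, 1]] has determinant 1), so parallel iterations never write to the same destination cell, and the swap means only two fixed-size buffers are ever touched.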