Implementing a divide-and-conquer strategy to apply a transformation to a large matrix

Posted on 2025-01-23 12:02:38


I want to apply Arnold's cat map to my matrix. Here is my implementation:

import numpy as np

def cat_mapping(matrix, MAX):
    width, height = matrix.shape
    transformed_matrix = np.empty([width, height]).astype(np.uint8)
    counter = 1
    x = np.repeat(np.arange(width).reshape(-1, 1), height, axis=-1).T
    y = np.repeat(np.arange(height).reshape(-1, 1), width, axis=-1)
    nx = (2 * x + y) % width
    ny = (x + y) % height
    while counter <= MAX:
        transformed_matrix[ny, nx] = matrix[y, x]
        matrix = transformed_matrix
        if counter != MAX:
            transformed_matrix = np.empty([width, height])
        counter = counter + 1
    return transformed_matrix

This works perfectly. But when the size of the array increases beyond 10000, with a bigger iteration value MAX, this implementation becomes really slow. Even when I use Numba, the result is not satisfactory.
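As a quick sanity check of the claim that it works (this snippet is my own illustration, not part of the original question): the index map used above, (x, y) → ((2x + y) mod N, (x + y) mod N), is a permutation of a square grid, and on a 4x4 grid it happens to return to the identity after 3 iterations:

```python
import numpy as np

# Sanity check: apply the same index mapping as cat_mapping three times
# on a 4x4 grid and verify we get the original matrix back.
N = 4
m = np.arange(N * N, dtype=np.uint8).reshape(N, N)

# Same index construction as in cat_mapping: x[i, j] = j, y[i, j] = i.
x = np.repeat(np.arange(N).reshape(-1, 1), N, axis=-1).T
y = np.repeat(np.arange(N).reshape(-1, 1), N, axis=-1)
nx = (2 * x + y) % N
ny = (x + y) % N

cur = m
for _ in range(3):
    nxt = np.empty_like(cur)
    nxt[ny, nx] = cur[y, x]  # element at (y, x) moves to (ny, nx)
    cur = nxt

print(np.array_equal(cur, m))  # True: three applications give the identity
```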

I was wondering: could the transformation be broken into smaller parts and the results combined, the way divide and conquer does?

Update

@JeromeRichard helped to make it faster using Numba, which is nice. But I think: would it become even faster if we somehow managed to implement the divide-and-conquer (DC) paradigm? I tried to implement it with some demo data like this:

def split(matrix):
    row, col = matrix.shape
    row2, col2 = row//2, col//2
    return matrix[:row2, :col2], matrix[:row2, col2:], matrix[row2:, :col2], matrix[row2:, col2:]

main = np.arange(1000*1000).reshape(1000,1000)
a,b,c,d = split(main)
a = cat_mapping_fast(a,100)
b = cat_mapping_fast(b,100)
c = cat_mapping_fast(c,100)
d = cat_mapping_fast(d,100)
np.vstack((np.hstack((a, b)), np.hstack((c, d))))

But I couldn't come up with a deeper recursion, because I am stuck on the question "How to merge them?".
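To see why the merge step is problematic, here is a small check (my own illustration, not from the post): a single application of the global cat map moves elements across quadrant boundaries, so the quadrants are not closed under the map, and transforming them independently cannot reproduce the global result no matter how they are stitched back together:

```python
import numpy as np

# Count how many elements of the top-left quadrant are sent outside that
# quadrant by one application of the global cat map on an 8x8 grid.
N = 8
x = np.repeat(np.arange(N).reshape(-1, 1), N, axis=-1).T
y = np.repeat(np.arange(N).reshape(-1, 1), N, axis=-1)
nx = (2 * x + y) % N
ny = (x + y) % N

src_in_tl = (y < N // 2) & (x < N // 2)    # starts in top-left quadrant
dst_in_tl = (ny < N // 2) & (nx < N // 2)  # lands in top-left quadrant
crossing = np.count_nonzero(src_in_tl & ~dst_in_tl)
print(crossing > 0)  # True: some elements leave the quadrant
```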

Any solution or hint will be appreciated.


Answer by 空心, 2025-01-30 12:02:38:


The current code is quite slow because matrix[y, x] creates a new temporary array, and transformed_matrix[ny, nx] = matrix[y, x] is pretty slow since it needs to read nx and ny from the memory hierarchy and the memory access pattern is not efficient. When the matrix is big, the code should be memory-bound, and the unneeded memory operations become pretty expensive. Note that the np.empty([width, height]) array contains double-precision floating-point numbers, which take 8 times more space than np.uint8, so it is 8 times slower to fill in memory.
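As a quick illustration of the dtype point above (my own snippet, not part of the answer): np.empty without an explicit dtype allocates float64, i.e. 8 bytes per element versus 1 byte for np.uint8:

```python
import numpy as np

# np.empty defaults to float64 (8 bytes/element); np.uint8 uses 1 byte/element.
a = np.empty([4, 4])                  # default dtype: float64
b = np.empty([4, 4], dtype=np.uint8)  # explicit 1-byte dtype
print(a.dtype, a.nbytes)  # float64 128
print(b.dtype, b.nbytes)  # uint8 16
```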

You can speed up the code a lot using Numba, basic loops and double buffering, so as to avoid creating many temporary arrays and reading big ones. The idea is to compute the indices (ny, nx) on the fly within the loops. Since the modulus is quite expensive and the memory access pattern causes the code to be rather latency-bound, multiple threads are used so as to better saturate the memory. Here is the resulting code:

import numpy as np
import numba as nb

@nb.njit('uint8[:,::1](uint8[:,::1], int_)', parallel=True)
def cat_mapping_fast(matrix, MAX):
    height, width = matrix.shape
    # Double buffering: two preallocated uint8 buffers are swapped each
    # iteration instead of allocating a new array every time.
    buff1 = np.empty((height, width), dtype=np.uint8)
    buff2 = np.empty((height, width), dtype=np.uint8)
    buff1[:,:] = matrix
    for counter in range(1, MAX+1):
        for y in nb.prange(height):
            for x in range(width):
                # Compute the target indices on the fly: no temporary index arrays.
                nx = (2 * x + y) % width
                ny = (x + y) % height
                buff2[ny, nx] = buff1[y, x]
        buff1, buff2 = buff2, buff1
    return buff1

This code is significantly faster than the initial one, especially when the two buffers fit in the CPU cache. When the input matrix is so huge that it does not fit in the cache, the inefficient memory access pattern makes the code a bit slower, but there is not much to be done since the computation behaves like a big shuffle, which is not cache-friendly. Still, on a 4096x4096 matrix with MAX=20, the Numba code is 25 times faster on my 6-core machine (about 0.38 seconds compared to 10 seconds for the initial code).
