Vectorized function to count the numbers in an array that are powers of a specified base

I am attempting to vectorize this fairly expensive function (the scalar version is now working!):

template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
    double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));

    return (x - std::trunc(x)) < 0.000001;
}//End of isPower

Here's what I have so far (for 32-bit int only):

 template<typename RETURN_T>
 inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
 {
      RETURN_T cnt = 0;
      
      const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
      const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));

      const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);

      __m256i _count = _mm256_setzero_si256();
      __m256i _N_INT = _mm256_setzero_si256();
      __m256 _N_DBL = _mm256_setzero_ps();
      __m256 LOG_OF_N = _mm256_setzero_ps();
      __m256 DIVIDE_LOG = _mm256_setzero_ps();
      __m256 TRUNCATED = _mm256_setzero_ps();

      __m256 CMP_MASK = _mm256_setzero_ps();
                                                  
      for (size_t i = 0uz; (i + 8uz) < arr.size(); i += 8uz)
      {
           //Set Values
           _N_INT = _mm256_load_si256((__m256i*) &arr[i]);
           _N_DBL = _mm256_cvtepi32_ps(_N_INT);
                            
           LOG_OF_N = _mm256_log_ps(_N_DBL);

           DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);

           TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));

           CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);

           _count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
      }//End for

     cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));

     return cnt;
 }//End of count_powers_of

The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.

The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit one that is much closer now.

Testing:

 int64_t count = 0;
 for (size_t i = 0; i < vec.size(); ++i)
 {              
    if (isPower(vec[i], 4))
    {
        ++count;
    }//End if
}//End for

std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000

std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array

This new, vastly simplified code often produces results that are slightly off from the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead I get a number that's way off (in the billions again).

It could also be this sum function (based off of code by @PeterCordes ):

 inline uint32_t _mm_sum_epi32(__m128i& x)
{
    __m128i hi64 = _mm_unpackhi_epi64(x, x);           
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
}

 inline uint32_t _mm256_sum_epi32(__m256i& v)
{
    __m128i sum128 = _mm_add_epi32(
        _mm256_castsi256_si128(v),
        _mm256_extracti128_si256(v, 1));
    return _mm_sum_epi32(sum128);
}

I know this has got to be a floating-point precision/comparison issue; is there a better way to approach this?

Thanks for all your insights and suggestions thus far.

孤千羽 2025-01-19 14:27:52

A more sensible unit test would be non-random: check all powers in a loop to make sure they're all reported true, e.g. with x *= base;, and count how many powers there are <= n. Then loop over all numbers from 0..n, once each, and verify you get the right total. If both of those checks succeed, it means the function returned false in all the cases it should have; otherwise the count would be wrong.
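
Roughly, such a test could look like this (a sketch with hypothetical names; it assumes base >= 2 so the power loop terminates, and takes the isPower callable to test as a parameter):

#include <cstdint>
#include <cassert>

// Sketch of the exhaustive test described above.
template<typename IsPowerFn>
void test_is_power(IsPowerFn isPower, uint32_t base, uint32_t limit)
{
    // Pass 1: every true power <= limit must be reported as a power.
    uint64_t expected = 0;
    for (uint64_t x = 1; x <= limit; x *= base)
    {
        assert(isPower(static_cast<uint32_t>(x), base));
        ++expected;
    }

    // Pass 2: test every value 0..limit exactly once; the totals must agree,
    // so any false positive (or false negative) shows up as a wrong count.
    uint64_t counted = 0;
    for (uint64_t n = 0; n <= limit; ++n)
        counted += isPower(static_cast<uint32_t>(n), base) ? 1 : 0;

    assert(counted == expected);
}

Run against your isPower with base 4 and limit 1000, the second assert would fail immediately, given that isPower(5, 4) returns true.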


Re: the original version:

This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.

Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.

I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.

I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
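
If you want a ground-truth reference for validating any vectorized attempt, a plain integer check with no FP rounding at all works; a minimal sketch (my illustration; it treats 1 = base^0 as a power, and the base < 2 branch is only there to keep the loop from hanging):

#include <cstdint>

inline bool isPowerExact(uint32_t n, uint32_t base)
{
    if (base < 2)
        return n == base;          // degenerate bases
    if (n == 0)
        return false;
    while (n % base == 0)          // strip factors of base
        n /= base;
    return n == 1;                 // divided all the way down to base^0
}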


You might want to special-case integral N with a power-of-2 pow. That can go vastly faster by checking that n has a single bit set (n & (n-1) == 0) and that the bit is at a valid position (e.g. for pow=4, a power of 2 with n & 0b...10101010 != 0 is not a power of 4). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
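
For pow=4 specifically, a scalar sketch of that bit test (my illustration, using 0xAAAAAAAA as the ...10101010 mask) could be:

#include <cstdint>

// A power of 4 is a power of 2 whose single set bit sits at an even position,
// so the ...10101010 mask must not catch it.
inline bool isPowerOf4(uint32_t n)
{
    return n != 0
        && (n & (n - 1)) == 0          // exactly one bit set, i.e. a power of 2
        && (n & 0xAAAAAAAAu) == 0;     // and that bit is at an even position
}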

Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have exponents up to 31, i.e. 2^31, and still fit in an unsigned int.) log_3(2^31) is about 19, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5, so you'd only need 2 vectors to hold every non-overflowing power.

That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!

Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
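
A rough sketch of that broadcast-and-compare idea for base 4 (my own illustration, not from the question's code; the function name is made up, and the 16 non-overflowing powers 4^0..4^15 fill exactly two __m256i vectors):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <vector>

// Counts elements that are exact powers of 4 by comparing each element
// against all powers of 4 that fit in an int32_t. Lane counters are 32-bit,
// which is fine for arrays up to ~2 billion matches per lane.
inline int64_t count_powers_of_4_exact(const std::vector<int32_t>& arr)
{
    alignas(32) int32_t pows[16];
    int64_t p = 1;
    for (int i = 0; i < 16; ++i) { pows[i] = static_cast<int32_t>(p); p *= 4; }

    const __m256i P0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(pows));
    const __m256i P1 = _mm256_load_si256(reinterpret_cast<const __m256i*>(pows + 8));

    __m256i acc = _mm256_setzero_si256();             // per-lane hit counters
    for (std::size_t i = 0; i < arr.size(); ++i)
    {
        const __m256i n = _mm256_set1_epi32(arr[i]);  // broadcast one element
        // cmpeq lanes are -1 on a match, so subtracting them increments the counter
        acc = _mm256_sub_epi32(acc, _mm256_cmpeq_epi32(n, P0));
        acc = _mm256_sub_epi32(acc, _mm256_cmpeq_epi32(n, P1));
    }

    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int64_t total = 0;
    for (int32_t v : lanes) total += v;               // each element matches at most one power
    return total;
}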


Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:

  • _mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative numbers. Or, if you need to handle negative remainders, you can use the trick compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division (a SIMD sketch of that trick follows below).
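
A SIMD sketch of that correction (my illustration of the compiler's trick, assuming AVX2 and signed 32-bit lanes):

#include <immintrin.h>

// n % 2 for signed lanes, rounding toward zero like C's % operator:
// add the sign bit as a bias, round down to an even number, subtract back.
inline __m256i mod2_epi32(__m256i n)
{
    const __m256i bias = _mm256_srli_epi32(n, 31);        // 1 in negative lanes, else 0
    __m256i t = _mm256_add_epi32(n, bias);                // n + (n < 0)
    t = _mm256_and_si256(t, _mm256_set1_epi32(-2));       // clear the low bit
    return _mm256_sub_epi32(n, t);                        // remainder in {-1, 0, 1}
}
// For known non-negative inputs, _mm256_and_si256(n, _mm256_set1_epi32(1)) is enough.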

Updated version using (x - std::trunc(x)) < 0.000001;

This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
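
To make that concrete with one number (my example): n = 2^30 + 1 is off by one from a power of 2, yet in double precision the fractional part of log(n)/log(2) is only about 1.3e-9, well inside the 0.000001 window, so the updated check would still misclassify it:

#include <cmath>
#include <cstdio>

int main()
{
    const double n = 1073741825.0;              // 2^30 + 1, not a power of 2
    const double x = std::log(n) / std::log(2.0);
    std::printf("%.3e\n", x - std::trunc(x));   // prints roughly 1.3e-09
}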

Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
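
A one-line illustration of that rounding (my example): 2^24 + 1 is the first integer that a float cannot hold exactly.

#include <iostream>

int main()
{
    const float f = 16777217.0f;                 // 2^24 + 1 rounds to 16777216.0f
    std::cout << (f == 16777216.0f) << '\n';     // prints 1
}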

float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.

I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.


__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's your own variable, not something from a standard library header, so it shouldn't be using a leading underscore followed by a capital letter; such names are reserved for the implementation.)

Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.
