尝试降低几乎但不完全整数的数字类的速度开销
我已经实现了一个 C++ 类,其行为与标准 int
类型非常相似。不同之处在于它有一个额外的概念“epsilon”,它表示一些远小于 1 但大于 0 的微小值。一种将其视为具有 32 个 MSB 的非常宽的定点数(整数部分)、32 个 LSB(epsilon 部分)以及中间大量的零。 (注意:这一类和普通定点数之间的一大区别是,有两个符号,而不是一个:“value”和“epsilon”可以彼此独立地为负数,而对于定点,有一个符号表示整个数字。)
下面的类可以工作,但在整个程序中引入了大约 2 倍的速度损失。 (该程序包含与此类无关的代码,因此此类的实际速度损失可能远大于 2 倍。)我无法粘贴使用此类的代码,但我可以说以下内容:
+、-、+=、<、>
和 >=
是唯一频繁使用的运算符。 setEpsilon()
和 getInt()
的使用极为罕见。 *
也很少见,甚至根本不需要考虑 epsilon 值。
这是类:
#include <limits>
struct int32Uepsilon {
typedef int32Uepsilon Self;
int32Uepsilon () { _value = 0;
_eps = 0; }
int32Uepsilon (const int &i) { _value = i;
_eps = 0; }
void setEpsilon() { _eps = 1; }
Self operator+(const Self &rhs) const { Self result = *this;
result._value += rhs._value;
result._eps += rhs._eps;
return result; }
Self operator-(const Self &rhs) const { Self result = *this;
result._value -= rhs._value;
result._eps -= rhs._eps;
return result; }
Self operator-( ) const { Self result = *this;
result._value = -result._value;
result._eps = -result._eps;
return result; }
Self operator*(const Self &rhs) const { return this->getInt() * rhs.getInt(); } // XXX: discards epsilon
bool operator<(const Self &rhs) const { return (_value < rhs._value) ||
(_value == rhs._value && _eps < rhs._eps); }
bool operator>(const Self &rhs) const { return (_value > rhs._value) ||
(_value == rhs._value && _eps > rhs._eps); }
bool operator>=(const Self &rhs) const { return (_value >= rhs._value) ||
(_value == rhs._value && _eps >= rhs._eps); }
Self &operator+=(const Self &rhs) { this->_value += rhs._value;
this->_eps += rhs._eps;
return *this; }
Self &operator-=(const Self &rhs) { this->_value -= rhs._value;
this->_eps -= rhs._eps;
return *this; }
int getInt() const { return(_value); }
private:
int _value;
int _eps;
};
namespace std {
template<>
struct numeric_limits<int32Uepsilon> {
static const bool is_signed = true;
static int max() { return 2147483647; }
}
};
上面的代码可以工作,但速度很慢。有人对如何提高性能有任何想法吗?我可以提供一些可能有用的提示/细节:
- 32 位绝对不足以同时容纳 _value 和 _eps。在实际应用中,最多可达24~28位 使用 _value 并使用最多 20 位的 _eps。
- 我无法测量使用
int32_t
和int64_t
之间的显着性能差异, 所以内存开销本身可能不是这里的问题。 - 对 _eps 进行饱和加法/减法会很酷,但实际上并不是必需的。
- 注意,_value 和 _eps 的符号不一定相同!这打破了我的第一次尝试 加快这堂课的进度。
- 内联汇编没问题,只要它能在运行 Linux 的 Core i7 系统上与 GCC 配合使用!
I have implemented a C++ class which behaves very similarly to the standard int
type. The difference is that it has an additional concept of "epsilon" which represents some tiny value that is much less than 1, but greater than 0. One way to think of it is as a very wide fixed point number with 32 MSBs (the integer parts), 32 LSBs (the epsilon parts) and a huge sea of zeros in between. (Note: A big difference between this class and normal fixed point numbers is that there are two signs, not one: "value" and "epsilon" can be negative independently of each other, whereas for fixed point, there is one sign for the entire number.)
The following class works, but introduces a ~2x speed penalty in the overall program. (The program includes code that has nothing to do with this class, so the actual speed penalty of this class is probably much greater than 2x.) I can't paste the code that is using this class, but I can say the following:
+, -, +=, <, >
and >=
are the only heavily used operators. Use of setEpsilon()
and getInt()
is extremely rare. *
is also rare, and does not even need to consider the epsilon values at all.
Here is the class:
#include <limits>
struct int32Uepsilon {
typedef int32Uepsilon Self;
int32Uepsilon () { _value = 0;
_eps = 0; }
int32Uepsilon (const int &i) { _value = i;
_eps = 0; }
void setEpsilon() { _eps = 1; }
Self operator+(const Self &rhs) const { Self result = *this;
result._value += rhs._value;
result._eps += rhs._eps;
return result; }
Self operator-(const Self &rhs) const { Self result = *this;
result._value -= rhs._value;
result._eps -= rhs._eps;
return result; }
Self operator-( ) const { Self result = *this;
result._value = -result._value;
result._eps = -result._eps;
return result; }
Self operator*(const Self &rhs) const { return this->getInt() * rhs.getInt(); } // XXX: discards epsilon
bool operator<(const Self &rhs) const { return (_value < rhs._value) ||
(_value == rhs._value && _eps < rhs._eps); }
bool operator>(const Self &rhs) const { return (_value > rhs._value) ||
(_value == rhs._value && _eps > rhs._eps); }
bool operator>=(const Self &rhs) const { return (_value >= rhs._value) ||
(_value == rhs._value && _eps >= rhs._eps); }
Self &operator+=(const Self &rhs) { this->_value += rhs._value;
this->_eps += rhs._eps;
return *this; }
Self &operator-=(const Self &rhs) { this->_value -= rhs._value;
this->_eps -= rhs._eps;
return *this; }
int getInt() const { return(_value); }
private:
int _value;
int _eps;
};
namespace std {
template<>
struct numeric_limits<int32Uepsilon> {
static const bool is_signed = true;
static int max() { return 2147483647; }
}
};
The code above works, but it is quite slow. Does anyone have any ideas on how to improve performance? There are a few hints/details I can give that might be helpful:
- 32 bits are definitely insufficient to hold both _value and _eps. In practice, up to 24 ~ 28 bits of
_value are used and up to 20 bits of _eps are used. - I could not measure a significant performance difference between using
int32_t
andint64_t
,
so memory overhead itself is probably not the problem here. - Saturating addition/subtraction on _eps would be cool, but isn't really necessary.
- Note that the signs of _value and _eps are not necessarily the same! This broke my first attempt
at speeding this class up. - Inline assembly is no problem, so long as it works with GCC on a Core i7 system running Linux!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
可以尝试的一件事是根据
operator+=
定义例如operator+
的规范实践:这有助于 返回值优化,它消除了按值返回所需的复制构造函数。
此外,它还减少了代码维护!
One thing to try is the canonical practice of defining e.g.
operator+
in terms ofoperator+=
:This facilitates the return-value optimization, which eliminates the copy constructor that would otherwise be needed for return-by-value.
Also, it reduces code maintenance!
2 倍的速度损失似乎并不是不合理,因为所有操作都完成了两次。
您可以使用 MMX/SSE2 指令将 value 和 epsilon 打包到两个寄存器中,并仅并行执行这两个操作一次。或者,在 64 位架构上,您可以将两个值打包在一个 int64 中,如下所示:
[32 位值][12 个零][20 位 eps]
。比较将通过单个操作自动进行,加法和减法需要将 eps 的结转屏蔽为填充零。使用MMX进行加法和减法(然后自动屏蔽)和普通整数比较进行比较是没有障碍的。顺便说一句,你的operator-=似乎有问题:在
this->_eps -= rhs._eps
中,this->eps
可能会变成负数。难道你不应该调整 eps 并减少该值吗? eps 的溢出行为是什么?它会转化为价值吗?2x speed penalty does not seem unreasonable since all operations are done twice.
You might use MMX/SSE2 instructions to pack value and epsilon in two registers and perform the two operations in parallel only once. Alternatively, on 64-bit architecture you can pack the two values in a single int64, as in:
[32 bits of value][12 zeros][20 bits of eps]
. Comparisons would work automatically with a single operations, addition and subtraction would need to mask out carry over from eps into padding zeros. There's no obstacle to using MMX for addition and subtraction (masking out happens automatically then) and ordinary integer comparison for comparisons.BTW , your operator-= seems to be buggy: in
this->_eps -= rhs._eps
,this->eps
can become negative. Shouldn't you then adjust both eps and decrement the value? What is the overflow behavior of eps? Does it ever carry over into value?我的(部分)解决方案是在您的解决方案中使用一个整数运算,而不是两个,正如Oli Charlesworth在评论中指出的那样。
您可以这样做:使用
int64_t
来存储_eps
和_value
。在下面的示例中,_value
由bit0-to-bit31
表示,_eps
由bit32-to-bit63
表示代码>.如果没有溢出,那么
+
可以高效可靠地完成。这只是一个开始。尝试思考是否可以使用某些位操作可靠地完成其他操作。简单的加法演示。请阅读评论
输出:
如预期。
Ideone 演示:http://www.ideone.com/GjSnJ
而不是
x + y
,您可以使用x | y
甚至更快的操作。My (partial) solution is using one integer operation, instead of two in your solution as pointed out by Oli Charlesworth in the comment.
Here is how you can do: use
int64_t
to store both_eps
and_value
. In my example below,_value
is represent bybit0-to-bit31
and_eps
is represented bybit32-to-bit63
.If there is no overflow, then
+
can be done efficiently and reliably. This is a just a start. Try thinking if other operations can be done reliably using some bit-operations.A simple demonstration of addition. Please read the comment
Output:
as expected.
Demo at Ideone: http://www.ideone.com/GjSnJ
Instead of
x + y
, you can usex | y
which is even faster operation.谢谢大家的意见。最后我使用内联汇编来实现+、-和+=。令人惊讶的是,GCC 无法自行矢量化这些微小函数,尽管使用 __builtin_ia32_paddd() 内置函数有所帮助。最终,整体程序开销(与使用裸
int
相比)从 100% 减少到 50%。查看生成的程序集,结果似乎是最佳的,至少在我看来是这样。据我快速阅读英特尔手册可知,没有任何东西可以帮助进行比较操作。 (有向量比较指令,但它们在这里都没有帮助。)Thanks for the input, everyone. In the end, I used inline assembly to implement +, - and +=. Amazingly, GCC could not vectorize these tiny functions by itself, though using the
__builtin_ia32_paddd()
builtin helped somewhat. Eventually, the overall program overhead (compared to using bareint
) was reduced from 100% to 50%. Looking at the generated assembly, the result seemed to be optimal, at least where I looked. As far as I could tell from a quick read of the Intel manuals, there is nothing to help with the comparison operations. (There are vector comparison instructions, but none of them are helpful here.)