I'm performing comparisons of objects based on the binary presence or absence of a set of features. These features can be represented by a bit string, such as this:
10011
This bit string has the first, fourth and fifth features.
I'm trying to calculate the similarity of a pair of bit strings as the number of features that both share in common. For a given set of bit strings I know that they'll all have the same length, but I don't know at compile time what that length will be.
For example, these two strings have two features in common, so I'd like the similarity function to return 2:
s(10011,10010) = 2
How do I efficiently represent and compare bit-strings in C++?
You can use the std::bitset class from the standard library. Bitsets can be built from bit strings, ANDed together, and asked for the number of 1 bits with count().
EDIT
If the number of bits is unknown at compile time, you can use boost::dynamic_bitset<> instead. The other parts of the approach don't change, since boost::dynamic_bitset<> shares a common interface with std::bitset.
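A minimal sketch of the std::bitset approach, assuming the width is known at compile time (5 here, to match the question's example). The names similarity and similarity5 are made up for illustration and are not part of any library:

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Number of features present in both bit sets: AND them, then count the 1 bits.
template <std::size_t N>
std::size_t similarity(const std::bitset<N>& a, const std::bitset<N>& b) {
    return (a & b).count();
}

// Hypothetical convenience wrapper for the 5-feature example from the question.
// The width is a template argument, so it has to be fixed at compile time.
// std::bitset's string constructor throws std::invalid_argument if the string
// contains anything other than '0' and '1'.
std::size_t similarity5(const std::string& a, const std::string& b) {
    return similarity(std::bitset<5>(a), std::bitset<5>(b));
}
```

Calling similarity5("10011", "10010") gives 2. With boost::dynamic_bitset<> the body of similarity should look the same, since it also provides operator& and count(), but that variant is not shown here.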
As you don't know the bit length at compile time, you can use boost::dynamic_bitset instead of std::bitset.
You can then use operator& (or &=) to find the common bits, and count them using boost::dynamic_bitset::count().
Performance depends on your compiler and target. For maximum speed you may have to implement the loop yourself, e.g. using @Nawaz's method, something from Bit Twiddling Hacks, or a loop written with assembler or compiler intrinsics for SSE, popcount, etc.
Note that at least LLVM, GCC and ICC detect many patterns of this sort and optimize them for you, so profile and check the generated code before doing any manual work.
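If you do end up writing the loop yourself, one possible shape is to pack the bits into 64-bit words and count the set bits of each ANDed word. This is only a sketch: pack_bits and similarity_packed are hypothetical names, and std::bitset<64>::count() stands in for a popcount intrinsic. Whether the compiler lowers it to a single popcount instruction depends on the compiler and target flags, so check the generated code.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: pack a string of '0'/'1' characters into 64-bit words.
// The length is only known at run time; unused high bits of the last word stay 0.
std::vector<std::uint64_t> pack_bits(const std::string& bits) {
    std::vector<std::uint64_t> words((bits.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            words[i / 64] |= std::uint64_t(1) << (i % 64);
        }
    }
    return words;
}

// Count the features shared by two packed bit strings of equal length.
std::size_t similarity_packed(const std::vector<std::uint64_t>& a,
                              const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());  // both inputs must come from equal-length strings
    std::size_t shared = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // AND keeps only the bits set in both words; count() tallies them.
        shared += std::bitset<64>(a[i] & b[i]).count();
    }
    return shared;
}
```

For the question's example, similarity_packed(pack_bits("10011"), pack_bits("10010")) returns 2. Packing once and comparing many times is where this layout pays off, since each comparison touches one word per 64 features.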
Faster algorithm:
Output:
Demonstration at ideone: http://www.ideone.com/bE4qb
Use a std::bitset. If your set of features fits in the number of bits in an unsigned long, you can get an unsigned long representation of the bits with to_ulong(), AND the two values, and use bit-twiddling tricks to count the set bits.
If you want to continue to use strings to represent your bit patterns, you could instead walk the two strings in lockstep, for example with the zip_iterator from Boost, and count the positions where both have a 1.
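A rough sketch of both routes, under some assumptions. The function names are invented for illustration. The bit-counting loop is the classic "clear the lowest set bit" trick, which may or may not be the one originally linked. std::inner_product from the standard library is used as a stand-in for Boost's zip_iterator to pair up the characters of the two strings:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>

// Route 1: convert both bitsets to unsigned long, AND them, count the set bits.
// to_ulong() throws std::overflow_error if the value does not fit in unsigned long.
template <std::size_t N>
unsigned count_shared_ulong(const std::bitset<N>& a, const std::bitset<N>& b) {
    unsigned long common = a.to_ulong() & b.to_ulong();
    unsigned count = 0;
    while (common != 0) {
        common &= common - 1;  // clears the lowest set bit
        ++count;
    }
    return count;
}

// Route 2: keep the strings and walk them in lockstep.
// Assumes both strings have the same length, as stated in the question.
std::size_t count_shared_strings(const std::string& a, const std::string& b) {
    return std::inner_product(
        a.begin(), a.end(), b.begin(), std::size_t(0),
        std::plus<std::size_t>(),  // accumulate the per-position results
        [](char x, char y) { return std::size_t(x == '1' && y == '1'); });
}
```

count_shared_strings("10011", "10010") returns 2, and so does count_shared_ulong on the equivalent std::bitset<5> values. The string version needs no conversion step but looks at one character per feature, so it is likely to be slower than word-wise AND for long feature sets.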