如何在C++中快速比较变长位串?

发布于 2024-10-14 01:03:40 字数 302 浏览 3 评论 0 原文

我正在根据一组特征的二进制存在或不存在来执行对象比较。这些特征可以用比特串来表示,例如:

10011

该比特串具有第一、第四和第五特征。

我试图计算一对位串的相似度作为两者共有的特征数量。对于给定的一组位字符串,我知道它们都具有相同的长度,但我不知道在编译时该长度是多少。

例如,这两个字符串有两个共同特征,因此我希望相似度函数返回 2:

s(10011,10010) = 2

如何在 C++ 中有效地表示和比较位字符串?

I'm performing comparisons of objects based on the binary presence or absence of a set of features. These features can be represented by a bit string, such as this:

10011

This bitstring has the first, fourth and fifth feature.

I'm trying to calculate the similarity of a pair of bit strings as the number of features that both share in common. For a given set of bit strings I know that they'll all have the same length, but I don't know at compile time what that length will be.

For example, these two strings have two features in common, so I'd like the similarity function to return 2:

s(10011,10010) = 2

How do I efficiently represent and compare bit-strings in C++?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

半葬歌 2024-10-21 01:03:40

您可以使用 std::bitset STL 类。

它们可以从位串构建,进行与运算,并计算 1 的数量:

#include <string>
#include <bitset>

int main()
{
  std::bitset<5> option1(std::string("10011")), option2(std::string("10010"));
  std::bitset<5> and_bit = option1 & option2; //bitset will have 1s only on common options
  size_t s = and_bit.count ();                //return the number of 1 in the bitfield
  return 0;
}

编辑

如果在编译时位数未知,您可以使用 boost::dynamic_bitset

boost::dynamic_bitset<> option(bit_string);

示例的其他部分不变,因为 boost::dynamic_bitsetstd::bitset 共享一个公共接口。

You can use the std::bitset STL class.

They can be built from bit strings, ANDed, and count the number of 1:

#include <string>
#include <bitset>

int main()
{
  std::bitset<5> option1(std::string("10011")), option2(std::string("10010"));
  std::bitset<5> and_bit = option1 & option2; //bitset will have 1s only on common options
  size_t s = and_bit.count ();                //return the number of 1 in the bitfield
  return 0;
}

EDIT

If number of bits is unknown at compile time, you can use boost::dynamic_bitset<>:

boost::dynamic_bitset<> option(bit_string);

Other parts of example don't change, since boost::dynamic_bitset<> share a common interface with std::bitset.

纸短情长 2024-10-21 01:03:40

由于您在编译时不知道位长度,因此可以使用 boost::dynamic_bitset 而不是 std::bitset

然后,您可以使用 operator&(或 &=)查找公共位,并使用 boost::dynamic_bitset::count() 对它们进行计数代码>.

性能取决于。为了获得最大速度,根据您的编译器,您可能必须自己实现循环,例如使用 @Nawaz 的方法,或来自 Bit Twiddling Hacks,或者使用 sse/popcount/etc 的汇编器/编译器内在函数编写循环。

请注意,至少 llvm、gcc 和 icc 会检测到许多此类模式并为您进行优化,因此在进行手动工作之前分析/检查生成的代码。

As you don't know the bit length at compile time, you can use boost::dynamic_bitset instead of std::bitset.

You can then use operator& (or &=) to find the common bits, and count them using boost::dynamic_bitset::count().

The performance depends. For max speed, depending on your compiler, you may have to implement the loop yourself, e.g. using @Nawaz's method, or something from Bit Twiddling Hacks, or by writing the loop using assembler/compiler intrinsics for sse/popcount/etc.

Notice that at least llvm, gcc and icc detect many patterns of this sort and optimize the thing for you, so profile/check the generated code before doing manual work.

所谓喜欢 2024-10-21 01:03:40

更快的算法:

int similarity(unsigned int a, unsigned int b)
{
   unsigned int r = a & b;
   r = ( r & 0x55555555 ) + ((r >> 1) & 0x55555555 );
   r = ( r & 0x33333333 ) + ((r >> 2) & 0x33333333 );
   r = ( r & 0x0f0f0f0f ) + ((r >> 4) & 0x0f0f0f0f );
   r = ( r & 0x00ff00ff ) + ((r >> 8) & 0x00ff00ff );
   r = ( r & 0x0000ffff ) + ((r >>16) & 0x0000ffff );
   return r;
}

int main() {
        unsigned int a = 19 ;//10011
        unsigned int b = 18 ;//10010
        cout << similarity(a,b) << endl; 
        return 0;
}

输出:

2

ideone 上的演示:http://www.ideone.com/bE4qb

Faster algorithm:

int similarity(unsigned int a, unsigned int b)
{
   unsigned int r = a & b;
   r = ( r & 0x55555555 ) + ((r >> 1) & 0x55555555 );
   r = ( r & 0x33333333 ) + ((r >> 2) & 0x33333333 );
   r = ( r & 0x0f0f0f0f ) + ((r >> 4) & 0x0f0f0f0f );
   r = ( r & 0x00ff00ff ) + ((r >> 8) & 0x00ff00ff );
   r = ( r & 0x0000ffff ) + ((r >>16) & 0x0000ffff );
   return r;
}

int main() {
        unsigned int a = 19 ;//10011
        unsigned int b = 18 ;//10010
        cout << similarity(a,b) << endl; 
        return 0;
}

Output:

2

Demonstration at ideone : http://www.ideone.com/bE4qb

陈独秀 2024-10-21 01:03:40

使用std::bitset,如果你的特征集小于long中的位数(我认为它是long),你可以获得位的unsigned long表示,然后< em>和两个值,并使用此处 来计数。


如果您想继续使用字符串来表示位模式,您可以使用 boost 中的 zip_iterator 执行类似以下操作。

#include <iostream>
#include <string>
#include <algorithm>

#include <boost/tuple/tuple.hpp>
#include <boost/iterator/zip_iterator.hpp>

struct check_is_set :
  public std::unary_function<const boost::tuple<char const&, char const&>&, bool>
{
  bool operator()(const boost::tuple<char const&, char const&>& t) const
  {
    const char& cv1 = boost::get<0>(t);
    const char& cv2 = boost::get<1>(t);
    return cv1 == char('1') && cv1 == cv2;
  }
};

size_t count_same(std::string const& opt1, std::string const& opt2)
{
  std::string::const_iterator beg1 = opt1.begin();
  std::string::const_iterator beg2 = opt2.begin();

  // need the same number of items for end (this really is daft, you get a runtime
  // error if the sizes are different otherwise!! I think it's a bug in the
  // zip_iterator implementation...)
  size_t end_s = std::min(opt1.size(), opt2.size());
  std::string::const_iterator end1 = opt1.begin() + end_s;
  std::string::const_iterator end2 = opt2.begin() + end_s;

  return std::count_if(
  boost::make_zip_iterator(
    boost::make_tuple(beg1, beg2)
    ),
  boost::make_zip_iterator(
    boost::make_tuple(end1, end2)
    ),
    check_is_set()
  );
}

int main(void)
{
  std::string opt1("1010111");
  std::string opt2("001101");

  std::cout << "same: " << count_same(opt1, opt2) << std::endl;

  return 0;
}

Use a std::bitset, if your set of features is less than the number of bits in a long (I think it's a long), you can get an unsigned long representation of the bits, then and the two values, and use bit twidling tricks from here to count.


If you want to continue to use strings to represent your bit pattern, you could do something like the following, using the zip_iterator from boost.

#include <iostream>
#include <string>
#include <algorithm>

#include <boost/tuple/tuple.hpp>
#include <boost/iterator/zip_iterator.hpp>

struct check_is_set :
  public std::unary_function<const boost::tuple<char const&, char const&>&, bool>
{
  bool operator()(const boost::tuple<char const&, char const&>& t) const
  {
    const char& cv1 = boost::get<0>(t);
    const char& cv2 = boost::get<1>(t);
    return cv1 == char('1') && cv1 == cv2;
  }
};

size_t count_same(std::string const& opt1, std::string const& opt2)
{
  std::string::const_iterator beg1 = opt1.begin();
  std::string::const_iterator beg2 = opt2.begin();

  // need the same number of items for end (this really is daft, you get a runtime
  // error if the sizes are different otherwise!! I think it's a bug in the
  // zip_iterator implementation...)
  size_t end_s = std::min(opt1.size(), opt2.size());
  std::string::const_iterator end1 = opt1.begin() + end_s;
  std::string::const_iterator end2 = opt2.begin() + end_s;

  return std::count_if(
  boost::make_zip_iterator(
    boost::make_tuple(beg1, beg2)
    ),
  boost::make_zip_iterator(
    boost::make_tuple(end1, end2)
    ),
    check_is_set()
  );
}

int main(void)
{
  std::string opt1("1010111");
  std::string opt2("001101");

  std::cout << "same: " << count_same(opt1, opt2) << std::endl;

  return 0;
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文