以有效的方式按字节比较数据（使用 C++）

发布于 2024-09-16 12:26:26 字数 1198 浏览 10 评论 0原文

有没有比使用比较更有效的按字节比较数据的方法 C++ 列表容器的运算符？

我必须比较[大？ 10kByte <尺寸< 500 kByte] 按字节计算的数据量，用于验证外部存储设备的完整性。

因此，我按字节读取文件并将值存储在无符号字符列表中。该列表的资源由 shared_ptr 处理，以便我可以传递它在程序中，无需担心列表的大小。

typedef boost::shared_ptr< list< unsigned char > > = contentPtr;
namespace boost::filesystem = fs;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent (new list< unsigned char > );       
 // Read the file with a stream, put read values into actual content
return actualContent;

这会执行两次，因为总是有文件的两个副本。必须比较这两个文件的内容，如果发现不匹配则抛出异常

void CompareContent() throw( NotMatchingException() ){
 // this part is very fast, below 50ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2secs with a file size of ~64kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

我的问题是这样的：
随着文件大小的增加，列表的比较变得非常慢。使用大约 100 kByte 的文件大小时，比较内容最多需要两秒钟。随着文件大小的增加和减少......

有没有更有效的方法来进行这种比较？是不是使用的容器有问题？

原文

Is there a more effective way to compare data bytewise than using the comparison
operator of the C++ list container?

I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.

Therefore I read files bytewise and store the values in a list of unsigned chars.
The recources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list

typedef boost::shared_ptr< list< unsigned char > > = contentPtr;
namespace boost::filesystem = fs;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent (new list< unsigned char > );       
 // Read the file with a stream, put read values into actual content
return actualContent;

This is done twice, because there're always two copies of the file.
The content of these two files has to be compared, and throw an exception if a mismatch is found

void CompareContent() throw( NotMatchingException() ){
 // this part is very fast, below 50ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2secs with a file size of ~64kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it will take up to two seconds to compare the content. Increasing and decreasing with the file size....

Is there a more effective way of doing this comparison? Is it a problem of the used container?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

绿萝 2024-09-23 12:26:26

不要使用 std::list 使用 std::vector。

std::list 是一个链表，不保证元素连续存储。

另一方面，std::vector 似乎更适合指定的任务（存储连续字节并比较大块数据）。

如果您必须多次比较多个文件并且不关心差异在哪里，您还可以计算每个文件的哈希值并比较哈希值。这样会更快。

回复收藏 0 原文

醉城メ夜风 2024-09-23 12:26:26

我的第一条建议是分析您的代码。

我这么说的原因是，无论你的比较代码有多慢，我怀疑你的文件 I/O 时间都会使它相形见绌。您不想浪费数天时间来尝试优化只占用 1% 运行时间的代码。

甚至可能是您之前没有注意到的其他原因实际上导致了速度缓慢。在您进行个人资料之前您不会知道。

回复收藏 0 原文

孤云独去闲 2024-09-23 12:26:26

如果对这些文件的内容没有什么可做的（看起来你要让它们在 CompareContent() 作用域末尾被 shared_ptr 删除），为什么不使用迭代器比较文件，而不是创建任何文件容器吗？

这是我的一段代码，它按字节比较两个文件：

// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files

编辑：如果您确实需要保留内容，我认为std::deque可能比更有效code>std::vector 的原因在 GOTW #54 中解释.. 是否——分析会告诉我们。而且，只需要将两个相同文件之一加载到内存中——我将其中一个读入双端队列并与另一个文件的 istreambuf_iterator 进行比较。

If there's nothing else to be done with the contents of those files (looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, not creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files

EDIT: if you do need to keep contents around, I think std::deque might be slightly more efficient than std::vector for the reasons explained in GOTW #54.. or not -- profiling will tell. And still, there would be the need for only one of the two identical files to be loaded in memory -- I'd read one into a deque and compare with the other file's istreambuf_iterator.

回复收藏 0 原文

少年亿悲伤 2024-09-23 12:26:26

当您编写时，您正在比较两个文件的内容。然后你可以利用boost的mapped_files。您确实不需要阅读整个文件。您可以即时读取（以 boost 的优化方式），并在找到第一个不相等的字节时停止...

就像 Cubbi 的答案中非常优雅的解决方案： http://www.cplusplus.com/forum/general/94032/ 请注意，在下面他还添加了一些基准，清楚地表明这是最快的方法。我将重写他的答案并添加零文件大小检查，否则会引发异常，并将测试包含在函数中以从早期返回中受益：

#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>

namespace io = boost::iostreams;
namespace fs = boost::filesystem;

bool files_equal(const std::string& path1, const std::string& path2)
{
    fs::path f1(path1);
    fs::path f2(path2);

    if (fs::file_size(f1) != fs::file_size(f2))
        return false;

    // zero-sized files cannot be opened with mapped_file_source
    // hence we consider all zero-sized files equal
    if (fs::file_size(f1) == 0)
        return true;

    io::mapped_file_source mf1(f1.string());
    io::mapped_file_source mf2(f1.string());
    return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}

int main()
{
    if (files_equal("test.1", "test.2"))
        std::cout << "The files are equal.\n";
    else
        std::cout << "The files are not equal.\n";
}

As you write, you are comparing contents of two files. Then you can make use of boost's mapped_files. You really do not need to read the whole file. You can read on the fly (in an optimized way as boost does) and stop when you find the first unequal byte...

Just like the very elegant solution in Cubbi's answer here: http://www.cplusplus.com/forum/general/94032/ Note that just below he also adds some benchmarks which clearly show this is the fastest way. I will just rewrite a bit his answer and add zero-file size check which throws exception otherwise and enclose the test into a function to benefit from early returns:

#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>

namespace io = boost::iostreams;
namespace fs = boost::filesystem;

bool files_equal(const std::string& path1, const std::string& path2)
{
    fs::path f1(path1);
    fs::path f2(path2);

    if (fs::file_size(f1) != fs::file_size(f2))
        return false;

    // zero-sized files cannot be opened with mapped_file_source
    // hence we consider all zero-sized files equal
    if (fs::file_size(f1) == 0)
        return true;

    io::mapped_file_source mf1(f1.string());
    io::mapped_file_source mf2(f1.string());
    return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}

int main()
{
    if (files_equal("test.1", "test.2"))
        std::cout << "The files are equal.\n";
    else
        std::cout << "The files are not equal.\n";
}

回复收藏 0 原文

淡水深流 2024-09-23 12:26:26

std::list 对于 char 元素来说效率非常低 - 每个元素都有开销来促进 O(1) 插入和删除，这实际上不是您的任务所需要的。

如果您必须使用 STL，那么 std::vector 或建议的迭代器方法都比 std::list 更可取，但为什么不直接将数据读入包含在您选择的智能指针中的 char* 中并使用 memcmp 呢？

回复收藏 0 原文

生来就爱笑 2024-09-23 12:26:26

使用 memcmp 以外的任何东西进行比较都是疯狂的。（除非您希望它更快，在这种情况下您可能想用汇编语言对其进行编码。）

回复收藏 0 原文

凉月流沐 2024-09-23 12:26:26

为了在 memcmp 与 equal 辩论中保持客观性，我提供了以下基准测试程序，以便您可以亲自了解哪个（如果有）在您的系统上更快。它还测试运算符==。在我的系统上（Borland C++ 5.5.1 for Win32）：

std::equal: 1375 个时钟周期
运算符==：1297个时钟周期
memcmp：1297 个时钟周期

您的系统上发生了什么？

#include <algorithm>
#include <vector>
#include <iostream>

using namespace std;

static char* buff ;
static vector<char> v0, v1 ;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a buffer
  buff = new char[BufferSize] ;

  // Create two vectors
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (v0 != v1)
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }

In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):

std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks

What happens on your system?

#include <algorithm>
#include <vector>
#include <iostream>

using namespace std;

static char* buff ;
static vector<char> v0, v1 ;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a buffer
  buff = new char[BufferSize] ;

  // Create two vectors
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (v0 != v1)
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }

回复收藏 0 原文

~没有更多了~