Fastest way to crawl recursive NTFS directories in C++

Posted 2024-09-05 20:09:02

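I have written a small crawler to scan and re-sort directory structures.

It is based on dirent (a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow:

around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD).
121481 files found in 123473 milliseconds.
Is this speed normal?

This is my code: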

#include <cstdio>    // perror, printf
#include <cstring>   // strcmp
#include <string>
#include <dirent.h>  // the Windows dirent wrapper used here (provides ent->data)

int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
  struct dirent *ent;
  DIR *dir;
  dir = opendir(strDir.c_str());
  int retVal = 0;
  if (dir != NULL) {
    while ((ent = readdir(dir)) != NULL) {
      // skip the "." and ".." pseudo-entries
      if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
        std::string strFullName = strDir + "\\" + std::string(ent->d_name);
        std::string strType = "N/A";
        bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
        strType = (isDir) ? "DIR" : "FILE";
        if (!isDir) {
          //printf("%s <%s>\n", strFullName.c_str(), strType.c_str());
          retVal++;  // only files are counted
        }
        if (isDir && recurse) {
          retVal += testPrintDir(strFullName, strPattern, recurse);
        }
      }
    }
    closedir(dir);
    return retVal;
  } else {
    /* could not open directory */
    perror("DIR NOT FOUND!");
    return -1;
  }
}

百合的盛世恋 2024-09-12 20:09:02

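Yes, there are circumstances in which such a speed is normal. First, using FindFirstFileA instead of FindFirstFileW incurs overhead for the conversion from UTF-16 to ANSI. Second, if you are walking directories that the operating system has not accessed yet, you pay at least one seek penalty per directory (about 16 ms on most consumer hard drives), which limits enumeration to well under 100 directory checks per second. It gets worse if the Master File Table on the drive is badly fragmented.

As for the number of files, the time depends more on the number of files per directory than on the total number of files.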
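
To put numbers on the seek-bound estimate, here is a small sketch of the arithmetic. The 16 ms figure comes from the answer above; the function names are made up for illustration, and the model (one seek per uncached directory and nothing else) is a deliberate simplification.

```cpp
#include <cassert>

// Upper bound on uncached directory visits per second when every visit
// costs at least one disk seek of seekMs milliseconds.
double maxDirVisitsPerSecond(double seekMs) {
  assert(seekMs > 0.0);
  return 1000.0 / seekMs;
}

// Number of seeks of length seekMs that fit into elapsedMs milliseconds.
double seeksInElapsed(double elapsedMs, double seekMs) {
  assert(seekMs > 0.0);
  return elapsedMs / seekMs;
}
```

With 16 ms per seek, `maxDirVisitsPerSecond(16.0)` gives 62.5, and `seeksInElapsed(123473.0, 16.0)` gives about 7717, so the measured run time is roughly what 7700 uncached seeks would cost.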

生生漫 2024-09-12 20:09:02

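The first time you do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find, visiting them only once the current one is finished. That way you are likely to benefit from any immediate read-ahead optimizations NTFS might perform.

On subsequent recursive crawls, if the directory metadata fits in the system cache, it does not matter how you do it.

EDIT: Clarified how you should visit directories. Technically this is not a breadth-first search.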

不美如何 2024-09-12 20:09:02

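The drive is probably the bottleneck. But you can try the following:

1. String operations can be optimized: use char arrays instead of std::string.
2. There is no need to build strFullName on every recursive call. Use a single fixed char buffer (e.g. a static array inside the function) and modify it in place.
3. Don't pass strPattern by value!
4. Don't create strType until you are debugging.
5. Others have suggested building a list of directories to process before going deeper into the recursion. To build it, I suggest a single static array (similar to 2.) or the stack (alloca).
6. Does the file system store file names in Unicode? If so, using Unicode strings with FindFirstFileW and FindNextFileW may be a little faster.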