Fastest way to crawl recursive NTFS directories in C++
I have written a small crawler to scan and re-sort directory structures.
It is based on dirent (which is a small wrapper around FindNextFileA).
In my first benchmarks it is surprisingly slow:
around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD).
121481 files found in 123473 milliseconds.
Is this speed normal?
This is my code:
#include <dirent.h>   // Windows dirent port (wraps FindFirstFileA/FindNextFileA)
#include <cstdio>
#include <cstring>
#include <string>

// Counts regular files below strDir; strPattern is currently unused.
int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
    struct dirent *ent;
    DIR *dir = opendir(strDir.c_str());
    int retVal = 0;
    if (dir != NULL) {
        while ((ent = readdir(dir)) != NULL) {
            if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
                std::string strFullName = strDir + "\\" + std::string(ent->d_name);
                std::string strType = "N/A";
                // The dirent port used here exposes the underlying WIN32_FIND_DATAA as `data`.
                bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
                strType = (isDir) ? "DIR" : "FILE";
                if (!isDir) {
                    //printf("%s <%s>\n", strFullName.c_str(), strType.c_str());
                    retVal++;
                }
                if (isDir && recurse) {
                    retVal += testPrintDir(strFullName, strPattern, recurse);
                }
            }
        }
        closedir(dir);
        return retVal;
    } else {
        /* could not open directory */
        perror("DIR NOT FOUND!");
        return -1;
    }
}
4 Answers
There are some circumstances where such a speed is normal, yes. First, using FindFirstFileA instead of FindFirstFileW is going to incur overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories that have not yet been accessed by the operating system, you will incur at least one seek penalty (about 16 ms for most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This will get worse if the Master File Table on the given drive is badly fragmented.
Regarding the number of files, it is going to depend more upon the number of files per directory than on the total number of files.
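As a rough sanity check on that bound: at about 16 ms per seek, 1000 ms / 16 ms ≈ 60 previously-unvisited directories per second, so a tree with a few thousand cold directories already accounts for a minute or more of wall time before any per-file work.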
The very first time you ever do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find to visit when you are done. That way, you are likely to take advantage of any immediate read-ahead optimizations NTFS might do.
On subsequent recursive directory crawls, if the metadata for the directories fits in the system cache, it doesn't matter how you do it.
EDIT: clarifying how you should visit directories. It isn't technically a breadth first search.
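A minimal sketch of that visit order, assuming the wide-character Win32 API is used directly rather than the dirent wrapper (the function name countFilesW is illustrative, not from the original code): each directory is enumerated completely, its subdirectories are queued, and the search handle is closed before descending.

#include <windows.h>
#include <cwchar>
#include <string>
#include <vector>

// Sketch: enumerate the whole directory first, queue subdirectories,
// and descend only after the handle has been closed.
unsigned long long countFilesW(const std::wstring &dir) {
    std::vector<std::wstring> subdirs;      // directories queued for later
    unsigned long long files = 0;

    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE)
        return 0;

    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            subdirs.push_back(dir + L"\\" + fd.cFileName);   // visit later
        else
            ++files;
    } while (FindNextFileW(h, &fd));
    FindClose(h);                            // close before recursing

    for (size_t i = 0; i < subdirs.size(); ++i)
        files += countFilesW(subdirs[i]);    // parent fully enumerated first
    return files;
}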
Probably the drive is the bottleneck. But you can try:

- avoiding the per-entry string allocations (for example by building paths in a stack buffer, e.g. with alloca);
- using FindFirstFileW and FindNextFileW directly, which may be a little faster.
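As a sketch of the first tip (under the assumption that the point is to avoid per-entry heap allocations; the helper name and buffer size are illustrative), the search pattern can be written into a reusable stack buffer instead of concatenating std::string objects:

#include <windows.h>
#include <cwchar>

// Illustrative helper: write "<dir>\*" into a caller-provided stack buffer.
// Returns false if the pattern does not fit.
bool makeSearchPattern(const wchar_t *dir, wchar_t *buf, size_t bufLen) {
    return swprintf(buf, bufLen, L"%ls\\*", dir) > 0;  // swprintf returns < 0 on overflow
}

// Usage:
// wchar_t pattern[MAX_PATH];                          // stack buffer, reused per directory
// if (makeSearchPattern(L"C:\\temp", pattern, MAX_PATH)) {
//     WIN32_FIND_DATAW fd;
//     HANDLE h = FindFirstFileW(pattern, &fd);
//     ...
// }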
You're holding DIR handles open during the recursive dive. Instead, keep a local list of the directories you've encountered, and process them outside that loop, after the closedir().
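A sketch of how that could look applied to the function in the question, assuming the same dirent port (with its data member) is kept; the name countFilesDeferred is illustrative:

#include <dirent.h>
#include <cstring>
#include <string>
#include <vector>

// Sketch: same counting logic as testPrintDir, but subdirectories are queued
// and visited only after closedir(), so no DIR handle stays open across the recursion.
int countFilesDeferred(const std::string &strDir) {
    std::vector<std::string> subDirs;   // directories found at this level
    int retVal = 0;

    DIR *dir = opendir(strDir.c_str());
    if (dir == NULL)
        return -1;

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        std::string strFullName = strDir + "\\" + ent->d_name;
        if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            subDirs.push_back(strFullName);   // queue, don't recurse yet
        else
            retVal++;
    }
    closedir(dir);                            // handle released before descending

    for (size_t i = 0; i < subDirs.size(); ++i) {
        int sub = countFilesDeferred(subDirs[i]);
        if (sub > 0)
            retVal += sub;
    }
    return retVal;
}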