Is there a workaround for Java's poor performance when walking huge directories?

Posted on 2024-07-10 05:24:45

I am trying to process files one at a time that are stored over a network. Reading the files is fast thanks to buffering, so that is not the issue. The problem is just listing the contents of a folder. I have at least 10k files per folder, across many folders.

Performance is very slow because File.list() returns an array instead of an iterable: Java goes off, collects all the names in a folder, and packs them into an array before returning.

The bug entry for this is https://bugs.java.com/bugdatabase/view_bug;jsessionid=db7fcf25bcce13541c4289edeb4?bug_id=4285834 and it doesn't offer a workaround. It just says this has been fixed for JDK7.

A few questions:

  1. Does anybody have a workaround to this performance bottleneck?
  2. Am I trying to achieve the impossible? Is performance still going to be poor even if it just iterates over the directories?
  3. Could I use the beta JDK7 builds that have this functionality without having to build my entire project on it?
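For reference, the streaming directory API that eventually shipped in Java 7 is java.nio.file.DirectoryStream. Below is a minimal sketch, assuming Java 7 or later and assuming that this is the fix the bug entry refers to. The class name LazyListing and the processFile callback are placeholders, not part of the original question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: iterates a directory without building an array first.
class LazyListing {

    // Placeholder for whatever per-file work the application does.
    static void processFile(Path file) {
        System.out.println("processing " + file.getFileName());
    }

    // Walks the entries of one directory and returns how many were visited.
    static int forEachEntry(Path dir) throws IOException {
        int count = 0;
        // Entries are read as the loop advances (the implementation may
        // buffer ahead); the stream must be closed when done.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                processFile(entry);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(forEachEntry(dir) + " entries");
    }
}
```

Whether this actually helps over a network share depends on where the time goes; if the protocol round trips dominate, iterating lazily may only let processing start sooner rather than finish sooner.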

Comments (10)

千里故人稀 2024-07-17 05:24:45

Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.

If you need to do it within the app, you could just use Runtime.exec(), but it would create some nastiness.

You asked. The first form is going to be blazingly fast, the second should be pretty fast as well.

Be sure to use the one-item-per-line (bare, no decoration, no graphics), full-path, and recursive options of whichever command you choose.
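The first form (a listing written to a file before the application starts) could be consumed roughly like this. It is only a sketch: the listing is assumed to contain one full path per line, for example as produced by dir /s /b, and ListingReader and processFile are made-up names.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical consumer for a pre-generated listing, one path per line.
class ListingReader {

    // Placeholder for the real per-file work.
    static void processFile(String path) {
        System.out.println("processing " + path);
    }

    // Reads the listing line by line, so only one path is held at a time.
    // Returns the number of paths handed to processFile.
    static int processListing(String listingFile) throws IOException {
        int count = 0;
        // FileReader uses the platform default charset (an assumption here).
        BufferedReader in = new BufferedReader(new FileReader(listingFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines in the listing
                }
                processFile(line);
                count++;
            }
        } finally {
            in.close();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ListingReader <listing-file>");
            return;
        }
        System.out.println(processListing(args[0]) + " files processed");
    }
}
```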

EDIT:

30 minutes just to get a directory listing, wow.

It just struck me that if you use exec(), you can have its stdout redirected into a pipe instead of writing it to a file.

If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.

The interaction may actually slow things down, but maybe not--you might give it a try.

Wow, I just went to find the syntax of the exec command for you and came across this, possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing): good link in the Wayback Machine (Jörg provided it in a comment to replace the one from Sun that Oracle broke).

Anyway, the idea is straightforward, but getting the code right is annoying. I'll go steal some code from the internet and hack it up--brb

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Note: Only use this as a last resort!  It's specific to Windows and even
 * at that it's not a good solution, but it should be fast.
 * 
 * To use it, extend FileProcessor and call processFiles("...") with a list
 * of options if you want them, like /s... I highly recommend /b
 * 
 * Override processFile and it will be called once for each line of output.
 */
public abstract class FileProcessor
{
   public void processFiles(String dirOptions)
   {
      Process theProcess;

      // run "dir" through the Windows command interpreter
      try
      {
          theProcess = Runtime.getRuntime().exec("cmd /c dir " + dirOptions);
      }
      catch(IOException e)
      {
         System.err.println("Error on exec() method");
         e.printStackTrace();
         return; // nothing to read if the process did not start
      }

      // read from the called program's standard output stream,
      // handing each line to processFile as soon as it arrives
      BufferedReader inStream = null;
      try
      {
         inStream = new BufferedReader(
                                new InputStreamReader( theProcess.getInputStream() ));
         String line;
         while ((line = inStream.readLine()) != null)
         {
            processFile(line);
         }
      }
      catch(IOException e)
      {
         System.err.println("Error on inStream.readLine()");
         e.printStackTrace();
      }
      finally
      {
         if (inStream != null)
         {
            try { inStream.close(); } catch(IOException ignored) { }
         }
      }

   } // end method

   /** Override this method--it will be called once for each line of output */
   public abstract void processFile(String filename);

} // end class

And thank you, code donor at IBM.

就是爱搞怪 2024-07-17 05:24:45

How about using the File.list(FilenameFilter filter) method and implementing FilenameFilter.accept(File dir, String name) to process each file and return false?

I ran this on a Linux VM against a directory with 10K+ files and it took less than 10 seconds.

import java.io.File;  
import java.io.FilenameFilter;

public class Temp {
    private static void processFile(File dir, String name) {
        File file = new File(dir, name);
        System.out.println("processing file " + file.getName());
    }

    private static void forEachFile(File dir) {
        String [] ignore = dir.list(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                processFile(dir, name);
                return false;
            }
        });
    }

    public static void main(String[] args) {
        long before, after;
        File dot = new File(".");
        before = System.currentTimeMillis();
        forEachFile(dot);
        after = System.currentTimeMillis();
        System.out.println("after call, delta is " + (after - before));
    }  
}

白云不回头 2024-07-17 05:24:45

An alternative is to have the files served over a different protocol. As I understand it, you're using SMB, and Java is just trying to list them as regular files.

The problem here might not be Java alone (how does it behave when you open that directory, x:\shared, with Windows Explorer?). In my experience that also takes a considerable amount of time.

You can change the protocol to something like HTTP, only to fetch the file names. This way you can retrieve the list of files over HTTP (10k lines shouldn't be too much) and let the server deal with the file listing. This should be very fast, since it runs with local resources (those on the server).

Then, when you have the list, you can process the files one by one, exactly the way you're doing right now.

The key point is to have a helper mechanism on the other side of the connection.

Is this feasible?

Today:

File [] content = new File("X:\\remote\\dir").listFiles();

for ( File f : content ) {
    process( f );
}

Proposed:

String [] content = fetchViaHttpTheListNameOf("x:\\remote\\dir");

for ( String fileName : content ) {
    process( new File( fileName ) );
}

The HTTP server could be a very small and simple file.

If this is the way you have it right now, what you're doing is fetching all the information for 10k files to your client machine (I don't know how much of that information) when you only need the file names for later processing.

If the processing is very fast right now, it may be slowed down a bit, because the prefetched information is no longer available.

Give it a try.
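To make the proposal a little more concrete, here is a rough sketch of both halves using only JDK classes: a tiny server that lists a local directory as plain text, and a client-side fetchViaHttpTheListNameOf that reads it back. The class name HttpListing, the /list path, and the decision to pass a URL rather than a drive path are all assumptions made for illustration; com.sun.net.httpserver is the small HTTP server bundled with the JDK since Java 6.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the server lists a local directory, the client fetches the names.
class HttpListing {

    // Server side: runs on the machine that owns the files and answers
    // GET /list with one file name per line.
    static HttpServer startServer(final File dir, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/list", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                StringBuilder body = new StringBuilder();
                String[] names = dir.list(); // local disk access on the server
                if (names != null) {
                    for (String name : names) {
                        body.append(name).append('\n');
                    }
                }
                byte[] bytes = body.toString().getBytes("UTF-8");
                exchange.sendResponseHeaders(200, bytes.length);
                OutputStream out = exchange.getResponseBody();
                out.write(bytes);
                out.close();
            }
        });
        server.start();
        return server;
    }

    // Client side: fetches the listing and returns the names, one per line.
    static List<String> fetchViaHttpTheListNameOf(String listUrl) throws IOException {
        List<String> names = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(listUrl).toURL().openStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    names.add(line);
                }
            }
        } finally {
            in.close();
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer(new File("."), 0); // port 0 = any free port
        int port = server.getAddress().getPort();
        for (String name : fetchViaHttpTheListNameOf("http://localhost:" + port + "/list")) {
            System.out.println(name);
        }
        server.stop(0);
    }
}
```

In a real setup the server half would run on the file server and the client half in the application; they are in one class here only to keep the sketch self-contained.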

情愿 2024-07-17 05:24:45

A non-portable solution would be to make native calls to the operating system and stream the results.

For Linux

You can look at something like readdir. You can walk the directory structure like a linked list and return results in batches or individually.

For Windows

On Windows, the behavior would be fairly similar using the FindFirstFile and FindNextFile APIs.

π浅易 2024-07-17 05:24:45

I doubt the problem is related to the bug report you referenced.
The issue there is "only" memory usage, but not necessarily speed.
If you have enough memory the bug is not relevant for your problem.

You should measure whether your problem is memory related or not. Turn on your Garbage Collector log and use for example gcviewer to analyze your memory usage.

I suspect that it has to do with the SMB protocol causing the problem.
You can try to write a test in another language and see if it's faster, or you can try to get the list of filenames through some other method, such as described here in another post.
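As a starting point for that measurement, a small probe along these lines could time the listing call and show roughly how much heap it costs. It is a sketch under stated assumptions: ListingProbe is a made-up name, and heap differences read from Runtime are only approximate because garbage collection can run at any moment.

```java
import java.io.File;

// Hypothetical probe: times File.list() and reports the approximate heap growth.
class ListingProbe {

    // Returns the number of names listed, or -1 if the path could not be listed.
    static int probe(File dir) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();

        String[] names = dir.list(); // the call under suspicion

        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        long heapAfter = rt.totalMemory() - rt.freeMemory();

        int count = (names == null) ? -1 : names.length;
        System.out.println("entries:   " + count);
        System.out.println("elapsed:   " + elapsedMs + " ms");
        System.out.println("heap diff: " + (heapAfter - heapBefore) / 1024
                + " KiB (approximate)");
        return count;
    }

    public static void main(String[] args) {
        probe(new File(args.length > 0 ? args[0] : "."));
    }
}
```

If the elapsed time is long while the heap difference stays small, memory is probably not the bottleneck. Running the JVM with -verbose:gc alongside this gives the garbage collector log mentioned above.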

星星的轨迹 2024-07-17 05:24:45

If you eventually need to process all files, then having an Iterable instead of a String[] won't give you any advantage, as you'll still have to fetch the whole list of files.

自演自醉 2024-07-17 05:24:45

If you're on Java 1.5 or 1.6, shelling out "dir" commands and parsing the standard output stream on Windows is a perfectly acceptable approach. I've used this approach in the past for processing network drives and it has generally been a lot faster than waiting for the native java.io.File listFiles() method to return.

Of course, a JNI call should be faster and potentially safer than shelling out "dir" commands. The following JNI code can be used to retrieve a list of files/directories using the Windows API. This function can easily be refactored into a new class so the caller can retrieve file paths incrementally (i.e. get one path at a time). For example, you can refactor the code so that FindFirstFileW is called in a constructor and a separate method calls FindNextFileW.

JNIEXPORT jstring JNICALL Java_javaxt_io_File_GetFiles(JNIEnv *env, jclass, jstring directory)
{
    HANDLE hFind = INVALID_HANDLE_VALUE; // initialized so the catch block never closes garbage
    try {

      //Convert jstring to wstring
        const jchar *_directory = env->GetStringChars(directory, 0);
        jsize x = env->GetStringLength(directory);
        wstring path;  //L"C:\\temp\\*";
        path.assign(_directory, _directory + x);
        env->ReleaseStringChars(directory, _directory);

        if (x<2){
            jclass exceptionClass = env->FindClass("java/lang/Exception");
            env->ThrowNew(exceptionClass, "Invalid path, less than 2 characters long.");
            return NULL; // the pending Java exception is only raised once the native method returns
        }

        wstringstream ss;
        BOOL bContinue = TRUE;
        WIN32_FIND_DATAW data;
        hFind = FindFirstFileW(path.c_str(), &data);
        if (INVALID_HANDLE_VALUE == hFind){
            jclass exceptionClass = env->FindClass("java/lang/Exception");
            env->ThrowNew(exceptionClass, "FindFirstFileW returned invalid handle.");
            return NULL; // do not fall through into the loop with an invalid handle
        }


        //HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
        //DWORD dwBytesWritten;


        // If we have no error, loop thru the files in this dir
        while (hFind && bContinue){

          /*
          //Debug Print Statement. DO NOT DELETE! cout and wcout do not print unicode correctly.
            WriteConsole(hStdOut, data.cFileName, (DWORD)_tcslen(data.cFileName), &dwBytesWritten, NULL);
            WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);
            */

          //Check if this entry is a directory
            if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY){
                // Make sure this dir is not . or ..
                if (wstring(data.cFileName) != L"." &&
                    wstring(data.cFileName) != L"..")
                {   
                    ss << wstring(data.cFileName) << L"\\" << L"\n";
                }
            }
            else{
                ss << wstring(data.cFileName) << L"\n";
            }
            bContinue = FindNextFileW(hFind, &data);
        }   
        FindClose(hFind); // Free the dir structure



        wstring cstr = ss.str();
        int len = cstr.size();
        //WriteConsole(hStdOut, cstr.c_str(), len, &dwBytesWritten, NULL);
        //WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);
        jchar* raw = new jchar[len];
        memcpy(raw, cstr.c_str(), len*sizeof(wchar_t));
        jstring result = env->NewString(raw, len);
        delete[] raw;
        return result;
    }
    catch(...){
        FindClose(hFind);
        jclass exceptionClass = env->FindClass("java/lang/Exception");
        env->ThrowNew(exceptionClass, "Exception occurred.");
    }

    return NULL;
}

Credit:
https://sites.google.com/site/jozsefbekes/Home/windows-programming/miscellaneous-functions

Even with this approach, there are still efficiencies to be gained. If you turn each path into a java.io.File, there is a huge performance hit, especially if the path represents a file on a network drive. I have no idea what Sun/Oracle is doing under the hood, but if you need file attributes other than the file path (e.g. size, modification date, etc.), I have found that the following JNI function is much faster than instantiating a java.io.File object for a network path.

JNIEXPORT jlongArray JNICALL Java_javaxt_io_File_GetFileAttributesEx(JNIEnv *env, jclass, jstring filename)
{   

  //Convert jstring to wstring
    const jchar *_filename = env->GetStringChars(filename, 0);
    jsize len = env->GetStringLength(filename);
    wstring path;
    path.assign(_filename, _filename + len);
    env->ReleaseStringChars(filename, _filename);


  //Get attributes
    WIN32_FILE_ATTRIBUTE_DATA fileAttrs;
    BOOL result = GetFileAttributesExW(path.c_str(), GetFileExInfoStandard, &fileAttrs);
    if (!result) {
        jclass exceptionClass = env->FindClass("java/lang/Exception");
        env->ThrowNew(exceptionClass, "Exception Occurred");
        return NULL; // fileAttrs is not valid here, so stop before reading it
    }

  //Create an array to store the WIN32_FILE_ATTRIBUTE_DATA
    jlong buffer[6];
    buffer[0] = fileAttrs.dwFileAttributes;
    buffer[1] = date2int(fileAttrs.ftCreationTime);
    buffer[2] = date2int(fileAttrs.ftLastAccessTime);
    buffer[3] = date2int(fileAttrs.ftLastWriteTime);
    buffer[4] = fileAttrs.nFileSizeHigh;
    buffer[5] = fileAttrs.nFileSizeLow;

    jlongArray jLongArray = env->NewLongArray(6);
    env->SetLongArrayRegion(jLongArray, 0, 6, buffer);
    return jLongArray;
}

You can find a full working example of this JNI-based approach in the javaxt-core library. In my tests using Java 1.6.0_38 with a Windows host hitting a Windows share, I have found this JNI approach approximately 10x faster than calling java.io.File listFiles() or shelling out "dir" commands.
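On the Java side, the values returned by the two native functions above still need decoding. The following is a sketch based only on what the C++ code shows: GetFiles returns a single string with entries separated by "\n" and a trailing backslash on directories, and GetFileAttributesEx returns six longs with the size split across indexes 4 and 5. NativeResults and its method names are hypothetical and are not taken from javaxt-core.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java-side helpers for the results of the two native functions.
class NativeResults {

    // Splits the newline-separated listing returned by GetFiles into entries.
    static List<String> splitListing(String raw) {
        List<String> entries = new ArrayList<String>();
        if (raw == null) {
            return entries;
        }
        for (String entry : raw.split("\n")) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    // The native code appends a backslash to directory names.
    static boolean isDirectoryEntry(String entry) {
        return entry.endsWith("\\");
    }

    // Combines nFileSizeHigh (index 4) and nFileSizeLow (index 5) into one size.
    static long fileSize(long[] attrs) {
        return (attrs[4] << 32) | (attrs[5] & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        for (String entry : splitListing("a.txt\nsub\\\n")) {
            System.out.println(entry + (isDirectoryEntry(entry) ? " (directory)" : ""));
        }
        System.out.println(fileSize(new long[] {0, 0, 0, 0, 1, 5}));
    }
}
```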

灰色世界里的红玫瑰 2024-07-17 05:24:45

I wonder why there are 10k files in a directory. Some file systems do not cope well with that many files, and file systems have specific limits, such as the maximum number of files per directory and the maximum depth of subdirectories.

I solved a similar problem with an iterator solution.

I needed to walk recursively across huge directories and several levels of the directory tree.

I tried FileUtils.iterateFiles() from Apache Commons IO, but it implements the iterator by adding all the files to a List and then returning List.iterator(), which is very bad for memory.

So I preferred to write something like this:

private static class SequentialIterator implements Iterator<File> {
    private DirectoryStack dir = null;
    private File current = null;
    private long limit;
    private FileFilter filter = null;

    public SequentialIterator(String path, long limit, FileFilter ff) {
        current = new File(path);
        this.limit = limit;
        filter = ff;
        dir = DirectoryStack.getNewStack(current);
    }

    public boolean hasNext() {
        while(walkOver());
        return isMore && (limit > count || limit < 0) && dir.getCurrent() != null;
    }

    private long count = 0;

    public File next() {
        File aux = dir.getCurrent();
        dir.advancePostition();
        count++;
        return aux;
    }

    private boolean walkOver() {
        if (dir.isOutOfDirListRange()) {
            if (dir.isCantGoParent()) {
                isMore = false;
                return false;
            } else {
                dir.goToParent();
                dir.advancePostition();
                return true;
            }
        } else {
            if (dir.isCurrentDirectory()) {
                if (dir.isDirectoryEmpty()) {
                    dir.advancePostition();
                } else {
                    dir.goIntoDir();
                }
                return true;
            } else {
                if (filter.accept(dir.getCurrent())) {
                    return false;
                } else {
                    dir.advancePostition();
                    return true;
                }
            }
        }
    }

    private boolean isMore = true;

    public void remove() {
        throw new UnsupportedOperationException();
    }

}

Note that the iterator stops after a given number of files has been iterated, and that it also takes a FileFilter.

And DirectoryStack is:

public class DirectoryStack {
    private class Element{
        private File files[] = null;
        private int currentPointer;
        public Element(File current) {
            currentPointer = 0;
            if (current.exists()) {
                if(current.isDirectory()){
                    files = current.listFiles();
                    Set<File> set = new TreeSet<File>();
                    for (int i = 0; i < files.length; i++) {
                        File file = files[i];
                        set.add(file);
                    }
                    set.toArray(files);
                }else{
                    throw new IllegalArgumentException("File current must be directory");
                }
            } else {
                throw new IllegalArgumentException("File current not exist");
            }

        }
        public String toString(){
            // String concatenation tolerates null, which getCurrent() returns past the end of the listing
            return "current="+getCurrent();
        }
        public int getCurrentPointer() {
            return currentPointer;
        }
        public void setCurrentPointer(int currentPointer) {
            this.currentPointer = currentPointer;
        }
        public File[] getFiles() {
            return files;
        }
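        public File getCurrent(){
            // Returns null when the pointer is past the end of the listing,
            // instead of relying on a swallowed exception
            File[] all = getFiles();
            int pointer = getCurrentPointer();
            if (all != null && pointer >= 0 && pointer < all.length) {
                return all[pointer];
            }
            return null;
        }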
        public boolean isDirectoryEmpty(){
            return !(getFiles().length>0);
        }
        public Element advancePointer(){
            setCurrentPointer(getCurrentPointer()+1);
            return this;
        }
    }
    private DirectoryStack(File first){
        getStack().push(new Element(first));
    }
    public static DirectoryStack getNewStack(File first){
        return new DirectoryStack(first);
    }
    public String toString(){
        String ret = "stack:\n";
        int i = 0;
        for (Element elem : stack) {
            ret += "level " + i++ + " " + elem.toString()+"\n";
        }
        return ret;
    }
    private Stack<Element> stack=null;
    private Stack<Element> getStack(){
        if(stack==null){
            stack = new Stack<Element>();
        }
        return stack;
    }
    public File getCurrent(){
        return getStack().peek().getCurrent();
    }
    public boolean isDirectoryEmpty(){
        return getStack().peek().isDirectoryEmpty();
    }
    public DirectoryStack downLevel(){
        getStack().pop();
        return this;
    }
    public DirectoryStack goToParent(){
        return downLevel();
    }
    public DirectoryStack goIntoDir(){
        return upLevel();
    }
    public DirectoryStack upLevel(){
        if(isCurrentNotNull())
            getStack().push(new Element(getCurrent()));
        return this;
    }
    public DirectoryStack advancePostition(){
        getStack().peek().advancePointer();
        return this;
    }
    public File[] peekDirectory(){
        return getStack().peek().getFiles();
    }
    public boolean isLastFileOfDirectory(){
        return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
    }
    public boolean gotMoreLevels() {
        return getStack().size()>0;
    }
    public boolean gotMoreInCurrentLevel() {
        return getStack().peek().getFiles().length > getStack().peek().getCurrentPointer()+1;
    }
    public boolean isRoot() {
        return !(getStack().size()>1);
    }
    public boolean isCurrentNotNull() {
        if(!getStack().isEmpty()){
            int currentPointer = getStack().peek().getCurrentPointer();
            int maxFiles = getStack().peek().getFiles().length;
            return currentPointer < maxFiles;
        }else{
            return false;
        }
    }
    public boolean isCurrentDirectory() {
        return getStack().peek().getCurrent().isDirectory();
    }
    public boolean isLastFromDirList() {
        return getStack().peek().getCurrentPointer() == (getStack().peek().getFiles().length-1);
    }
    public boolean isCantGoParent() {
        return !(getStack().size()>1);
    }
    public boolean isOutOfDirListRange() {
        return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
    }

}
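
For comparison, here is a minimal sketch of the same depth-first idea built on the JDK 7 NIO.2 API instead of File.listFiles(). It is not the code from the answer above: the class name LazyFileWalker, the Predicate-based filter (Java 8) and the maxFiles parameter are assumptions made for illustration. It keeps one open DirectoryStream per directory level, so entries are requested as the iteration advances rather than collected into an array first. Whether that is actually faster on a network share depends on the platform and file system provider.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical lazy counterpart of DirectoryStack: one open stream per directory level. */
class LazyFileWalker implements Iterator<Path>, AutoCloseable {
    private final Deque<DirectoryStream<Path>> streams = new ArrayDeque<>();
    private final Deque<Iterator<Path>> iterators = new ArrayDeque<>();
    private final Predicate<Path> filter;
    private final int maxFiles;
    private int count = 0;
    private Path next;

    LazyFileWalker(Path root, Predicate<Path> filter, int maxFiles) {
        this.filter = filter;
        this.maxFiles = maxFiles;
        push(root);
    }

    /** Opens a directory and makes it the current level. */
    private void push(Path dir) {
        try {
            DirectoryStream<Path> stream = Files.newDirectoryStream(dir);
            streams.push(stream);
            iterators.push(stream.iterator());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the current level and returns to its parent. */
    private void pop() {
        iterators.pop();
        try {
            streams.pop().close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        while (next == null && count < maxFiles && !iterators.isEmpty()) {
            Iterator<Path> current = iterators.peek();
            if (!current.hasNext()) {
                pop();                       // directory exhausted: go to parent
                continue;
            }
            Path entry = current.next();
            // NOFOLLOW_LINKS avoids descending into symlinked directories (possible cycles)
            if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                push(entry);                 // descend into the subdirectory
            } else if (filter.test(entry)) {
                next = entry;                // accepted file: hand it out on next()
            }
        }
        return next != null;
    }

    @Override
    public Path next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Path result = next;
        next = null;
        count++;
        return result;
    }

    /** Releases any directory handles still open, e.g. after stopping at maxFiles. */
    @Override
    public void close() {
        while (!iterators.isEmpty()) {
            pop();
        }
    }
}
```

Used in a try-with-resources block, the walker closes its directory handles even when iteration stops early. The iteration order is whatever the file system returns; unlike DirectoryStack, nothing is sorted, because sorting would require reading the whole directory first.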
污味仙女 2024-07-17 05:24:45

Using an Iterable doesn't imply that the files will be streamed to you. In fact, it's usually the opposite: the Iterable is often just a view over a collection that has already been fully built, so an array is typically at least as fast as an Iterable.
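
A small illustration of that point, as a sketch rather than anything from the JDK itself: the class and method names below are made up. The method returns an Iterable, yet every name is already in memory before the caller sees the first element, because it simply wraps the array returned by File.list().

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;

/** Hypothetical helper showing that an Iterable return type says nothing about laziness. */
class EagerIterableDemo {

    /** Looks like it could stream, but the whole directory is read up front. */
    static Iterable<String> names(File dir) {
        String[] all = dir.list();           // returns only after the complete listing is built
        if (all == null) {
            // list() returns null if dir is not a directory or an I/O error occurs
            return Collections.emptyList();
        }
        return Arrays.asList(all);           // a fixed-size view over the existing array
    }
}
```

A caller iterating over names(dir) pays the full listing cost on the call itself, exactly as with the array.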

装迷糊 2024-07-17 05:24:45

Are you sure it's due to Java, and not just a general problem with having 10k entries in one directory, particularly over the network?

Have you tried writing a proof-of-concept program that does the same thing in C, using the win32 findfirst/findnext functions, to see whether it's any faster?

I don't know the ins and outs of SMB, but I strongly suspect that it needs a round trip for every file in the list, which is not going to be fast, particularly over a network with moderate latency.

Holding 10k strings in an array doesn't sound like something that should tax a modern Java VM much either.
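
One way to check this from the Java side before writing any C is to time the listing itself. The sketch below is only a rough measurement aid, with made-up class and method names; it compares File.list() with the JDK 7 DirectoryStream mentioned in the question and records how long it takes until the first entry is available and until the listing is complete. If both totals are similar, the cost is in the directory protocol rather than in the array.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical timing harness for directory listings (requires JDK 7 or later). */
class ListingBenchmark {

    /** Entries seen, nanoseconds until the first entry, nanoseconds until the last. */
    static final class Timing {
        final long entries;
        final long firstNanos;   // -1 if the directory was empty
        final long totalNanos;

        Timing(long entries, long firstNanos, long totalNanos) {
            this.entries = entries;
            this.firstNanos = firstNanos;
            this.totalNanos = totalNanos;
        }

        @Override
        public String toString() {
            return entries + " entries, first after " + firstNanos / 1_000_000
                    + " ms, total " + totalNanos / 1_000_000 + " ms";
        }
    }

    /** File.list(): the first entry is only available once the whole array exists. */
    static Timing timeFileList(File dir) {
        long start = System.nanoTime();
        String[] names = dir.list();
        long elapsed = System.nanoTime() - start;
        long entries = names == null ? 0 : names.length;
        return new Timing(entries, entries == 0 ? -1 : elapsed, elapsed);
    }

    /** DirectoryStream: entries are pulled one by one while iterating. */
    static Timing timeDirectoryStream(Path dir) {
        long start = System.nanoTime();
        long first = -1;
        long entries = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path ignored : stream) {
                if (entries++ == 0) {
                    first = System.nanoTime() - start;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Timing(entries, first, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println("File.list():     " + timeFileList(dir));
        System.out.println("DirectoryStream: " + timeDirectoryStream(dir.toPath()));
    }
}
```

Run it a few times against one of the slow network folders; the first run may be dominated by caching effects on the client or server, so single measurements should be read with care.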
