I am trying to process files stored over a network, one at a time. Reading the files is fast thanks to buffering, so that is not the issue. The problem is just listing the contents of a directory: I have at least 10k files per folder, across many folders.
Performance is very slow because File.list() returns an array instead of an iterable: Java goes off, collects all the names in the folder, and packs them into an array before returning.
The bug entry for this is https://bugs.java.com/bugdatabase/view_bug;jsessionid=db7fcf25bcce13541c4289edeb4?bug_id=4285834, and it has no workaround; they just say the problem has been fixed in JDK7.
A few questions:
- Does anybody have a workaround to this performance bottleneck?
- Am I trying to achieve the impossible? Is performance still going to be poor even if it just iterates over the directories?
- Could I use the beta JDK7 builds that have this functionality without having to build my entire project on it?
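For context, the JDK7 functionality referred to above is the NIO.2 DirectoryStream API, which hands entries back as the directory is read instead of building the whole array up front. A minimal sketch (the share path here is hypothetical):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Jdk7Listing {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("//server/share/folder");       // hypothetical network path
        DirectoryStream<Path> stream = Files.newDirectoryStream(dir);
        try {
            for (Path entry : stream) {                       // entries arrive lazily
                System.out.println(entry.getFileName());      // process one at a time
            }
        } finally {
            stream.close();                                   // releases the OS handle
        }
    }
}
```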
Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.
If you needed to do it within the app, you could just use Runtime.exec(), but it would create some nastiness.
To answer your question: the first form is going to be blazingly fast, and the second should be pretty fast as well.
Be sure to use the one-item-per-line (bare, no decoration, no graphics), full-path, and recurse options of whichever command you choose.
EDIT:
30 minutes just to get a directory listing, wow.
It just struck me that if you use exec(), you can get its stdout redirected into a pipe instead of writing it to a file.
If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.
The interaction may actually slow things down, but maybe not--you might give it a try.
Wow, I just went to find the syntax of the exec call for you and came across this, possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing): good link in the Wayback Machine (Jörg provided it in a comment to replace the original Sun link that Oracle broke).
Anyway, the idea is straightforward but getting the code right is annoying. I'll go steal some code from the internets and hack it up--brb
And thank you, code donor at IBM.
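For what it's worth, here is a minimal sketch of the pipe idea, assuming a Windows share mapped at X:\shared (the path is illustrative; on Unix you might run "ls -1" or "find ... -type f" instead):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExecListing {
    public static void main(String[] args) throws Exception {
        // "dir /b /s" prints one bare, full path per line, recursively
        String[] cmd = { "cmd", "/c", "dir", "/b", "/s", "X:\\shared" };
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        try {
            String path;
            while ((path = out.readLine()) != null) {
                process(path);                  // start working before "dir" finishes
            }
        } finally {
            out.close();
        }
        p.waitFor();
    }

    private static void process(String path) {
        System.out.println(path);               // placeholder for the real work
    }
}
```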
How about using the File.list(FilenameFilter filter) method and implementing FilenameFilter.accept(File dir, String name) to process each file and return false?
I ran this on a Linux VM against a directory with 10K+ files and it took under 10 seconds.
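A minimal sketch of that trick, with a hypothetical network path: the filter does the per-file work and always returns false, so the returned array stays empty.

```java
import java.io.File;
import java.io.FilenameFilter;

public class FilterWalker {
    public static void main(String[] args) {
        File dir = new File("//server/share/folder");         // hypothetical path
        dir.list(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                process(new File(parent, name));               // handle the entry now
                return false;                                  // never accumulate it
            }
        });
    }

    private static void process(File f) {
        System.out.println(f.getPath());                       // placeholder for the real work
    }
}
```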
An alternative is to have the files served over a different protocol. As I understand it, you're using SMB for that, and Java is just trying to list the files as if they were regular local files.
The problem here might not be Java alone (how does it behave when you open that directory in Windows Explorer, e.g. x:\shared?). In my experience that also takes a considerable amount of time.
You could switch to a protocol like HTTP just for fetching the file names. That way you retrieve the list of files over HTTP (10k lines shouldn't be too much) and let the server deal with the directory listing. This would be very fast, since the listing runs with local resources (those on the server).
Then, once you have the list, you can process the files one by one exactly the way you're doing right now.
The key point is to have a helper mechanism on the other side of the link.
Is this feasible?
Today: (diagram of the current SMB-based setup omitted)
Proposed: (diagram of the setup with a server-side listing service omitted)
The HTTP server could be something very small and simple.
If this is the way you have it right now, you're fetching all the information about the 10k files to your client machine (I don't know how much information that is), when you only need the file names for later processing.
If the per-file processing is very fast right now, it may slow down a bit, because the prefetched information is no longer available.
Give it a try.
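As a rough sketch of the client side, assuming a hypothetical tiny HTTP endpoint on the file server that returns one file name per line:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpFileList {
    public static void main(String[] args) throws Exception {
        // Hypothetical listing service running on the file server itself
        URL url = new URL("http://fileserver/list?dir=/shared/incoming");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            String name;
            while ((name = in.readLine()) != null) {
                process(name);                   // one file name per line
            }
        } finally {
            in.close();
            conn.disconnect();
        }
    }

    private static void process(String fileName) {
        System.out.println(fileName);            // placeholder for the real work
    }
}
```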
A non-portable solution would be to make native calls to the operating system and stream the results.
For Linux
You can look at something like readdir. You can walk the directory structure like a linked list and return results in batches or individually.
For Windows
On Windows the behavior would be fairly similar, using the FindFirstFile and FindNextFile APIs.
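On the Java side, a hedged sketch of how such a native walker could be exposed via JNI; the library name and method signatures are assumptions, and the C implementation (readdir on Linux, FindFirstFileW/FindNextFileW on Windows) is not shown here:

```java
// Java-side declaration of a hypothetical native directory walker.
public class NativeDirectoryReader {
    static {
        System.loadLibrary("nativedir");     // hypothetical native library name
    }

    // Opens the directory and returns an opaque native handle.
    public native long open(String path);

    // Returns the next entry name, or null when the directory is exhausted.
    public native String next(long handle);

    // Releases the native handle.
    public native void close(long handle);
}
```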
I doubt the problem is related to the bug report you referenced.
The issue there is "only" memory usage, but not necessarily speed.
If you have enough memory the bug is not relevant for your problem.
You should measure whether your problem is memory related or not. Turn on garbage-collector logging and use, for example, gcviewer to analyze your memory usage.
I suspect the SMB protocol is what's causing the problem.
You can try writing a test in another language to see if it's faster, or you can try getting the list of filenames through some other method, such as the approaches described in the other answers here.
If you eventually need to process all the files, then having an Iterable rather than a String[] won't give you any advantage, as you'll still have to go and fetch the whole list of files.
If you're on Java 1.5 or 1.6, shelling out to the "dir" command and parsing its standard output stream on Windows is a perfectly acceptable approach. I've used this approach in the past for processing network drives and it has generally been a lot faster than waiting for the native java.io.File listFiles() method to return.
Of course, a JNI call should be faster and potentially safer than shelling out to "dir". JNI code along the lines of the example credited below can be used to retrieve a list of files/directories using the Windows API. Such a function can easily be refactored into a new class so the caller can retrieve file paths incrementally (i.e. get one path at a time); for example, call FindFirstFileW in the constructor and have a separate method call FindNextFileW.
Credit:
https://sites.google.com/site/jozsefbekes/Home/windows-programming/miscellaneous-functions
Even with this approach, there are still efficiencies to be gained. If you serialize the path into a java.io.File, there is a huge performance hit, especially if the path represents a file on a network drive. I have no idea what Sun/Oracle is doing under the hood, but if you need file attributes other than the path (e.g. size, modification date, etc.), I have found that a JNI call for those attributes is much faster than instantiating a java.io.File object for a path on the network.
You can find a full working example of this JNI-based approach in the javaxt-core library. In my tests using Java 1.6.0_38 with a Windows host hitting a Windows share, I found this JNI approach approximately 10x faster than calling java.io.File listFiles() or shelling out to "dir".
I wonder why there are 10k files in a directory. Some file systems do not cope well with that many files, and file systems have specific limits, such as the maximum number of files per directory and the maximum depth of subdirectories.
I solved a similar problem with an iterator-based solution.
I needed to walk huge directories and several levels of the directory tree recursively.
I tried FileUtils.iterateFiles() from Apache Commons IO, but it implements the iterator by adding all the files to a List and then returning List.iterator(), which is very bad for memory.
So I preferred to write something like this:
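As a rough reconstruction (the FileTreeIterator name and its exact shape are assumptions; only the DirectoryStack name comes from the answer itself), such a lazy iterator might look like:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Lazily walks a directory tree. Only one directory is listed at a time,
// so memory stays bounded, and iteration stops after maxFiles matches.
public class FileTreeIterator implements Iterator<File> {
    private final DirectoryStack pending;   // directories still to be visited
    private final FileFilter filter;
    private final int maxFiles;
    private int returned = 0;
    private Iterator<File> current = Collections.<File>emptyList().iterator();
    private File next;

    public FileTreeIterator(File root, FileFilter filter, int maxFiles) {
        this.pending = new DirectoryStack();
        this.pending.push(root);
        this.filter = filter;
        this.maxFiles = maxFiles;
        advance();
    }

    private void advance() {
        next = null;
        while (returned < maxFiles) {
            if (current.hasNext()) {
                File f = current.next();
                if (f.isDirectory()) {
                    pending.push(f);                      // descend into it later
                } else if (filter == null || filter.accept(f)) {
                    next = f;
                    returned++;
                    return;
                }
            } else if (!pending.isEmpty()) {
                File dir = pending.pop();
                File[] entries = dir.listFiles();         // one directory at a time
                current = entries == null
                        ? Collections.<File>emptyList().iterator()
                        : Arrays.asList(entries).iterator();
            } else {
                return;                                   // tree exhausted
            }
        }
    }

    public boolean hasNext() { return next != null; }

    public File next() {
        if (next == null) throw new NoSuchElementException();
        File result = next;
        advance();
        return result;
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```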
Note that the iterator stops after a given number of files has been iterated, and it also takes a FileFilter.
And DirectoryStack is:
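Again only a sketch of what such a helper could look like, not the author's original class: a trivial LIFO of directories still waiting to be visited.

```java
import java.io.File;
import java.util.LinkedList;

// Simple stack of directories pending a visit.
public class DirectoryStack {
    private final LinkedList<File> dirs = new LinkedList<File>();

    public void push(File dir) { dirs.addFirst(dir); }
    public File pop()          { return dirs.removeFirst(); }
    public boolean isEmpty()   { return dirs.isEmpty(); }
}
```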
Using an Iterable doesn't imply that the files will be streamed to you. In fact it's usually the opposite, so an array is typically faster than an Iterable.
Are you sure it's due to Java, not just a general problem with having 10k entries in one directory, particularly over the network?
Have you tried writing a proof-of-concept program to do the same thing in C using the win32 findfirst/findnext functions to see whether it's any faster?
I don't know the ins and outs of SMB, but I strongly suspect that it needs a round trip for every file in the list - which is not going to be fast, particularly over a network with moderate latency.
Having 10k strings in an array doesn't sound like something that should tax a modern Java VM too much either.