Java: Advice on handling large data volumes (Part Deux)
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I want a good design for the data access. Typically, the following happens:
- One way or another, all the data will be read during the course of processing.
- Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
- There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
- Give me the first 10 kilobytes of all my files!
- Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
- Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Comments (9)
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then, let the OS worry about the details of caching, read, flushing etc.
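As a rough illustration of this suggestion, the sketch below reads one frame through a read-only memory mapping. The class name `MappedFrames` and the method `readFrame` are hypothetical, chosen only for this example; a real design would probably keep mappings open per file rather than remapping on every call, and a single mapping is limited to `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
final class MappedFrames {
    /**
     * Returns up to `length` bytes of `file` starting at `offset`.
     * The result is shorter than `length` if the file ends first.
     */
    static byte[] readFrame(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long available = Math.max(0, channel.size() - offset);
            int n = (int) Math.min(length, available);
            if (n == 0) {
                return new byte[0];
            }
            // The OS pages the mapped region in on demand and caches it.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, offset, n);
            byte[] frame = new byte[n];
            map.get(frame);
            return frame;
        }
    }
}
```

A request such as "the first 10 kilobytes of every file" would then be a loop calling `MappedFrames.readFrame(path, 0, 10 * 1024)` for each path.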
@Will
Pretty good results. Reading a large binary file quick comparison:
Test 1 - Basic sequential read with RandomAccessFile: 2656 ms
Test 2 - Basic sequential read with buffering: 47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization: 16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file system caching than you can likely do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and to read ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
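To make the eviction-policy point concrete, here is a minimal sketch of a block cache that evicts the most recently used block when full. Everything in it (`MruBlockCache`, the `loader` callback, numbering blocks by a `long` id) is an assumption made for illustration, not something described in this thread or taken from any database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical cache: when full, evicts the most recently used block.
final class MruBlockCache {
    private final int capacity;
    private final Map<Long, byte[]> blocks = new HashMap<>();
    private Long mostRecent; // id of the block touched last, null while empty

    MruBlockCache(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Returns the cached block, or loads it via `loader` on a miss. */
    byte[] get(long blockId, LongFunction<byte[]> loader) {
        byte[] block = blocks.get(blockId);
        if (block == null) {
            if (blocks.size() >= capacity) {
                blocks.remove(mostRecent); // drop the "youngest" block
            }
            block = loader.apply(blockId);
            blocks.put(blockId, block);
        }
        mostRecent = blockId;
        return block;
    }

    int size() {
        return blocks.size();
    }
}
```

Whether this policy beats the operating system's page cache depends entirely on the access pattern; for the mostly sequential reads described in the question, it probably does not.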
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've made a number of contributions to the project, and the source code is worth reviewing, if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
@Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This answers the part about minimizing I/O traffic. On the Java side, about all you can do is wrap your input streams in BufferedInputStreams (BufferedReader is the character-stream equivalent and is not suited to binary data). Beyond that, your operating system handles other optimizations, such as keeping recently read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
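For binary data, the buffered wrapper in `java.io` is `BufferedInputStream`. A minimal sketch of a buffered sequential pass follows; the class `BufferedScan`, the method `sumBytes`, and the 64 KiB buffer size are illustrative assumptions, and summing the bytes merely stands in for real processing.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class, not part of any library.
final class BufferedScan {
    /** Sums every byte of `file`, treating each byte as an unsigned value. */
    static long sumBytes(Path file) throws IOException {
        long sum = 0;
        // The buffer turns many one-byte reads into a few large reads.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), 64 * 1024)) {
            int b;
            while ((b = in.read()) != -1) {
                sum += b; // read() returns 0..255, or -1 at end of stream
            }
        }
        return sum;
    }
}
```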
Someone recommended Hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.