So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
You could convert to binary, but then you have one or more extra copies of the data if you need to keep the originals around.
It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again, subsequent passes will be faster.
To answer your questions in order:
Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, do some kind of buffered read through them one by one, storing whatever you need along the way.
BufferedReader and friends are simplest (see the sketch below), although you could look deeper into FileChannel to use memory-mapped I/O and go through windows of the data at a time.
That really depends on what you're doing with the data itself!
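A minimal sketch of the buffered, one-pass read mentioned above (the file name is a placeholder, and whitespace-separated numbers are assumed):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SequentialScan {
        public static void main(String[] args) throws IOException {
            double sum = 0;
            // Stream the file line by line; only one buffered chunk is in memory at a time.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Placeholder processing: parse whitespace-separated numbers and accumulate.
                    for (String token : line.trim().split("\\s+")) {
                        if (!token.isEmpty()) {
                            sum += Double.parseDouble(token);
                        }
                    }
                }
            }
            System.out.println("sum = " + sum);
        }
    }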
I'm a big fan of 'memory mapped I/O', aka 'direct byte buffers'. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.)
I suggest this approach because a) it works for me, and b) it lets you focus on your algorithm and leave the performance optimization to the JVM, OS and hardware. All too frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
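A rough sketch of that, with a made-up file name; note that a single map() call can only cover up to Integer.MAX_VALUE bytes, so a real version would map bigger files in windows:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedScan {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
                // Map the whole file read-only; the OS pages bytes in as they are touched.
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                long total = 0;
                while (buffer.hasRemaining()) {
                    total += buffer.get();   // placeholder work on each byte
                }
                System.out.println(total);
            }
        }
    }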
BTW: How much data are you dealing with, in GB? If it is more than 3-4 GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine & OS will take you to 1 TB or 128 TB of mappable data.
If you are thinking about performance, then get to know Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
Wide Finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and what didn't there.
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
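As a hedged illustration of that indexing step (the stride, file names, and line-based record boundaries here are all assumptions), one pre-processing pass can record the byte offset of every Nth line, so later runs can seek straight to the part they need:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class LineOffsetIndex {
        // One pre-processing pass: remember the byte offset of every `stride`-th line.
        static List<Long> buildIndex(String path, int stride) throws IOException {
            List<Long> offsets = new ArrayList<>();
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long lineNo = 0;
                long pos = 0;
                while (pos < file.length()) {
                    if (lineNo % stride == 0) {
                        offsets.add(pos);
                    }
                    file.readLine();              // skip past this line
                    pos = file.getFilePointer();
                    lineNo++;
                }
            }
            return offsets;
        }

        // Later runs jump straight to an indexed line instead of re-scanning the file.
        static String readIndexedLine(String path, List<Long> offsets, int i) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                file.seek(offsets.get(i));
                return file.readLine();
            }
        }
    }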
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
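A minimal sketch of that idea, with a made-up record layout (fixed-length ASCII records): each wrapper remembers its byte offset, parses itself lazily on first use, and can drop its cached values when memory gets tight.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    // Knows where its record lives in the file; reads and parses only when asked.
    class LazyRecord {
        private final String path;
        private final long offset;   // byte position of this record in the file
        private final int length;    // record length in bytes (assumed fixed here)
        private double[] cached;     // null until first use, or after evict()

        LazyRecord(String path, long offset, int length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        double[] values() throws IOException {
            if (cached == null) {
                byte[] raw = new byte[length];
                try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                    file.seek(offset);
                    file.readFully(raw);
                }
                // Placeholder parsing: the record is assumed to be space-separated ASCII numbers.
                String[] tokens = new String(raw, StandardCharsets.US_ASCII).trim().split("\\s+");
                cached = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    cached[i] = Double.parseDouble(tokens[i]);
                }
            }
            return cached;
        }

        void evict() {   // free the memory; values() will re-read from disk next time
            cached = null;
        }
    }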
I strongly recommend leveraging regular expressions and looking into the "new" I/O (java.nio) package for faster input. Then it should go about as quickly as you can realistically expect gigabytes of data to go.
You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
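For instance, a sketch that reads a fixed-size chunk at a time (the buffer size and file name are arbitrary):

    import java.io.IOException;
    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ChunkedRead {
        public static void main(String[] args) throws IOException {
            char[] buffer = new char[64 * 1024];   // tune this to whatever works for you
            long chars = 0;
            try (Reader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
                int read;
                while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                    // Process just this chunk; nothing else from the file is held in memory.
                    chars += read;
                }
            }
            System.out.println("characters read: " + chars);
        }
    }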
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.