Pivoting large data files
I have some large tab-delimited data files. These files will have a few orders of magnitude more rows than columns. The problem is that I'd like to pivot these files, but in this case "large" is being defined as too big to do this in memory.
I was hoping to find some suggestions on the fastest way of doing this. I'm primarily working in Java on UNIX, although if a faster language-specific solution were to arise (or something using awk, etc.) I'd be open to that as well.
Currently we're doing this in memory, but as things evolve over time the files are exceeding our memory capacities. Obviously "buy a larger machine" is a solution, but not in the cards at the moment.
2 Answers
Something like the below may work for you. This code first opens the source file as a BufferedReader, then reads the first line and splits it against \t. The resulting array's length is the number of lines of the destination file. A new array of FileHolder objects is created, where a FileHolder basically holds a file descriptor and a ByteBuffer to use as a cache (so as not to write each and every word). When all holders are created, the first line is written. Then the source file is read again, split again, line by line, until empty, and all file holders appended to.
When done, the destination file is created (at last) and all FileHolder instances are written to it in array order, therefore in line order.
Here is a sample code (LONG, also available here). It can certainly be improved (resources are not really closed at the correct place, etc.) but it works. It transposes a 275 MB file here in around 25 seconds (quad-core Q6600, 4 GB RAM, x86_64 Linux 3.1.2-rc5), and runs with the "flimsy" default value of 64 MB of Sun's (64-bit) JDK:
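The original listing is not reproduced on this page. A minimal sketch of the approach described above might look like the following. This is not fge's code: the file names are placeholders, and a BufferedWriter per column stands in for the FileHolder's raw ByteBuffer cache.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public final class TransposeSketch
    {
        public static void main(final String... args)
            throws IOException
        {
            final Path source = Paths.get("input.tsv");       // placeholder name
            final Path destination = Paths.get("output.tsv"); // placeholder name

            try (BufferedReader reader = Files.newBufferedReader(source)) {
                // The first line's width gives the number of destination lines
                final String[] firstRow = reader.readLine().split("\t");
                final int nrColumns = firstRow.length;

                // One temp file per source column, standing in for FileHolder;
                // the BufferedWriter plays the role of the ByteBuffer cache
                final Path[] tmpFiles = new Path[nrColumns];
                final BufferedWriter[] writers = new BufferedWriter[nrColumns];
                for (int i = 0; i < nrColumns; i++) {
                    tmpFiles[i] = Files.createTempFile("transpose", ".tmp");
                    writers[i] = Files.newBufferedWriter(tmpFiles[i]);
                    writers[i].write(firstRow[i]); // write the first line
                }

                // Read the rest of the source line by line, appending each
                // word to the temp file of its column
                String line;
                while ((line = reader.readLine()) != null) {
                    final String[] row = line.split("\t");
                    for (int i = 0; i < nrColumns; i++) {
                        writers[i].write('\t');
                        writers[i].write(row[i]);
                    }
                }
                for (final BufferedWriter writer : writers)
                    writer.close();

                // Create the destination file (at last) and write all the
                // per-column files to it in array order, therefore line order
                try (BufferedWriter out = Files.newBufferedWriter(destination)) {
                    for (final Path tmpFile : tmpFiles) {
                        out.write(new String(Files.readAllBytes(tmpFile),
                            StandardCharsets.UTF_8));
                        out.newLine();
                        Files.delete(tmpFile);
                    }
                }
            }
        }
    }

Since the files have far more rows than columns, only a handful of temp files (and their caches) are open at once, which is what keeps the memory footprint flat no matter how many rows the source has.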
@fge:
1) It is better to use a CharBuffer instead of instantiating a lot of Strings.
2) It is better to use a precompiled Pattern (see the sketch below), because when you look inside String.split() you will see that it can end up compiling the regex on every call.
Always refrain from writing code that causes a lot of instantiation or String usage. That drives up memory use, which in turn drains performance.
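The snippet this comment originally showed is not preserved on the page; presumably it was something along these lines (a minimal sketch, assuming the tab-splitting from the answer above):

    import java.util.regex.Pattern;

    public final class SplitSketch
    {
        // Compiled once, reused for every line
        private static final Pattern TAB = Pattern.compile("\t");

        public static void main(final String... args)
        {
            final String line = "one\ttwo\tthree";
            // No Pattern.compile() behind the scenes on each call,
            // unlike line.split("\t")
            final String[] words = TAB.split(line);
            for (final String word : words)
                System.out.println(word);
        }
    }

Over millions of lines, hoisting the compilation out of the loop avoids re-parsing the same regex for every split (recent JDKs do special-case single-character separators in String.split, but the general advice stands for longer patterns).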