How to efficiently merge huge files with C#
I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end I have one big file containing the content of all the files one after the other (in no specific order).
Is there an efficient way to do that? I was wondering whether Windows provides an API to simply make a big "union" of all those files. Otherwise, I will have to read all the files and write one big one.
Thanks!
Comments (4)
So "merging" really just means writing the files one after the other? That's pretty straightforward: open one output stream, then for each input file open an input stream, copy its data to the output, and close it.
This can use the Stream.CopyTo method, which is new in .NET 4. If you're not using .NET 4, a small helper method that copies from one stream to another through a buffer would come in handy.
There's nothing that I'm aware of that is more efficient than this, but importantly, it won't take up much memory on your system at all. It never reads a whole file into memory before writing it out again.
EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
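The code from this answer did not survive on this page, so the following is only a minimal sketch of the approach it describes, not the author's original. The class name FileConcatenator, the method name ConcatenateFiles, and the 81920-byte buffer size are assumptions chosen for illustration. The FileOptions.SequentialScan flag is one example of the kind of file option the edit refers to, and whether it helps on a given system is something to measure.

```csharp
using System.IO;

public static class FileConcatenator
{
    // Assumed buffer size; worth tuning for the disks involved.
    private const int BufferSize = 81920;

    // Appends each input file to outputFile, in the order given.
    // Only one buffer's worth of data is held in memory at a time.
    public static void ConcatenateFiles(string outputFile, params string[] inputFiles)
    {
        // File.Create truncates the output file if it already exists.
        using (Stream output = File.Create(outputFile))
        {
            foreach (string inputFile in inputFiles)
            {
                // SequentialScan hints to the OS that the file is read front to back.
                using (Stream input = new FileStream(
                    inputFile, FileMode.Open, FileAccess.Read, FileShare.Read,
                    BufferSize, FileOptions.SequentialScan))
                {
                    // Stream.CopyTo requires .NET 4 or later.
                    input.CopyTo(output, BufferSize);
                }
            }
        }
    }
}
```

A call such as FileConcatenator.ConcatenateFiles("merged.tsv", "part1.tsv", "part2.tsv") would write the two parts back to back (the file names here are placeholders). The output file must not also appear among the inputs.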
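Do it from the command line. The Windows copy command can concatenate files, either with the sources listed explicitly and joined by +, or with a wildcard source.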
By "merge", do you mean that you want to decide with some custom logic which lines go where? Or do you mainly want to concatenate the files into one big one?
In the latter case, you may not need to do this programmatically at all: just generate a batch file that does the copy (/b is for binary; remove it if not needed).
Using C#, I'd take the following approach: write a simple function that copies one stream into another, then call it for each input file.
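The function from this answer is not preserved here, so what follows is a hedged sketch of what a stream-to-stream copy and its use over a folder might look like. The names StreamCopier, CopyStreamToStream and MergeFolder, and the 65536-byte buffer, are assumptions for illustration. The files are sorted by name only to make the result deterministic; the question does not require any order.

```csharp
using System;
using System.IO;

public static class StreamCopier
{
    // Assumed buffer size; experiment to find what performs best.
    private const int BufferSize = 65536;

    // Copies everything remaining in src to dest, one buffer at a time.
    public static void CopyStreamToStream(Stream dest, Stream src)
    {
        byte[] buffer = new byte[BufferSize];
        int bytesRead;
        while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
        {
            dest.Write(buffer, 0, bytesRead);
        }
    }

    // Appends every file in folder that matches searchPattern to outputFile.
    public static void MergeFolder(string folder, string searchPattern, string outputFile)
    {
        string outputFull = Path.GetFullPath(outputFile);
        string[] inputs = Directory.GetFiles(folder, searchPattern);
        Array.Sort(inputs, StringComparer.Ordinal);

        using (Stream dest = File.Create(outputFile))
        {
            foreach (string path in inputs)
            {
                // Skip the output file itself if it lives in the same folder.
                if (string.Equals(Path.GetFullPath(path), outputFull,
                                  StringComparison.OrdinalIgnoreCase))
                {
                    continue;
                }

                using (Stream src = File.OpenRead(path))
                {
                    CopyStreamToStream(dest, src);
                }
            }
        }
    }
}
```

For the files in the question, a call might look like StreamCopier.MergeFolder(@"C:\data", "*.tsv", @"C:\data\merged.out"), where the paths are placeholders.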
Using a folder of 100 MB text files totalling ~12 GB, I found that a small time saving could be made over the accepted answer by using File.ReadAllBytes and then writing the result out to the stream.
I repeated this a number of times with similar results.
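The benchmark code is not shown, so this is only a sketch of the variant being described, with assumed names (WholeFileConcatenator, ConcatenateWithReadAllBytes). Unlike a buffered copy, it holds one entire input file in memory at a time, which is roughly 100 MB per file for the data in the question.

```csharp
using System.IO;

public static class WholeFileConcatenator
{
    // Reads each input file fully into memory, then appends it to outputFile.
    public static void ConcatenateWithReadAllBytes(string outputFile, params string[] inputFiles)
    {
        // File.Create truncates the output file if it already exists.
        using (Stream output = File.Create(outputFile))
        {
            foreach (string inputFile in inputFiles)
            {
                // The whole file must fit in memory as a single byte array.
                byte[] contents = File.ReadAllBytes(inputFile);
                output.Write(contents, 0, contents.Length);
            }
        }
    }
}
```

Any time saving depends on the hardware and file sizes, so it is worth timing this against a buffered copy on the actual data before choosing it.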