How to efficiently merge huge files in C#

Published on 2024-09-15 10:00:12

I have over 125 TSV files of ~100MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end, I end up with one big file containing the content of all the files, one after the other (in no specific order).

Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "Union" of all those files? Otherwise, I will have to read all the files and write a big one.

Thanks!

Comments (4)

有深☉意 2024-09-22 10:00:12

So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:

static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

That's using the Stream.CopyTo method which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:

private static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}

There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.

EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
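A minimal sketch of that kind of tweaking, using explicit FileStream constructors with a bigger buffer and the SequentialScan hint, might look like the following (the helper name and the 1 MB buffer size are just placeholders to experiment with, not a recommendation):

static void ConcatenateFilesWithOptions(string outputFile, params string[] inputFiles)
{
    const int bufferSize = 1 << 20; // 1 MB; tune for your hardware

    // FileMode.Create truncates any existing output file, which File.OpenWrite alone does not.
    using (var output = new FileStream(outputFile, FileMode.Create, FileAccess.Write,
                                       FileShare.None, bufferSize))
    {
        foreach (string inputFile in inputFiles)
        {
            // SequentialScan hints to the OS that each file is read front to back exactly once.
            using (var input = new FileStream(inputFile, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, bufferSize, FileOptions.SequentialScan))
            {
                input.CopyTo(output, bufferSize);
            }
        }
    }
}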

花辞树 2024-09-22 10:00:12

Do it from the command line:

copy 1.txt+2.txt+3.txt combined.txt

or

copy *.txt combined.txt
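If you'd rather launch that from C# instead of a console, a rough sketch could shell out to cmd.exe (the folder path is hypothetical, and the result is written to the parent folder so the wildcard doesn't pick up the combined file itself):

using System.Diagnostics;

var psi = new ProcessStartInfo("cmd.exe", "/c copy /b *.tsv ..\\combined.tsv")
{
    WorkingDirectory = @"D:\InputFiles", // hypothetical folder containing the TSV files
    UseShellExecute = false
};

using (var process = Process.Start(psi))
{
    process.WaitForExit();
}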

撩发小公举 2024-09-22 10:00:12

Do you mean with merge that you want to decide with some custom logic what lines go where? Or do you mean that you mainly want to concatenate the files into one big one?

In the case of the latter, it is possible that you don't need to do this programmatically at all, just generate one batch file with this (/b is for binary, remove if not needed):

copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv"

Using C#, I'd take the following approach. Write a simple function that copies two streams:

void CopyStreamToStream(Stream dest, Stream src)
{
    int bytesRead;

    // experiment with the best buffer size, often 65536 is very performant
    byte[] buffer = new byte[65536];

    // copy everything
    while((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, bytesRead);
    }
}

// then use as follows (do in a loop, don't forget to use using-blocks)
CopyStreamToStream(yourOutputStream, yourInputStream);
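A rough sketch of that loop with using-blocks (the output path and the inputFiles list are placeholders) could be:

using (var output = File.Create(@"D:\combined.tsv"))    // placeholder output path
{
    foreach (var inputFile in inputFiles)               // e.g. Directory.GetFiles(folder, "*.tsv")
    {
        using (var input = File.OpenRead(inputFile))
        {
            CopyStreamToStream(output, input);          // note the (dest, src) argument order
        }
    }
}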

薄荷梦 2024-09-22 10:00:12

Using a folder of 100MB text files totalling ~12GB, I found that a small time saving could be made over the accepted answer by using File.ReadAllBytes and then writing that out to the stream.

[Test]
public void RaceFileMerges()
{
    var inputFilesPath = @"D:\InputFiles";
    var inputFiles = Directory.EnumerateFiles(inputFilesPath).ToArray();

    var sw = new Stopwatch();
    sw.Start();

    ConcatenateFilesUsingReadAllBytes(@"D:\ReadAllBytesResult", inputFiles);

    Console.WriteLine($"ReadAllBytes method in {sw.Elapsed}");

    sw.Reset();
    sw.Start();

    ConcatenateFiles(@"D:\CopyToResult", inputFiles);

    Console.WriteLine($"CopyTo method in {sw.Elapsed}");
}

private static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (var output = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            using (var input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

private static void ConcatenateFilesUsingReadAllBytes(string outputFile, params string[] inputFiles)
{
    using (var stream = File.OpenWrite(outputFile))
    {
        foreach (var inputFile in inputFiles)
        {
            var currentBytes = File.ReadAllBytes(inputFile);
            stream.Write(currentBytes, 0, currentBytes.Length);
        }
    }
}

ReadAllBytes method in 00:01:22.2753300

CopyTo method in 00:01:30.3122215

I repeated this a number of times with similar results.
