How to compare two files quickly using .NET?

Published 2024-08-03 00:12:27

Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.

  • Would a checksum comparison such as CRC be faster?
  • Are there any .NET libraries that can generate a checksum for a file?

不即不离 2024-08-10 00:12:27

The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.

Here's what I came up with:

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                 fs1.Read(one, 0, BYTES_TO_READ);
                 fs2.Read(two, 0, BYTES_TO_READ);

                if (BitConverter.ToInt64(one,0) != BitConverter.ToInt64(two,0))
                    return false;
            }
        }

        return true;
    }

In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.

Here's the ReadByte and hashing methods I used, for comparison purposes:

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (int i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
        byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());

        for (int i=0; i<firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
圈圈圆圆圈圈 2024-08-10 00:12:27

A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.

As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.

However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the disk I/O one time, on the new file. This would likely be faster than a byte-by-byte comparison.
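
For illustration, here is a minimal sketch of that pre-computation idea, assuming you can cache the hash of the existing file somewhere; the `cachedHashOfExisting` parameter and the class/method names are hypothetical, not an established API:

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class ChecksumCompare
    {
        // Compute an MD5 checksum for a file (any HashAlgorithm would work the same way).
        public static byte[] ComputeMd5(string path)
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(path))
                return md5.ComputeHash(stream);
        }

        // Compare a new file against a checksum computed earlier, so only the
        // new file has to be read from disk now.
        public static bool MatchesPrecomputedChecksum(string newFilePath, byte[] cachedHashOfExisting)
        {
            return ComputeMd5(newFilePath).SequenceEqual(cachedHashOfExisting);
        }
    }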

甚是思念 2024-08-10 00:12:27

If you do decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:

• for `System.String` path names:

public static bool AreFileContentsEqual(String path1, String path2) =>
              File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

• for `System.IO.FileInfo` instances:

public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
    fi1.Length == fi2.Length &&
    (fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
                         File.ReadAllBytes(fi2.FullName)));

Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc., but as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.

This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (because it's fundamentally optimized to keep small, short-lived allocations extremely cheap), and in fact could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT to benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk at all for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive cases.


1. An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."

我也只是我 2024-08-10 00:12:27

In addition to Reed Copsey's answer:

  • The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.

  • If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.

ˉ厌 2024-08-10 00:12:27

It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it, reading a larger chunk. I reduced the average comparison time to 1/4.

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
踏月而来 2024-08-10 00:12:27

Edit: This method would not work for comparing binary files!

In .NET 4.0, the File class has the following two new methods:

public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)

Which means you could use:

bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
安人多梦 2024-08-10 00:12:27

The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
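
As a quick sanity check on that figure, here is the arithmetic behind it (a sketch that assumes an idealized, uniformly distributed 32-bit hash; real hash functions only approximate this):

    using System;

    // Chance that two different files happen to share the same 32-bit hash value,
    // under the uniform-hash assumption:
    double collisionProbability = Math.Pow(2.0, -32.0);   // ≈ 2.33e-10
    double confidence = 1.0 - collisionProbability;       // ≈ 0.9999999998
    Console.WriteLine($"{confidence:P8}");                 // prints roughly 99.99999998%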

金橙橙 2024-08-10 00:12:27

My answer is a derivative of @lars's, but it fixes the bug in the call to Stream.Read. I also add some of the fast-path checks other answers had, plus input validation. In short, this should be the answer:

using System;
using System.IO;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
        }

        public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return StreamsContentsAreEqual(file1, file2);
                    }
                }
            }
        }

        private static int ReadFullBuffer(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = ReadFullBuffer(stream1, buffer1);
                int count2 = ReadFullBuffer(stream2, buffer2);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}

Or if you want to be super-awesome, you can use the async variant:

using System;
using System.IO;
using System.Threading.Tasks;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
        }

        public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
                    }
                }
            }
        }

        private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
                int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}
拿命拼未来 2024-08-10 00:12:27

Honestly, I think you need to prune your search tree down as much as possible.

Things to check before going byte-by-byte:

  1. Are the sizes the same?
  2. Is the last byte in file A different from the last byte in file B? (A quick check like this is sketched below.)

Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.

Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
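
A minimal sketch of the "check the last bytes first" idea from the list above; the helper names are made up for illustration, and a real implementation would combine this with the full chunked comparison:

    using System;
    using System.IO;

    static class QuickReject
    {
        // Cheap pre-check: compare the trailing bytes of both files before doing
        // a full sequential comparison. Returns true if the tails already differ.
        public static bool LastBytesDiffer(string path1, string path2, int tailLength = 8)
        {
            using (var fs1 = File.OpenRead(path1))
            using (var fs2 = File.OpenRead(path2))
            {
                if (fs1.Length != fs2.Length)
                    return true;                          // different sizes => different files

                int count = (int)Math.Min(tailLength, fs1.Length);
                if (count == 0)
                    return false;                         // both empty => nothing to reject on

                var tail1 = new byte[count];
                var tail2 = new byte[count];

                fs1.Seek(-count, SeekOrigin.End);
                fs2.Seek(-count, SeekOrigin.End);

                ReadExactly(fs1, tail1);
                ReadExactly(fs2, tail2);

                for (int i = 0; i < count; i++)
                    if (tail1[i] != tail2[i])
                        return true;

                return false;
            }
        }

        // Stream.Read may return fewer bytes than requested, so loop until full.
        private static void ReadExactly(Stream stream, byte[] buffer)
        {
            int offset = 0;
            while (offset < buffer.Length)
            {
                int read = stream.Read(buffer, offset, buffer.Length - offset);
                if (read == 0)
                    break;
                offset += read;
            }
        }
    }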

离鸿 2024-08-10 00:12:27

Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9

Here is a proposal to do it with AVX2 SIMD instructions:

using System.Buffers;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace FileCompare;

public static class FastFileCompare
{
    public static bool AreFilesEqual(FileInfo fileInfo1, FileInfo fileInfo2, int bufferSize = 4096 * 32)
    {
        if (fileInfo1.Exists == false)
        {
            throw new FileNotFoundException(nameof(fileInfo1), fileInfo1.FullName);
        }

        if (fileInfo2.Exists == false)
        {
            throw new FileNotFoundException(nameof(fileInfo2), fileInfo2.FullName);
        }

        if (fileInfo1.Length != fileInfo2.Length)
        {
            return false;
        }

        if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }
 
        using FileStream fileStream01 = fileInfo1.OpenRead();
        using FileStream fileStream02 = fileInfo2.OpenRead();
        ArrayPool<byte> sharedArrayPool = ArrayPool<byte>.Shared;
        byte[] buffer1 = sharedArrayPool.Rent(bufferSize);
        byte[] buffer2 = sharedArrayPool.Rent(bufferSize);
        Array.Fill<byte>(buffer1, 0);
        Array.Fill<byte>(buffer2, 0);
        try
        {
            while (true)
            {
                int len1 = 0;
                for (int read;
                     len1 < buffer1.Length &&
                     (read = fileStream01.Read(buffer1, len1, buffer1.Length - len1)) != 0;
                     len1 += read)
                {
                }

                int len2 = 0;
                for (int read;
                     len2 < buffer2.Length &&
                     (read = fileStream02.Read(buffer2, len2, buffer2.Length - len2)) != 0;
                     len2 += read)
                {
                }

                if (len1 != len2)
                {
                    return false;
                }

                if (len1 == 0)
                {
                    return true;
                }

                unsafe
                {
                    fixed (byte* pb1 = buffer1)
                    {
                        fixed (byte* pb2 = buffer2)
                        {
                            int vectorSize = Vector256<byte>.Count;
                            for (int processed = 0; processed < len1; processed += vectorSize)
                            {
                                Vector256<byte> result = Avx2.CompareEqual(Avx.LoadVector256(pb1 + processed), Avx.LoadVector256(pb2 + processed));
                                if (Avx2.MoveMask(result) != -1)
                                {
                                    return false;
                                }
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            sharedArrayPool.Return(buffer1);
            sharedArrayPool.Return(buffer2);
        }
    }
}
鯉魚旗 2024-08-10 00:12:27

If the files are not too big, you can use:

public static byte[] ComputeFileHash(string fileName)
{
    using (var stream = File.OpenRead(fileName))
        return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}

It will only be feasible to compare hashes if the hashes are useful to store.

(Edited the code to something much cleaner.)
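
As a tiny illustration of "hashes that are useful to store", here is a sketch that caches hashes by path so repeated comparisons don't have to re-read files; the dictionary cache and method names are just for illustration (a real cache would also invalidate entries when a file's size or timestamp changes):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class HashCache
    {
        // Previously computed hashes, keyed by file path.
        private static readonly Dictionary<string, byte[]> Cache = new Dictionary<string, byte[]>();

        public static byte[] GetFileHash(string fileName)
        {
            if (!Cache.TryGetValue(fileName, out var hash))
            {
                using (var md5 = MD5.Create())
                using (var stream = File.OpenRead(fileName))
                    hash = md5.ComputeHash(stream);
                Cache[fileName] = hash;
            }
            return hash;
        }

        // "Probably" because equal hashes are strong evidence, not proof, of equal content.
        public static bool FilesProbablyEqual(string file1, string file2) =>
            GetFileHash(file1).SequenceEqual(GetFileHash(file2));
    }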

め可乐爱微笑 2024-08-10 00:12:27

My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.

So it is possible to replace that "Math.Ceiling and iterations" loop in the answer above with the simplest one:

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false;
            }

I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being about the same amount of work as comparing 8 bytes in the two arrays.

不美如何 2024-08-10 00:12:27

Another improvement on large files with identical length, might be to not read the files sequentially, but rather compare more or less random blocks.

You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.

This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
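
A rough, single-threaded sketch of that probing idea, assuming equal-length seekable files: it checks a few blocks spread across the files (start, middle, end) before any full sequential pass. The offsets, block size, and names are arbitrary choices for illustration; the multi-threaded variant described above would simply run the probes in parallel:

    using System;
    using System.IO;
    using System.Linq;

    static class ProbeCompare
    {
        // Probe a few blocks spread across both files. Any mismatch proves the
        // files differ without reading them sequentially from the start; if all
        // probes match, a full comparison is still needed to prove equality.
        public static bool ProbedBlocksMatch(string path1, string path2, int blockSize = 64 * 1024)
        {
            using (var fs1 = File.OpenRead(path1))
            using (var fs2 = File.OpenRead(path2))
            {
                if (fs1.Length != fs2.Length)
                    return false;

                long length = fs1.Length;
                long[] offsets = { 0L, length / 2, Math.Max(0L, length - blockSize) };

                var buffer1 = new byte[blockSize];
                var buffer2 = new byte[blockSize];

                foreach (long offset in offsets.Distinct())
                {
                    fs1.Seek(offset, SeekOrigin.Begin);
                    fs2.Seek(offset, SeekOrigin.Begin);

                    // For local files FileStream normally fills the buffer; a robust
                    // version would loop until the requested count has been read.
                    int read1 = fs1.Read(buffer1, 0, blockSize);
                    int read2 = fs2.Read(buffer2, 0, blockSize);
                    if (read1 != read2)
                        return false;

                    for (int i = 0; i < read1; i++)
                        if (buffer1[i] != buffer2[i])
                            return false;
                }

                return true;
            }
        }
    }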

月下凄凉 2024-08-10 00:12:27

If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)

  1. open both files f1, f2
  2. get the respective file length l1, l2
  3. if l1 != l2 the files are different; stop
  4. mmap() both files
  5. use memcmp() on the mmap()ed files

OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
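
For what it's worth, .NET does have an analogue of mmap() in System.IO.MemoryMappedFiles. A rough sketch of steps 1-5 might look like the following; there is no direct managed memcmp(), so the mapped views are compared through streams instead (the class and method names are just for illustration):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class MappedCompare
    {
        public static bool FilesAreEqual(string path1, string path2)
        {
            long length1 = new FileInfo(path1).Length;
            long length2 = new FileInfo(path2).Length;
            if (length1 != length2)
                return false;                             // step 3: lengths differ => files differ
            if (length1 == 0)
                return true;

            // Steps 4-5: map both files read-only and compare their contents
            // through streams over the mapped views.
            using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
            using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
            using (var view1 = map1.CreateViewStream(0, length1, MemoryMappedFileAccess.Read))
            using (var view2 = map2.CreateViewStream(0, length2, MemoryMappedFileAccess.Read))
            {
                var buffer1 = new byte[64 * 1024];
                var buffer2 = new byte[64 * 1024];
                while (true)
                {
                    int read1 = view1.Read(buffer1, 0, buffer1.Length);
                    int read2 = view2.Read(buffer2, 0, buffer2.Length);
                    if (read1 != read2)
                        return false;
                    if (read1 == 0)
                        return true;                      // both views exhausted => equal
                    for (int i = 0; i < read1; i++)
                        if (buffer1[i] != buffer2[i])
                            return false;
                }
            }
        }
    }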

心房的律动 2024-08-10 00:12:27

I think there are applications where "hashing" is faster than comparing byte by byte, for example if you need to compare a file against many others, or if you keep a thumbnail of a photo that can change. It depends on where and how it is used.

private bool CompareFilesByte(string file1, string file2)
{
    using (var fs1 = new FileStream(file1, FileMode.Open))
    using (var fs2 = new FileStream(file2, FileMode.Open))
    {
        if (fs1.Length != fs2.Length) return false;
        int b1, b2;
        do
        {
            b1 = fs1.ReadByte();
            b2 = fs2.ReadByte();
            if (b1 != b2 || b1 < 0) return false;
        }
        while (b1 >= 0);
    }
    return true;
}

private string HashFile(string file)
{
    using (var fs = new FileStream(file, FileMode.Open))
    using (var reader = new BinaryReader(fs))
    {
        var hash = new SHA512CryptoServiceProvider();
        hash.ComputeHash(reader.ReadBytes((int)fs.Length));
        return Convert.ToBase64String(hash.Hash);
    }
}

private bool CompareFilesWithHash(string file1, string file2)
{
    var str1 = HashFile(file1);
    var str2 = HashFile(file2);
    return str1 == str2;
}

Here is how you can measure which approach is fastest:

var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));

Optionally, we can save the hash in a database.

Hope this can help

夏尔 2024-08-10 00:12:27

Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.

I have provided a "fast" version that is multi-threaded as it compares byte arrays (each buffer filled from what's been read in each file) in different threads using Tasks.

As expected, it's much faster (around 3x faster) but it consumes more CPU (because it's multi-threaded) and more memory (because it needs two byte array buffers per comparison thread).

    public static bool AreFilesIdenticalFast(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
    }

    public static bool AreFilesIdentical(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdentical);
    }

    public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
    {
        if (path1 == null)
            throw new ArgumentNullException(nameof(path1));

        if (path2 == null)
            throw new ArgumentNullException(nameof(path2));

        if (areStreamsIdentical == null)
            throw new ArgumentNullException(nameof(areStreamsIdentical));

        if (!File.Exists(path1) || !File.Exists(path2))
            return false;

        using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (valueFile.Length != thisFile.Length)
                    return false;

                if (!areStreamsIdentical(thisFile, valueFile))
                    return false;
            }
        }
        return true;
    }

    public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)

        var tasks = new List<Task<bool>>();
        do
        {
            // consumes more memory (two buffers for each tasks)
            var buffer1 = new byte[bufsize];
            var buffer2 = new byte[bufsize];

            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
            {
                int read3 = stream2.Read(buffer2, 0, 1);
                if (read3 != 0) // not eof
                    return false;

                break;
            }

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            // consumes more cpu
            var task = Task.Run(() =>
            {
                return IsSame(buffer1, buffer2);
            });
            tasks.Add(task);
        }
        while (true);

        Task.WaitAll(tasks.ToArray());
        return !tasks.Any(t => !t.Result);
    }

    public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)
        var buffer1 = new byte[bufsize];
        var buffer2 = new byte[bufsize];

        var tasks = new List<Task<bool>>();
        do
        {
            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
                return stream2.Read(buffer2, 0, 1) == 0; // check not eof

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            if (!IsSame(buffer1, buffer2))
                return false;
        }
        while (true);
    }

    public static bool IsSame(byte[] bytes1, byte[] bytes2)
    {
        if (bytes1 == null)
            throw new ArgumentNullException(nameof(bytes1));

        if (bytes2 == null)
            throw new ArgumentNullException(nameof(bytes2));

        if (bytes1.Length != bytes2.Length)
            return false;

        for (int i = 0; i < bytes1.Length; i++)
        {
            if (bytes1[i] != bytes2[i])
                return false;
        }
        return true;
    }
墟烟 2024-08-10 00:12:27

I have found this works well: first compare the lengths without reading any data, then compare the byte sequences that are read.

private static bool IsFileIdentical(string a, string b)
{            
   if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
   return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}
黎夕旧梦 2024-08-10 00:12:27

Yet another answer, derived from @chsh: MD5 with usings, plus shortcuts for the same file, missing files, and differing lengths:

/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
    if (file1 == file2)
        return true;

    FileInfo file1Info = new FileInfo(file1);
    FileInfo file2Info = new FileInfo(file2);

    if (!file1Info.Exists && !file2Info.Exists)
       return true;
    if (!file1Info.Exists && file2Info.Exists)
        return false;
    if (file1Info.Exists && !file2Info.Exists)
        return false;
    if (file1Info.Length != file2Info.Length)
        return false;

    using (FileStream file1Stream = file1Info.OpenRead())
    using (FileStream file2Stream = file2Info.OpenRead())
    { 
        byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
        byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
        for (int i = 0; i < firstHash.Length; i++)
        {
            if (i>=secondHash.Length||firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
}
删除会话 2024-08-10 00:12:27

Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)

public static void CompareFiles(FileInfo actualFile, FileInfo expectedFile) {
    if (actualFile.Length != expectedFile.Length) {
        throw new Exception($"File {actualFile.Name} has different length in actual and expected directories.");
    }

    // compare the files on a byte level
    using var actualStream   = actualFile.OpenRead();
    using var expectedStream = expectedFile.OpenRead();
    var       actualBuffer   = new byte[1024];
    var       expectedBuffer = new byte[1024];
    int       actualBytesRead;
    int       expectedBytesRead;
    do {
        actualBytesRead   = actualStream.Read(actualBuffer, 0, actualBuffer.Length);
        expectedBytesRead = expectedStream.Read(expectedBuffer, 0, expectedBuffer.Length);
        if (actualBytesRead != expectedBytesRead) {
            throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
        }

        if (!actualBuffer.SequenceEqual(expectedBuffer)) {
            throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
        }
    } while (actualBytesRead > 0);
}

I find the usage of SequenceEqual particularly interesting.

朱染 2024-08-10 00:12:27

Something (hopefully) reasonably efficient:

public class FileCompare
{
    public static bool FilesEqual(string fileName1, string fileName2)
    {
        return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="file1"></param>
    /// <param name="file2"></param>
    /// <param name="bufferSize">8kb seemed like a good default</param>
    /// <returns></returns>
    public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
    {
        if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;

        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        using var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read);
        using var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read);

        while (true)
        {
            var bytesRead1 = ReallyRead(stream1, buffer1, 0, bufferSize);
            var bytesRead2 = ReallyRead(stream2, buffer2, 0, bufferSize);

            if (bytesRead1 != bytesRead2) return false;
            if (bytesRead1 == 0) return true;
            if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
        }
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="array1"></param>
    /// <param name="array2"></param>
    /// <param name="bytesToCompare"> 0 means compare entire arrays</param>
    /// <returns></returns>
    public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
    {
        if (array1.Length != array2.Length) return false;

        var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
        var tailIdx = length - length % sizeof(Int64);

        //check in 8 byte chunks
        for (var i = 0; i < tailIdx; i += sizeof(Int64))
        {
            if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
        }

        //check the remainder of the array, always shorter than 8 bytes
        for (var i = tailIdx; i < length; i++)
        {
            if (array1[i] != array2[i]) return false;
        }

        return true;
    }
    
    private static int ReallyRead(FileStream src, byte[] buffer, int offset, int count){
        int bytesRead = 0;
        do{
            var currentBytesRead = src.Read(buffer, offset + bytesRead, count);
            if(currentBytesRead == 0){
                return Math.Max(0, bytesRead);
            }
            count -= currentBytesRead;
            bytesRead += currentBytesRead;
        }while(count > 0);
        return bytesRead;
    }
}
仙女山的月亮 2024-08-10 00:12:27

I liked the SequenceEqual answers above, but the hash comparison answers looked very messy. I prefer a hash comparison more like this:

    public bool AreFilesEqual(string file1Path, string file2Path)
    {
        string file1Hash = "", file2Hash = "";
        SHA1 sha = new SHA1CryptoServiceProvider();

        using (FileStream fs = File.OpenRead(file1Path))
        {
            byte[] hash;
            hash = sha.ComputeHash(fs);
            file1Hash = Convert.ToBase64String(hash);
        }

        using (FileStream fs = File.OpenRead(file2Path))
        {
            byte[] hash;
            hash = sha.ComputeHash(fs);
            file2Hash = Convert.ToBase64String(hash);
        }

        return (file1Hash == file2Hash);
    }