Another class instead of SHA1Managed to produce checksums shorter than 128 bytes

Posted 2024-09-25 18:11:42


I have a table with a column (AbsoluteUrl NVARCHAR(2048)) that I need to query against, and comparing each record with my own string takes a long time. The table has at least 1,000,000 records.

Now I think a better solution is to generate a checksum for each AbsoluteUrl and compare checksums instead of comparing against the AbsoluteUrl column directly. So I use the method below to generate the checksum, but I would like another class that produces checksums shorter than 128 bytes.

public static byte[] GenerateChecksumAsByte(string content)
{
    // UTF-8 encode the string and return its 20-byte SHA-1 hash.
    var buffer = Encoding.UTF8.GetBytes(content);
    return new SHA1Managed().ComputeHash(buffer);
}

And is this approach good for my purposes?

UPDATE

Based on the answers, I want to explain in more depth. I'm actually working on a very simple web search engine. Briefly: when all of the URLs of a web page have been extracted (a collection of found URLs), I index them into a Urls table.

UrlId uniqueidentifier NotNull Primary Key (Clustered Index)
AbsoluteUrl nvarchar(2048) NotNull
Checksum varbinary(128) NotNull
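
For reference, here is a minimal sketch of how the Url entity behind this schema might be declared with LINQ to SQL mapping attributes. Only the column names and types come from the schema above; the attribute-based mapping itself is an assumption about how the entity could be defined.

using System;
using System.Data.Linq.Mapping;

[Table(Name = "Urls")]
public class Url
{
    // Primary key, backed by the clustered index on UrlId.
    [Column(IsPrimaryKey = true)]
    public Guid UrlId { get; set; }

    // The full URL being indexed.
    [Column(DbType = "NVarChar(2048) NOT NULL")]
    public string AbsoluteUrl { get; set; }

    // SHA-1 output is 20 bytes, so it fits comfortably in varbinary(128).
    [Column(DbType = "VarBinary(128) NOT NULL")]
    public byte[] Checksum { get; set; }
}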

So I first search the table to see whether the same URL has already been indexed; if not, I create a new record.

public Url Get(byte[] checksum)
{
    return _dataContext.Urls.SingleOrDefault(url => url.Checksum == checksum);
    // Or query by the AbsoluteUrl field instead.
}

And the Save method:

public void Save(Url url)
{
    if (url == null)
        throw new ArgumentNullException("url");

    var origin = _dataContext.Urls.GetOriginalEntityState(url);
    if (origin == null)
    {
        _dataContext.Urls.Attach(url);
        _dataContext.Refresh(RefreshMode.KeepCurrentValues, url);
    }
    else
    {
        _dataContext.Urls.InsertOnSubmit(url);
    }
    _dataContext.SubmitChanges();
}

For example, if I find 2000 URLs on one page, I have to search the table 2000 times.
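
To make that per-URL workflow concrete, here is a minimal usage sketch of Get and Save for each extracted URL. The repository variable, the extractedUrls collection, and the Guid generation are assumptions added for illustration and are not part of the original code.

// Hypothetical per-page loop: index a URL only if its checksum is not already stored.
foreach (var absoluteUrl in extractedUrls)
{
    var checksum = GenerateChecksumAsByte(absoluteUrl);
    if (repository.Get(checksum) == null)
    {
        repository.Save(new Url
        {
            UrlId = Guid.NewGuid(),
            AbsoluteUrl = absoluteUrl,
            Checksum = checksum
        });
    }
}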


Comments (3)

是伱的 2024-10-02 18:11:45


I agree with Steven that you should first try an index on the field to see whether "comparing each record" really is the bottleneck.

However, depending on your database, indexing an NVARCHAR(2048) may not be possible, and really could be the bottleneck. In that case generating checksums actually could improve your search performance if:

  1. You do many more comparisons than inserts.
  2. Comparing the checksum is faster than comparing NVARCHARs.
  3. Most of your checksums are different.

You have not shown us any queries or sample data, so I have no way of knowing if these are true. If they are true, you can indeed improve performance by generating a checksum for each AbsoluteUrl and assuming values are different where these checksums are different. If the checksums are the same, you will have to do a string comparison to see if values match, but if checksums are different you can be sure the strings are different.

In this case a cryptographic checksum is not necessary; you can use a smaller, faster checksum algorithm such as CRC64.

As Steven points out, if your checksums are the same you cannot assume your values are the same. However, if most of your values are different and you have a good checksum, most of your checksums will be different and will not require string comparisons.
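
To illustrate the idea (this is not code from the answer), here is a minimal sketch of a small, fast, non-cryptographic 64-bit checksum. FNV-1a is used as a stand-in because the .NET Framework of that era has no built-in CRC64, so treat the choice of algorithm as an assumption.

using System.Text;

public static class UrlChecksum
{
    // FNV-1a, 64-bit: a simple non-cryptographic hash, shown only as a
    // stand-in for a fast checksum such as CRC64.
    public static long Compute(string content)
    {
        const ulong offsetBasis = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;

        ulong hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(content))
        {
            hash ^= b;
            hash = unchecked(hash * prime);  // overflow is expected and wraps around
        }
        // Reinterpret as a signed long so it can be stored in a single BIGINT column.
        return unchecked((long)hash);
    }
}

As the answer says, equal checksums still require a string comparison to confirm the match; only differing checksums prove the strings differ.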

樱花坊 2024-10-02 18:11:45


No, this is not a good approach.

A million records is no big deal for an indexed field. On the other hand, any checksum or hash you generate can produce false positives due to the pigeonhole principle (a.k.a. the birthday paradox). Making it bigger reduces, but does not eliminate, that chance, while slowing things down to the point where there is no speed gain left.

Just slap an index on the field and see what happens.

南薇 2024-10-02 18:11:44


You want to use a hash with p possible values as a key, expecting at most 1M records (u). To answer this question you first have to do the math...

Solve the following for each hash size to consider: 1 - e ^ (-u^2 / (2 * p))

  • 32-bit: 100% chance of collision
  • 64-bit: 0.00000271% chance of collision
  • 128-bit: 0% (too small to calculate with double precision)

Now you should have enough information to make an informed decision. Here is the code to produce the above calculation on the 64-bit key:

// Approximate birthday-collision probability for a 64-bit key space
// and one million values: 1 - e^(-u^2 / (2 * p)).
double keySize = 64;
double possibleKeys = Math.Pow(2, keySize);   // p = 2^64
double universeSize = 1000000;                // u = expected number of records
double v1, v2;
v1 = -Math.Pow(universeSize, 2);              // -u^2
v2 = 2.0 * possibleKeys;                      // 2 * p
v1 = v1 / v2;
v1 = Math.Pow(2.718281828, v1);               // e^(-u^2 / (2 * p))
v1 = 1.0 - v1;
Console.WriteLine("The resulting percentage is {0:n40}%", v1 * 100.0);

Personally, I'd stick with at least a 128-bit hash myself. Moreover, if collisions could cause any kind of security hole, you need to use at least an SHA-2 hash (SHA-256/SHA-512).

Now, if this is just a database optimization, consider the following:

  1. Add a 32-bit hash code column to the table.
  2. Create a composite key containing both the 32-bit hash and the original string.
  3. Always seek on both the hash and the original string, as sketched below.
  4. Assume the hash is only an optimization and is never unique.
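
Here is a minimal sketch of what that composite lookup could look like with the question's LINQ to SQL context. The HashCode32 column and the ComputeHash32 helper are assumptions introduced for illustration; they are not part of the original schema or code.

public Url Get(string absoluteUrl)
{
    // The hypothetical 32-bit hash narrows the seek; the string comparison confirms the match.
    int hash = ComputeHash32(absoluteUrl);
    return _dataContext.Urls.SingleOrDefault(
        url => url.HashCode32 == hash && url.AbsoluteUrl == absoluteUrl);
}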