对不是很长的字符串向量(url)进行哈希处理的最佳方法是什么?

发布于 2024-11-02 17:29:26 字数 241 浏览 0 评论 0 原文

我现在正在处理url分类。我用“/?”等对 url 进行分区,生成一堆部分。在这个过程中,我需要将第一部分哈希到第k部分,比如说k=2,那么对于“http://stackoverflow.com/questions/ask”,关键是一个字符串向量“stackoverflow.com questions” 。目前,哈希就像Hash。但它会消耗大量内存。我想知道 MD5 是否有帮助或者是否有其他替代方案。实际上,我不需要精确地恢复密钥,只要区分不同的密钥即可。 谢谢!

I am now dealing with url classification. I partition url with "/?", etc, generating a bunch of parts. In the process, I need to hash the first part to the kth part, say, k=2, then for "http://stackoverflow.com/questions/ask", the key is a string vector "stackoverflow.com questions". Currently, the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help or are there other alternatives. In effect, I do not need to recover the key exactly, as long as differentiating different keys.
Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

儭儭莪哋寶赑 2024-11-09 17:29:26

消耗大量内存

如果您的代码已经可以运行,您可能需要考虑保持原样。如果你没有目标,你就不知道什么时候完成。您确定在您的情况下“很多”与“太多”同义吗?

如果您决定确实需要更改工作代码,则应该考虑可用的各种选项,而不是听信某人对特定算法的说法:

不确定内存影响,它肯定会改变你的性能配置文件,但你也可以考虑使用 Tries:

http://en.wikipedia.org/wiki/Trie

It consumes a lot of memory

If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?

If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:

etc

Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:

http://en.wikipedia.org/wiki/Trie

白芷 2024-11-09 17:29:26

MD5 是一个很好的哈希码,适用于安全性不成问题的情况。它速度快且长度相当长(128 位对于大多数应用程序来说就足够了)。而且分布也非常好。

Adler32 将是一个可能的替代方案。实现起来非常简单,只需几行代码。它甚至比 MD5 还要快。对于许多应用程序来说它足够长/足够好(尽管对于许多应用程序来说不是)。
(我知道 Adler32 严格来说不是哈希码,但对于许多应用程序来说它仍然可以很好地工作)

但是,如果存储哈希码消耗大量内存,您可以随时截断哈希码,或使用 XOR 来“缩小”它。例如

uint8_t md5[16];
GetMD5(md5, ...);

// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];

// assemble the parts into one uint32_t
uint32_t const hash = md5[0] + (md5[1] << 8) + (md5[2] << 16) + (md5[3] << 24);

,我个人认为 MD5 有点过分了。看看 Adler32,我想就可以了。


编辑

我必须纠正自己:对于短字符串(少于几千字节),Adler23 是一个相当糟糕的选择。我完全忘记了这一点。但总有一个显而易见的事实:CRC32。不如 Adler23 快(与 MD5 速度大致相同),但仍然易于实施,而且还有大量具有各种许可证的现有实施。

MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.

Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster then MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)

However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.

uint8_t md5[16];
GetMD5(md5, ...);

// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];

// assemble the parts into one uint32_t
uint32_t const hash = md5[0] + (md5[1] << 8) + (md5[2] << 16) + (md5[3] << 24);

Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.


EDIT

I have to correct myself: Adler23 is a rather poor choice for short strings (less then a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler23 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.

叹倦 2024-11-09 17:29:26

如果您只是想查明两个 URL 是否相同,您是否考虑过存储服务器 IP 地址的二进制版本?如果两个服务器名称解析为同一地址,这是错误的还是对您的应用程序有利?

If you're only trying to find out if two URL's are the same have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address is that incorrect or an advantage for your application?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文