URL or hash index, with RAM in mind
I am working on a project which needs to add or update around 1 million URLs daily. Some days are mostly updates, some days are mostly additions, and some days are a mix.
So, on every query the URL's uniqueness has to be checked against the url table.
How can the URL lookup be made really fast? At the moment the index is on the url column and it works well, but in the coming weeks, with millions of new records being added, RAM will not be enough if the index stays on that column.
That's why I am looking for a solution that keeps lookups fast once there are 150+ million URLs in total. I am thinking of indexing an MD5 hash, but I worry about collisions. A friend tipped me to also calculate a CRC32 hash and concatenate it with the MD5 to make the collision possibility effectively zero, storing the result in a BINARY(20) column; that way the index would take only 20 bytes per row instead of the 255 of the current VARCHAR(255) url column.
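For illustration, here is a minimal sketch of how such a 20-byte key could be computed at the application level, assuming Python with the standard hashlib and zlib modules (the function name is made up; the key would be stored in the proposed BINARY(20) column and indexed instead of the full VARCHAR(255) url):

    import hashlib
    import zlib

    def url_key(url: str) -> bytes:
        """Build a 20-byte lookup key: 16-byte MD5 digest + 4-byte CRC32."""
        data = url.encode("utf-8")
        md5_part = hashlib.md5(data).digest()           # 16 bytes
        crc_part = zlib.crc32(data).to_bytes(4, "big")  # 4 bytes, unsigned
        return md5_part + crc_part                      # fits BINARY(20)

    key = url_key("http://www.somesite.com.tv/images/picture01.jpg")
    assert len(key) == 20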
Currently there are around 50 million URLs in total, and with 8 GB of RAM it's working fine.
Yesterday, I asked a question, url text compression (not shortening) and storing in mysql, related to the same project.
[Edit]
I have thought of another solution: store only the CRC32 hash, in decimal (integer) form, to speed up the lookup, and add a check at the application level on how many records are returned. If more than one record is returned, the exact URL should also be matched.
That way collisions would also be handled while keeping the load on RAM and disk space low, since each row stores 4 bytes instead of 20 bytes (MD5 + CRC32). What do you say?
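A rough sketch of that lookup flow, assuming Python and a DB-API cursor with %s placeholders (e.g. PyMySQL); the urls table and the crc/url column names are placeholders. It compares the exact URL on every hit, which also covers the case where a single returned row is itself a collision:

    import zlib

    def find_url_id(cursor, url: str):
        """Look up a URL via its 4-byte CRC32, resolving collisions in the application."""
        crc = zlib.crc32(url.encode("utf-8"))  # unsigned 32-bit value, fits INT UNSIGNED
        cursor.execute("SELECT id, url FROM urls WHERE crc = %s", (crc,))
        rows = cursor.fetchall()
        # Several URLs can share one CRC32, so check the stored URL on every hit.
        for row_id, stored_url in rows:
            if stored_url == url:
                return row_id
        return None  # not present, safe to INSERT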
1 Answer
After reading all your questions (unique constraint makes hashes useless?, 512 bit hash vs 4 128-bit hashes, and url text compression (not shortening) and storing in mysql), I understand that your problem is more or less the following: you need to check, quickly and within limited RAM, whether each incoming URL already exists in a table that will grow to 150+ million rows.
Is that it?
The following points are important:
What is the format of the URLs you'll save? Will you need to read the URL back, or just update information about it but never search based on partial URLs, etc.?
Assuming URL = "http://www.somesite.com.tv/images/picture01.jpg" and that you want to store everything, including the filename. If it's different, please provide more details or correct my assumptions.
You can save space by replacing some groups of characters in the URL. Not all ASCII characters are valid in a URL, as you can see in RFC 1738, so you can use the invalid ones to represent (and compress) the URL. For example, using character 0x81 to represent "http://" saves you 6 characters, 0x82 to represent ".jpg" saves another 3 bytes, and so on.
Some words might be very common (like "image", "picture", "video", "user"). If you choose to use characters 0x90 through 0x9f plus any other character (so 0x90 0x01, 0x90 0x02, ..., 0x90 0xfa) to encode such words, you can have 16 * 256 = 4,096 "dictionary entries" to encode the most used words. You'll use 2 bytes to represent 4 to 8 characters.
Edit: as you can read in the RFC mentioned above, a URL may contain only printable ASCII characters. This means only characters 0x20 to 0x7F should be used, with some further restrictions noted in the RFC. So any character from 0x80 upwards (hexadecimal notation; character 128 decimal in the ASCII table) shouldn't appear in a URL. You can therefore choose one character (say 0x90) as a flag meaning "the following byte is an index into the dictionary". One flag character (0x90) * 256 possible following bytes (0x00 to 0xFF) = 256 dictionary entries. But you can also choose to use characters 0x90 to 0x9f (144 to 159 in decimal) as dictionary flags, giving you 16 * 256 possibilities...
These two methods can save you a lot of space in your database and are reversible, with no need to worry about collisions. You simply create a dictionary in your application and encode/decode URLs with it, very fast, making your database much lighter.
Since you already have 50M+ URLs, you can gather statistics from them to build a better dictionary.
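A toy sketch of this kind of dictionary substitution, assuming Python; the token table is invented for illustration and would in practice be derived from statistics over your existing URLs:

    # Bytes 0x80 and above never occur in a valid URL (printable ASCII only),
    # so they can safely act as single-byte dictionary markers.
    TOKENS = {
        "http://www.": b"\x81",
        "https://":    b"\x82",
        ".jpg":        b"\x83",
        "/images/":    b"\x84",
    }

    def encode_url(url: str) -> bytes:
        out = url.encode("ascii")
        for text, marker in TOKENS.items():
            out = out.replace(text.encode("ascii"), marker)
        return out

    def decode_url(blob: bytes) -> str:
        for text, marker in TOKENS.items():
            blob = blob.replace(marker, text.encode("ascii"))
        return blob.decode("ascii")

    url = "http://www.somesite.com.tv/images/picture01.jpg"
    packed = encode_url(url)
    assert decode_url(packed) == url
    print(len(url), "->", len(packed), "bytes")

The two-byte flags described above (0x90 0x01, 0x90 0x02, ...) extend the same idea to a few thousand dictionary entries.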
Using hashes: hashes, in this case, are a tradeoff between size and safety. How bad would it be if you got a collision?
And in this case you can use the birthday paradox to help you.
Read that article to understand the problem: if all inputs (possible characters in the URL) were equally likely, you could estimate the probability of a collision. And you could calculate the opposite: given your acceptable collision probability and your number of URLs, how broad should your range be? And your range is exactly determined by the number of bits produced by the hash function...
Edit: if you have a hash function that gives you 128 bits, you'll have 2^128 possible outcomes. So your "range" in the birthday paradox is 2^128: it's as if your year had 2^128 days instead of 365. You then calculate the probability of a collision ("two files being born on the same day", with a year that has 2^128 days instead of 365). If you choose a hash that gives you 512 bits, your range would go from 0 to 2^512...
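To put rough numbers on this, here is a back-of-the-envelope check using the usual birthday approximation p ≈ 1 - e^(-n^2 / 2d), where n is the number of URLs and d = 2^bits; the 150 million figure is taken from the question:

    import math

    def collision_probability(n_items: int, hash_bits: int) -> float:
        """Birthday-paradox approximation: p ~= 1 - exp(-n^2 / (2 * d))."""
        d = 2.0 ** hash_bits
        # expm1 keeps precision when the probability is extremely small
        return -math.expm1(-(n_items ** 2) / (2.0 * d))

    n = 150_000_000  # projected number of URLs
    print(collision_probability(n, 32))   # CRC32 alone: ~1.0, collisions certain
    print(collision_probability(n, 128))  # MD5: ~3e-23, effectively negligible

So at this scale a 32-bit hash on its own will collide many times (which is why the application-level exact-URL check matters), while 128 bits already makes a collision vanishingly unlikely.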
And, again, keep the RFC in mind: not all bytes (256 characters) are valid in the internet/URL world, so the probability of collisions decreases. Better for you :).