What is the best way to store URLs in MySQL effectively for a read- and write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols). Saving the whole URL (http://example.com/path/?variable=a) into one column seems largely redundant, because the same domain name and path will be saved to MySQL multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, variables, etc.) to get rid of the redundancy. But I have seen some posts saying that this is not recommended. Any thoughts on this?
Also, the application often has to retrieve URLs without primary keys, meaning it has to search text to retrieve a URL. The URL column can be indexed, but I'm wondering how much performance difference there would be between storing the whole URL and the broken-down URL if both are indexed under InnoDB (no full-text indexing).
A broken-down URL has to go through the extra step of recombining the parts. It would also mean retrieving data from four different tables (protocol, domain, path, variables), but it makes the stored data in each row shorter, and there would be fewer rows in each table. Could this speed up the process?
I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:
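A minimal sketch of what such a layout could look like (the table and column names here are my illustration, assuming InnoDB, and not necessarily the original design):

    -- One row per distinct protocol/domain/port combination.
    CREATE TABLE Domain (
        DomainID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Protocol  ENUM('http', 'https') NOT NULL,
        Domain    VARCHAR(255) NOT NULL,
        Port      SMALLINT UNSIGNED NULL,   -- NULL = standard port (80/443)
        UNIQUE KEY uq_domain (Protocol, Domain, Port)
    ) ENGINE=InnoDB;

    -- One row per distinct path on a domain.
    CREATE TABLE DomainPath (
        DomainPathID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        DomainID      INT UNSIGNED NOT NULL,
        Path          TEXT NOT NULL,        -- see the note below about VARCHAR
        FOREIGN KEY (DomainID) REFERENCES Domain (DomainID)
    ) ENGINE=InnoDB;

    -- One row per distinct query string on a path; a "URL" is one row here.
    CREATE TABLE DomainPathQuery (
        DomainPathQueryID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        DomainPathID       INT UNSIGNED NOT NULL,
        QueryString        TEXT NOT NULL,
        FOREIGN KEY (DomainPathID) REFERENCES DomainPath (DomainPathID)
    ) ENGINE=InnoDB;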
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.
I originally designed this with all parts indexed in a single table (Protocol, Domain, Path, QueryString), but I think the above is less space-intensive and lends itself better to getting good data out of it.
TEXT tends to be slow, so you can change Path to a VARCHAR after some use. Most servers die after about 1K for a URL, but I've seen some larger ones and would err on the side of not losing data. Your retrieval query is cumbersome, but if you abstract it away in your code, it's no issue:
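Against the sketch schema above, the reassembly might look something like this (illustrative; it re-adds the stripped "/" and "?" discussed below):

    SELECT CONCAT(
               d.Protocol, '://', d.Domain,
               IF(d.Port IS NULL, '', CONCAT(':', d.Port)),
               '/', dp.Path,
               IF(dpq.QueryString = '', '', CONCAT('?', dpq.QueryString))
           ) AS URL
      FROM DomainPathQuery AS dpq
      JOIN DomainPath      AS dp  ON dp.DomainPathID = dpq.DomainPathID
      JOIN Domain          AS d   ON d.DomainID      = dp.DomainID
     WHERE dpq.DomainPathQueryID = 12345;  -- the URL's integer ID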
Store a port number if it's non-standard (non-80 for http, non-443 for https), otherwise store it as NULL to signify that it shouldn't be included. (You could push that logic into MySQL, but it gets much uglier.)
I would always (or never) strip the leading "/" from the Path as well as the "?" from the QueryString for space savings. The only loss would be the ability to distinguish between, say, http://example.com/path? and http://example.com/path (a bare "?" with an empty query string versus no "?" at all). If that matters, I would change your tack: never strip them and just include them. Technically, http://example.com and http://example.com/ are the same, so stripping the Path slash is always OK.
So, to parse a URL like http://example.com/path/?variable=a, we would use something like parse_url in PHP to split it into its parts. You would then check whether the domain already exists and insert it if it doesn't (with appropriate locks, not shown):
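A minimal sketch of that check/insert against the illustrative schema above (PDO and the $pdo handle are my assumptions, not the original answer's code):

    <?php
    // Illustrative sketch only; $pdo is an already-open PDO connection.
    $url   = 'http://example.com/path/?variable=a';
    $parts = parse_url($url);
    // $parts => ['scheme' => 'http', 'host' => 'example.com',
    //            'path' => '/path/', 'query' => 'variable=a']

    $protocol = $parts['scheme'];
    $domain   = $parts['host'];
    $port     = $parts['port'] ?? null;            // NULL when no explicit port
    $path     = ltrim($parts['path'] ?? '', '/');  // leading "/" stripped, as above
    $query    = $parts['query'] ?? '';             // parse_url already drops the "?"

    // Check for the domain; <=> is MySQL's NULL-safe equality operator.
    $stmt = $pdo->prepare(
        'SELECT DomainID FROM Domain
          WHERE Protocol = ? AND Domain = ? AND Port <=> ?'
    );
    $stmt->execute([$protocol, $domain, $port]);
    $DomainID = $stmt->fetchColumn();

    if ($DomainID === false) {  // (if it doesn't exist)
        $pdo->prepare('INSERT INTO Domain (Protocol, Domain, Port) VALUES (?, ?, ?)')
            ->execute([$protocol, $domain, $port]);
        $DomainID = $pdo->lastInsertId();
    }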
We then have our $DomainID going forward. Then we do the same for DomainPath and, if the path doesn't exist, insert it similarly:
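Continuing the same sketch, under the same assumptions:

    // Look up the path under this domain; insert it if it's missing.
    $stmt = $pdo->prepare(
        'SELECT DomainPathID FROM DomainPath WHERE DomainID = ? AND Path = ?'
    );
    $stmt->execute([$DomainID, $path]);
    $DomainPathID = $stmt->fetchColumn();

    if ($DomainPathID === false) {
        $pdo->prepare('INSERT INTO DomainPath (DomainID, Path) VALUES (?, ?)')
            ->execute([$DomainID, $path]);
        $DomainPathID = $pdo->lastInsertId();
    }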
We then have our $DomainPathID going forward, and we insert the query string if necessary:
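And the last step, in the same illustrative style:

    // Record the query string under this path if it's new.
    $stmt = $pdo->prepare(
        'SELECT DomainPathQueryID FROM DomainPathQuery
          WHERE DomainPathID = ? AND QueryString = ?'
    );
    $stmt->execute([$DomainPathID, $query]);

    if ($stmt->fetchColumn() === false) {
        $pdo->prepare(
            'INSERT INTO DomainPathQuery (DomainPathID, QueryString) VALUES (?, ?)'
        )->execute([$DomainPathID, $query]);
    }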
Now, let me note, importantly, that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up the SELECTs. In short, the technique is as follows. I deliberately left it out of the above to keep it simple, but comparing one TEXT value to another in a SELECT is slow, and it breaks for really long query strings. Don't use a fixed-length index either, because that will also break. For arbitrary-length strings where accuracy matters, a hash's failure rate is acceptable.
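One common form of the technique, in sketch form (the hash column and index names are illustrative; the verifying text comparison can be dropped if a rare collision is acceptable):

    -- Add an indexed hash of the query string; CHAR(32) (an MD5 hex
    -- digest) is short and fixed-width, so the index stays compact.
    ALTER TABLE DomainPathQuery
        ADD COLUMN QueryStringHash CHAR(32) NOT NULL,
        ADD INDEX idx_dp_hash (DomainPathID, QueryStringHash);

    -- Select by hash first, then verify the full text to rule out
    -- the (rare) collision.
    SELECT DomainPathQueryID
      FROM DomainPathQuery
     WHERE DomainPathID    = 42
       AND QueryStringHash = MD5('variable=a')
       AND QueryString     = 'variable=a';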
Finally, if you can, do the MD5 hash client-side to save sending large blobs to the server just to run the MD5 operation. Most modern languages support MD5 built-in:
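For instance, in PHP:

    // Hash client-side so only the 32-character digest crosses the wire,
    // rather than shipping the full string for MySQL's MD5() to process.
    $QueryStringHash = md5($query);  // PHP's built-in md5()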
But I digress.
It really depends on what you want to do with the data. If you're doing statistics with the URLs, e.g. to see which domains are the most popular, then it would be feasible to break them down that way. But if you are just storing them and accessing each URL in its entirety, there's no reason to split them up.
I've seen some people hash long strings (e.g. with MD5) and search against that; there might be a performance gain for URLs, but I'm not certain by how much (it helps more for lots of text).
Whatever you do, don't forget to use INTs as primary keys as much as possible, as those are the quickest lookups.
If you really want to split your URLs, you may want to consider keeping separate tables in order not to lock your table (in InnoDB it doesn't matter, as the table doesn't get locked up); with separate tables, you can just use foreign/primary keys (INTs) to reference the rows you need.
A good read is FriendFeed's blog entry; that might give you some ideas as well.