What is the best way to store URLs in MySQL effectively for a read- and write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols). Saving the whole URL (http://example.com/path/?variable=a) into one column seems largely redundant, because the same domain name and path will be saved to MySQL multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, variables, etc.) to get rid of the redundancy. But I have seen some posts saying that this is not recommended. Any thoughts on this?
Also, the application often has to retrieve URLs without primary keys, meaning it has to search text to retrieve a URL. The URL column can be indexed, but I'm wondering how much performance difference there would be between storing the whole URL and the broken-down URL if both are indexed under InnoDB (no full-text indexing).
A broken-down URL has to go through the extra step of recombining the parts. It would also mean retrieving data from four different tables (protocol, domain, path, variables), but it makes the stored data in each row shorter, and there would be fewer rows in each table. Could this speed up the process?
I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:
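A minimal sketch of what such a layout could look like (the table and column names here are my illustration, assuming InnoDB, and not necessarily the original design):

    -- One row per distinct protocol/domain/port combination.
    CREATE TABLE Domain (
        DomainID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Protocol  ENUM('http', 'https') NOT NULL,
        Domain    VARCHAR(255) NOT NULL,
        Port      SMALLINT UNSIGNED NULL,   -- NULL = standard port (80/443)
        UNIQUE KEY uq_domain (Protocol, Domain, Port)
    ) ENGINE=InnoDB;

    -- One row per distinct path on a domain.
    CREATE TABLE DomainPath (
        DomainPathID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        DomainID      INT UNSIGNED NOT NULL,
        Path          TEXT NOT NULL,        -- see the note below about VARCHAR
        FOREIGN KEY (DomainID) REFERENCES Domain (DomainID)
    ) ENGINE=InnoDB;

    -- One row per distinct query string on a path; a "URL" is one row here.
    CREATE TABLE DomainPathQuery (
        DomainPathQueryID  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        DomainPathID       INT UNSIGNED NOT NULL,
        QueryString        TEXT NOT NULL,
        FOREIGN KEY (DomainPathID) REFERENCES DomainPath (DomainPathID)
    ) ENGINE=InnoDB;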
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.
I originally designed this with all parts indexed in a single table (Protocol, Domain, Path, QueryString), but I think the above is less space-intensive and lends itself better to getting good data out of it.
TEXT tends to be slow, so you can change Path to a VARCHAR after some use. Most servers die after about 1K for a URL, but I've seen some larger ones and would err on the side of not losing data. Your retrieval query is cumbersome, but if you abstract it away in your code, it's no issue:
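Against the sketch schema above, the reassembly might look something like this (illustrative; it re-adds the stripped "/" and "?" discussed below):

    SELECT CONCAT(
               d.Protocol, '://', d.Domain,
               IF(d.Port IS NULL, '', CONCAT(':', d.Port)),
               '/', dp.Path,
               IF(dpq.QueryString = '', '', CONCAT('?', dpq.QueryString))
           ) AS URL
      FROM DomainPathQuery AS dpq
      JOIN DomainPath      AS dp  ON dp.DomainPathID = dpq.DomainPathID
      JOIN Domain          AS d   ON d.DomainID      = dp.DomainID
     WHERE dpq.DomainPathQueryID = 12345;  -- the URL's integer ID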
Store a port number if it's non-standard (non-80 for http, non-443 for https), otherwise store it as NULL to signify that it shouldn't be included. (You could push that logic into MySQL, but it gets much uglier.)
I would always (or never) strip the leading "/" from the Path as well as the "?" from the QueryString for space savings. The only loss would be the ability to distinguish between, say, http://example.com/path? and http://example.com/path (a bare "?" with an empty query string versus no "?" at all). If that matters, I would change your tack: never strip them and just include them. Technically, http://example.com and http://example.com/ are the same, so stripping the Path slash is always OK.
So, to parse a URL like http://example.com/path/?variable=a, we would use something like parse_url in PHP to split it into its parts. You would then check whether the domain already exists and insert it if it doesn't (with appropriate locks, not shown):
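A minimal sketch of that check/insert against the illustrative schema above (PDO and the $pdo handle are my assumptions, not the original answer's code):

    <?php
    // Illustrative sketch only; $pdo is an already-open PDO connection.
    $url   = 'http://example.com/path/?variable=a';
    $parts = parse_url($url);
    // $parts => ['scheme' => 'http', 'host' => 'example.com',
    //            'path' => '/path/', 'query' => 'variable=a']

    $protocol = $parts['scheme'];
    $domain   = $parts['host'];
    $port     = $parts['port'] ?? null;            // NULL when no explicit port
    $path     = ltrim($parts['path'] ?? '', '/');  // leading "/" stripped, as above
    $query    = $parts['query'] ?? '';             // parse_url already drops the "?"

    // Check for the domain; <=> is MySQL's NULL-safe equality operator.
    $stmt = $pdo->prepare(
        'SELECT DomainID FROM Domain
          WHERE Protocol = ? AND Domain = ? AND Port <=> ?'
    );
    $stmt->execute([$protocol, $domain, $port]);
    $DomainID = $stmt->fetchColumn();

    if ($DomainID === false) {  // (if it doesn't exist)
        $pdo->prepare('INSERT INTO Domain (Protocol, Domain, Port) VALUES (?, ?, ?)')
            ->execute([$protocol, $domain, $port]);
        $DomainID = $pdo->lastInsertId();
    }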
We then have our $DomainID going forward. Then we do the same for DomainPath and, if the path doesn't exist, insert it similarly:
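Continuing the same sketch, under the same assumptions:

    // Look up the path under this domain; insert it if it's missing.
    $stmt = $pdo->prepare(
        'SELECT DomainPathID FROM DomainPath WHERE DomainID = ? AND Path = ?'
    );
    $stmt->execute([$DomainID, $path]);
    $DomainPathID = $stmt->fetchColumn();

    if ($DomainPathID === false) {
        $pdo->prepare('INSERT INTO DomainPath (DomainID, Path) VALUES (?, ?)')
            ->execute([$DomainID, $path]);
        $DomainPathID = $pdo->lastInsertId();
    }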
We then have our $DomainPathID going forward, and we insert the query string if necessary:
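And the last step, in the same illustrative style:

    // Record the query string under this path if it's new.
    $stmt = $pdo->prepare(
        'SELECT DomainPathQueryID FROM DomainPathQuery
          WHERE DomainPathID = ? AND QueryString = ?'
    );
    $stmt->execute([$DomainPathID, $query]);

    if ($stmt->fetchColumn() === false) {
        $pdo->prepare(
            'INSERT INTO DomainPathQuery (DomainPathID, QueryString) VALUES (?, ?)'
        )->execute([$DomainPathID, $query]);
    }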
Now, let me note, importantly, that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up the SELECTs. In short, the technique is as follows. I deliberately left it out of the above to keep it simple, but comparing one TEXT value to another in a SELECT is slow, and it breaks for really long query strings. Don't use a fixed-length index either, because that will also break. For arbitrary-length strings where accuracy matters, a hash's failure rate is acceptable.
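One common form of the technique, in sketch form (the hash column and index names are illustrative; the verifying text comparison can be dropped if a rare collision is acceptable):

    -- Add an indexed hash of the query string; CHAR(32) (an MD5 hex
    -- digest) is short and fixed-width, so the index stays compact.
    ALTER TABLE DomainPathQuery
        ADD COLUMN QueryStringHash CHAR(32) NOT NULL,
        ADD INDEX idx_dp_hash (DomainPathID, QueryStringHash);

    -- Select by hash first, then verify the full text to rule out
    -- the (rare) collision.
    SELECT DomainPathQueryID
      FROM DomainPathQuery
     WHERE DomainPathID    = 42
       AND QueryStringHash = MD5('variable=a')
       AND QueryString     = 'variable=a';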
Finally, if you can, do the MD5 hash client-side to save sending large blobs to the server just to run the MD5 operation. Most modern languages support MD5 built-in:
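For instance, in PHP:

    // Hash client-side so only the 32-character digest crosses the wire,
    // rather than shipping the full string for MySQL's MD5() to process.
    $QueryStringHash = md5($query);  // PHP's built-in md5()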
But I digress.
It really depends on what you want to do with the data. If you're doing statistics with the URLs, e.g. to see which domains are the most popular, then it would be feasible to break them down that way. But if you are just storing them and accessing each URL in its entirety, there's no reason to split them up.
I've seen some people hash long strings (e.g. with MD5) and search against that; there might be a performance gain for URLs, but I'm not certain by how much (it helps more for lots of text).
Whatever you do, don't forget to use INTs as primary keys as much as possible, as those are the quickest lookups.
If you really want to split your URLs, you may want to consider keeping separate tables in order not to lock your table (in InnoDB it doesn't matter, as the table doesn't get locked up); with separate tables, you can just use foreign/primary keys (INTs) to reference the rows you need.
A good read is FriendFeed's blog entry; that might give you some ideas as well.