MyISAM 密钥长度限制，试图通过字段的 md5 来加快速度

发布于 2024-10-08 04:18:56 字数 239 浏览 2 评论 0原文

我在小型 MyISAM 表上工作 - 大约 30k 条目，大小 - 10mb。其中一个字段是 varchar(500+)，因为我使用 utf8_unicode_ci 我无法索引该字段（我达到 1000 字节限制），同时我需要基于该字段执行大量“get_or_create”查询。我正在尝试优化数据库，但速度仍然很慢。

创建附加字段（该字段将保存 varchar 值的 md5 并对其进行索引/用于查找）是一个很好的解决方案吗？有人尝试过这种方法吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

你的心境我的脸 2024-10-15 04:18:56

对我来说，使用如此宽的列作为键似乎是一个坏主意，但除此之外，你绝对可以做你建议的事情。你甚至不需要使用MD5，你所需要的只是一个产生很少冲突的哈希函数，但唯一性不是必需的。 CRC32 产生的值很小，而且速度非常快。

假设您的表如下所示：

CREATE TABLE data (lots_of_text VARCHAR(500));

将其更改为：

CREATE TABLE data (text_hash INT, lots_of_text VARCHAR(500), INDEX (text_hash));

当您插入行时，您

INSERT INTO data (lots_of_text, text_hash) 
VALUES ("lots and lots of text", CRC32("lots and lots of text"));

会执行以下操作：然后您可以像这样检索行：

SELECT lots_of_text FROM data
WHERE text_hash = CRC32("lots and lots of text")
AND lots_of_text = "lots and lots of text";

查询将使用 text_hash 上的索引，但由于 CRC32 不会生成唯一值，您仍然需要检查 lots_of_text 字段是否相等 - 但查询仍然很快，因为最多有几行具有相同的哈希值。

其变体是使用前 50 个字符左右作为散列，具有相同前 50 个字符的行数很可能很少。

To me it seems like a bad idea to use such a wide column as a key, but that aside you can definitely do something like what you suggest. You don't even need to use MD5, all you need is a hash function that produces few collisions, but uniqueness is not necessary. CRC32 produces a small value, and is very fast.

Say your table looks like this:

CREATE TABLE data (lots_of_text VARCHAR(500));

change it to this:

CREATE TABLE data (text_hash INT, lots_of_text VARCHAR(500), INDEX (text_hash));

and when you insert rows you do:

INSERT INTO data (lots_of_text, text_hash) 
VALUES ("lots and lots of text", CRC32("lots and lots of text"));

and then you can retrieve rows like this:

SELECT lots_of_text FROM data
WHERE text_hash = CRC32("lots and lots of text")
AND lots_of_text = "lots and lots of text";

the query will use the index on text_hash, but since CRC32 will not produce unique values you still need to check the lots_of_text field for equality -- but the query will still be quick since at most a few rows will have the same hash.

A variant of this is to use the first 50 characters or so as a hash, the number of rows having the same first 50 characters is very likely to be low.

回复收藏 0 原文

~没有更多了~