使用多个 LIKE 语句和 REGEXP 的查询是否可以更高效?
我正在构建一个动态查询来从我的数据库中选择已删除的域名。目前有十几行,但我很快就会获取数据,其中记录将达到 500,000 行。
该架构只是一张包含 4 列的表:
CREATE TABLE `DroppedDomains` (
`domainID` int(11) NOT NULL AUTO_INCREMENT,
`DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
`DropDate` date DEFAULT NULL,
`TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
我没有创建该架构,这是实时数据库架构。以下是示例数据:
我构建了下面可能是最复杂的查询类型。标准如下:
选择任意数量的域
- 以“开始”一词开头
- 以“结束”一词结尾
- 在域名中的任意位置包含“containsThis”一词
- 在域名中的任意位置包含“ContainsThisToo”一词
- 至少包含一位数字
- 域名必须至少包含 49 个字符。需要计算多字节 作为一个字符(我使用 CHAR_LENGTH )。
- 域名必须至少少于 65 个字符。
- TLD 必须是“org”
- DropDate 必须晚于
2009-11-01
这是到目前为止我的查询:
SELECT
*
FROM
DroppedDomains
WHERE
1=1
AND DomainName LIKE 'starts%ends'
AND DomainName LIKE '%containsThis%'
AND DomainName LIKE '%containsThisToo%'
AND DomainName LIKE '%-%'
AND DomainName REGEXP '[0-9]'
AND CHAR_LENGTH(DomainName) > 49
AND CHAR_LENGTH(DomainName) < 65
AND TLD = 'org'
AND DropDate > '2009-11-01'
这是我的问题
考虑到我将有50 万行,如果我将
TLD
列设为其自己的表,并将TLD
列设为该列的外键?只有 5 个 TLD(com、net、org、info、biz)。我意识到现实世界中有更多的 TLD,但该应用程序只有 5 个。用户无法指定自己的 TLD。我知道
REGEXP
和 500,000 行可能会导致灾难。无论如何我可以避免使用REGEXP
吗?我还可以对查询进行其他优化吗?喜欢合并
LIKE
或使用其他功能,例如INSTR
?我应该实现任何特定类型的缓存机制吗?
I'm constructing a dynamic query to select dropped domain names from my database. At the moment there are a dozen rows but I'm going to get data soon which will have records of up to 500,000 rows.
The schema is just one table containing 4 columns:
CREATE TABLE `DroppedDomains` (
`domainID` int(11) NOT NULL AUTO_INCREMENT,
`DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
`DropDate` date DEFAULT NULL,
`TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
I did not create the schema, this is the live database schema. Here's sample data:
I've constructed probably the most complex type of query below. The criteria is as follows:
SELECT any number of domains which
- Start with the word 'starts'
- End with the word 'ends'
- Contain the word 'containsThis' anywhere in the domain name
- Contain the word 'ContainsThisToo' anywhere in the domain name
- Include at least one digit
- The domain name must be at least 49 characters. Multibytes need to count
as one character( I used CHAR_LENGTH
).- The domain name must be at least under 65 characters.
- The TLD must be 'org'
- The DropDate needs to be later than
2009-11-01
Here's my query so far:
SELECT
*
FROM
DroppedDomains
WHERE
1=1
AND DomainName LIKE 'starts%ends'
AND DomainName LIKE '%containsThis%'
AND DomainName LIKE '%containsThisToo%'
AND DomainName LIKE '%-%'
AND DomainName REGEXP '[0-9]'
AND CHAR_LENGTH(DomainName) > 49
AND CHAR_LENGTH(DomainName) < 65
AND TLD = 'org'
AND DropDate > '2009-11-01'
Here are my questions
Would it extremely benefit the performance considering I'll have half a million rows, if I made the
TLD
column its own table and just make theTLD
column a foreign key to that? There will only be 5 TLDs ( com, net, org, info, biz ). I realize there are more TLDs in the real world, but this application will only have 5. The user cannot specify their own TLD.I know that
REGEXP
and 500,000 rows is probably a recipe for disaster. Is there anyway I can avoid theREGEXP
?Are there any other optimizations to the query I can do? Like merge
LIKE
s or use other functions such as maybeINSTR
? And should I implement any specific sort of caching mechanism?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
当您有一个以常量前缀开头的 LIKE 模式并且您在该字段上有一个索引时,那么该索引可用于非常快速地查找以该前缀开头的行。幸运的是,您确实遇到了这种情况:
如果只有少数值以
starts
开头,那么将很快找到这些行,并且仅针对这些行测试其他表达式。您可以通过运行EXPLAIN SELECT ...
来检查索引是否已使用。When you have a LIKE pattern that starts with a constant prefix and you have an index on that field, then the index can be used to find the rows starting with the prefix very quickly. Luckily you have exactly this situation here:
If only a few of the values start with
starts
then these rows will be found very quickly and the other expressions will only be tested for these rows. You can check that the index is used by runningEXPLAIN SELECT ...
.您应该根据您计划使用的查询来计划要创建的索引。
仅通过 DropDate,然后是索引
DropDate 将会很有用。
TLD,那么 TLD 上的索引将是
有用。
仅通过 DomainName 的长度,然后
您可以考虑添加一个字段 DomainNameLength
正是如此(以及对此的索引)所以
长度不是每次都计算的
运行查询的时间。
如果您将使用的唯一查询是您提到的复杂查询,那么马克的建议(关于域名上的索引)是最好的。
关于关于
TLD
字段的问题 1:如果您确实只有少量(例如 5 个)选项,并且您不打算使用所有可用的 tld,则可以使用
ENUM
类型。You should plan the indexes to be created according to the queries you plan to use.
only by DropDate, then an index on
the DropDate will be useful.
TLD, then an index on TLD will be
useful.
only by length of DomainName, then
you may consider adding a field DomainNameLength that
has exactly that (and an index on this) so
the length is not calculated every
time you run the query.
If your only query you'll use is the complex one you mention, then Mark's advice (about an index on DomainName) is best.
Regarding question 1 about
TLD
field:If you are really going to have only a small number (like 5) of options for this and you are not planning to use all available tlds, you could use the
ENUM
type.