使用多个 LIKE 语句和 REGEXP 的查询是否可以更高效?

发布于 2024-10-21 12:18:48 字数 1734 浏览 1 评论 0原文

我正在构建一个动态查询来从我的数据库中选择已删除的域名。目前有十几行,但我很快就会获取数据,其中记录将达到 500,000 行。

该架构只是一张包含 4 列的表:

CREATE TABLE `DroppedDomains` (
  `domainID` int(11) NOT NULL AUTO_INCREMENT,
  `DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DropDate` date DEFAULT NULL,
  `TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

我没有创建该架构,这是实时数据库架构。以下是示例数据:

在此处输入图像描述

我构建了下面可能是最复杂的查询类型。标准如下:

选择任意数量的域

  1. 以“开始”一词开头
  2. 以“结束”一词结尾
  3. 在域名中的任意位置包含“containsThis”一词
  4. 在域名中的任意位置包含“ContainsThisToo”一词
  5. 至少包含一位数字
  6. 域名必须至少包含 49 个字符。需要计算多字节 作为一个字符(我使用 CHAR_LENGTH )。
  7. 域名必须至少少于 65 个字符。
  8. TLD 必须是“org”
  9. DropDate 必须晚于 2009-11-01

这是到目前为止我的查询:

SELECT
*
FROM
DroppedDomains

WHERE

1=1

AND DomainName LIKE 'starts%ends'
AND DomainName LIKE '%containsThis%'
AND DomainName LIKE '%containsThisToo%'
AND DomainName LIKE '%-%'
AND DomainName REGEXP '[0-9]'
AND CHAR_LENGTH(DomainName) > 49
AND CHAR_LENGTH(DomainName) < 65
AND TLD = 'org'
AND DropDate > '2009-11-01'

这是我的问题

  1. 考虑到我将有50 万行,如果我将 TLD 列设为其自己的表,并将 TLD 列设为该列的外键?只有 5 个 TLD(com、net、org、info、biz)。我意识到现实世界中有更多的 TLD,但该应用程序只有 5 个。用户无法指定自己的 TLD。

  2. 我知道 REGEXP 和 500,000 行可能会导致灾难。无论如何我可以避免使用REGEXP吗?

  3. 我还可以对查询进行其他优化吗?喜欢合并 LIKE 或使用其他功能,例如 INSTR?我应该实现任何特定类型的缓存机制吗?

I'm constructing a dynamic query to select dropped domain names from my database. At the moment there are a dozen rows but I'm going to get data soon which will have records of up to 500,000 rows.

The schema is just one table containing 4 columns:

CREATE TABLE `DroppedDomains` (
  `domainID` int(11) NOT NULL AUTO_INCREMENT,
  `DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DropDate` date DEFAULT NULL,
  `TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

I did not create the schema, this is the live database schema. Here's sample data:

enter image description here

I've constructed probably the most complex type of query below. The criteria is as follows:

SELECT any number of domains which

  1. Start with the word 'starts'
  2. End with the word 'ends'
  3. Contain the word 'containsThis' anywhere in the domain name
  4. Contain the word 'ContainsThisToo' anywhere in the domain name
  5. Include at least one digit
  6. The domain name must be at least 49 characters. Multibytes need to count
    as one character( I used CHAR_LENGTH
    ).
  7. The domain name must be at least under 65 characters.
  8. The TLD must be 'org'
  9. The DropDate needs to be later than 2009-11-01

Here's my query so far:

SELECT
*
FROM
DroppedDomains

WHERE

1=1

AND DomainName LIKE 'starts%ends'
AND DomainName LIKE '%containsThis%'
AND DomainName LIKE '%containsThisToo%'
AND DomainName LIKE '%-%'
AND DomainName REGEXP '[0-9]'
AND CHAR_LENGTH(DomainName) > 49
AND CHAR_LENGTH(DomainName) < 65
AND TLD = 'org'
AND DropDate > '2009-11-01'

Here are my questions

  1. Would it extremely benefit the performance considering I'll have half a million rows, if I made the TLD column its own table and just make the TLD column a foreign key to that? There will only be 5 TLDs ( com, net, org, info, biz ). I realize there are more TLDs in the real world, but this application will only have 5. The user cannot specify their own TLD.

  2. I know that REGEXP and 500,000 rows is probably a recipe for disaster. Is there anyway I can avoid the REGEXP?

  3. Are there any other optimizations to the query I can do? Like merge LIKEs or use other functions such as maybe INSTR? And should I implement any specific sort of caching mechanism?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

信愁 2024-10-28 12:18:48

当您有一个以常量前缀开头的 LIKE 模式并且您在该字段上有一个索引时,那么该索引可用于非常快速地查找以该前缀开头的行。幸运的是,您确实遇到了这种情况:

AND DomainName LIKE 'starts%ends'

如果只有少数值以 starts 开头,那么将很快找到这些行,并且仅针对这些行测试其他表达式。您可以通过运行EXPLAIN SELECT ...来检查索引是否已使用。

When you have a LIKE pattern that starts with a constant prefix and you have an index on that field, then the index can be used to find the rows starting with the prefix very quickly. Luckily you have exactly this situation here:

AND DomainName LIKE 'starts%ends'

If only a few of the values start with starts then these rows will be found very quickly and the other expressions will only be tested for these rows. You can check that the index is used by running EXPLAIN SELECT ....

毅然前行 2024-10-28 12:18:48

您应该根据您计划使用的查询来计划要创建的索引。

  • 如果您有过滤查询
    仅通过 DropDate,然后是索引
    DropDate 将会很有用。
  • 如果您有按以下分组的查询
    TLD,那么 TLD 上的索引将是
    有用。
  • 如果您有搜索查询
    仅通过 DomainName 的长度,然后
    您可以考虑添加一个字段 DomainNameLength
    正是如此(以及对此的索引)所以
    长度不是每次都计算的
    运行查询的时间。
  • 如果您有通过两个字段(例如 TLD 和 DropDate)搜索(过滤)的查询,那么您可能需要在这些字段上使用 2 列索引。
  • 等等...

如果您将使用的唯一查询是您提到的复杂查询,那么马克的建议(关于域名上的索引)是最好的。

关于关于 TLD 字段的问题 1:

如果您确实只有少量(例如 5 个)选项,并且您不打算使用所有可用的 tld,则可以使用 ENUM 类型

CREATE TABLE(
   ....
   tld ENUM('com', 'net', 'org', 'info', 'biz')
)

You should plan the indexes to be created according to the queries you plan to use.

  • if you'll have queries that filter
    only by DropDate, then an index on
    the DropDate will be useful.
  • if you'll have queries that group by
    TLD, then an index on TLD will be
    useful.
  • if you'll have queries that search
    only by length of DomainName, then
    you may consider adding a field DomainNameLength that
    has exactly that (and an index on this) so
    the length is not calculated every
    time you run the query.
  • if you'll have queries that search (filter) by two fields (e.g. TLD and DropDate), then you probably need a 2-column index on these fields.
  • etc...

If your only query you'll use is the complex one you mention, then Mark's advice (about an index on DomainName) is best.

Regarding question 1 about TLD field:

If you are really going to have only a small number (like 5) of options for this and you are not planning to use all available tlds, you could use the ENUM type.

CREATE TABLE(
   ....
   tld ENUM('com', 'net', 'org', 'info', 'biz')
)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文