Too many columns to index - use MySQL partitioning?

Published 2024-10-07 05:26:01


We have an application with a table of 20+ columns that are all searchable. Building indexes on all of these columns would make write queries very slow, and any really useful index would often have to span multiple columns, increasing the number of indexes needed.

However, for 95% of these searches, only a small subset of the rows needs to be searched - quite a small number, say 50,000 rows.

So we have considered using MySQL partitioned tables - having a column that is essentially isActive, which is what we divide the two partitions by. Most search queries would run with isActive=1, hitting the small 50,000-row partition and completing quickly without other indexes.

The only issue is that isActive is not fixed for a row; i.e. it's not based on the row's date or anything fixed like that; we will need to update isActive based on how the data in that row is used. As I understand it, that is no problem: the row would just be moved from one partition to the other during the UPDATE query.

We do have a PK on the row's id, though, and I am not sure if this is a problem; the manual seems to say that every unique key (including the primary key) must include all columns used in the partitioning expression. That would be a huge problem for us, because the primary key id has no bearing on whether the row is active.
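A minimal sketch of that constraint, with illustrative table and column names: MySQL requires every unique key, including the primary key, to include all columns used in the partitioning expression, so partitioning on isActive alone fails with a PK on id only. The common workaround is to widen the PK to a composite key:

```sql
-- Fails with error 1503: "A PRIMARY KEY must include all
-- columns in the table's partitioning function".
CREATE TABLE items (
    id       INT NOT NULL,
    isActive TINYINT NOT NULL,
    payload  VARCHAR(255),
    PRIMARY KEY (id)
) PARTITION BY LIST (isActive) (
    PARTITION p_active   VALUES IN (1),
    PARTITION p_inactive VALUES IN (0)
);

-- Workaround: composite PK. Note id alone is then no longer
-- enforced as unique by the primary key.
CREATE TABLE items (
    id       INT NOT NULL,
    isActive TINYINT NOT NULL,
    payload  VARCHAR(255),
    PRIMARY KEY (id, isActive)
) PARTITION BY LIST (isActive) (
    PARTITION p_active   VALUES IN (1),
    PARTITION p_inactive VALUES IN (0)
);
```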


别理我 2024-10-14 05:26:01


I am not a MySQL expert. My focus is Oracle, but I've been working with Partitioning for years and I've come to find that your suggested use is very appropriate but not inside the mainstream understanding of partitions.

Index on low cardinality columns

Putting aside index merging for now: let's say that your active rows are somewhat scattered and are a 1:20 ratio with the number of inactive rows. Say your page size is 8Kb and you get about 20 rows per block. If you get a very even distribution of isActive records, you'll have almost one per block. A full table scan that reads EVERY block/page in the table will be much, much faster than using an index to find those same rows.

So let's say they are concentrated instead of evenly scattered. Even if they are concentrated in 20% of the pages, or even 10%, a full table scan can still outperform an index.

So now include index merging. If, after scanning the index on isActive, you do NOT visit the table but instead join those results to the results of ANOTHER index, and that final result set yields reads of, say, less than 5% of your blocks - then yes, an index on isActive plus index merging could be a solution.

The caveat here is that there are a lot of limitations on the implementation of index merges in MySQL. Make sure this works in your situation. But you said you have another 20 fields that may be searched, so unless you index all of them - so that there is always a second index available to merge the isActive index with - you will not get the index merge/join.
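A hedged way to check whether index merging actually kicks in on your schema (table and column names here are illustrative; the optimizer applies index_merge only under specific conditions, so verify with EXPLAIN against your own data):

```sql
-- Assumes single-column indexes exist on both predicates.
CREATE INDEX idx_isactive ON items (isActive);
CREATE INDEX idx_category ON items (category);

-- If the optimizer chooses an index merge, the EXPLAIN output
-- shows type = index_merge and Extra = Using intersect(...).
EXPLAIN
SELECT *
FROM items
WHERE isActive = 1
  AND category = 42;
```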

Partitioning a low cardinality column

Now, if you partition on that column, you'll have 5% of the blocks with isActive = true in them, and they will be densely packed. A full partition scan will quickly yield the list of active records, and every other predicate can be applied as a filter instead of an index seek.

But that flag changes, right?

In Oracle we have a command that lets us enable row movement (ALTER TABLE ... ENABLE ROW MOVEMENT). That means that when isActive changes from true to false, the row is moved to the partition it now belongs in. This is pretty expensive, but only a bit more than the index maintenance that would occur if you indexed that column instead of partitioning by it. In the partitioned example, Oracle first changes the row with an update, then does a delete, and finally an insert. If you had indexed that column instead, you would update the row, then the index entry for true would be deleted and an index entry for false would be created.

If MySQL doesn't have row movement, then you'd have to program your CRUD package to do it: an UPDATE_ROW_ISACTIVE(pk IN number) procedure, something like that, would do the delete and insert for you.
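For what it's worth, MySQL does relocate rows between partitions automatically when an UPDATE changes the value of the partitioning column, so no manual delete-and-insert procedure should be needed (sketch with illustrative names):

```sql
-- With the table partitioned BY LIST (isActive), this UPDATE
-- moves the row from the active partition to the inactive one.
-- Internally it is a delete from one partition plus an insert
-- into the other, so it costs more than an in-place update.
UPDATE items
SET isActive = 0
WHERE id = 12345;
```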

Regarding Konerak's Answer

While I agree that parallel access is ONE use of partitioning, it's not the exclusive one. But if you follow the link he provides, the user comment at the very bottom of the page is:

Beware of having low selectivity indexes on your table. A complex AND/OR WHERE clause will surely make your query very very slow if Index_Merge optimization is being used with an intersect() algorithm.

That seems to speak to your situation, so you can take that comment FWIW.

明明#如月 2024-10-14 05:26:01


If you are going to index that many "columns", you may want to rethink your data structure. For example, make each column a row/record instead. Then have a "group ID" to link the individual records together, and a "name" field to indicate which piece of data it is. Then you only need one index for all your pieces of data.

This name/value pair setup is actually fairly common now and is what some NoSQL databases are based on - which is something else you may want to look into. Something like MongoDB is excellent for indexing "all" pieces of data.
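A minimal sketch of the name/value (entity-attribute-value) layout described above, with illustrative names:

```sql
-- Each searchable attribute becomes its own row; one composite
-- index covers searches on any attribute.
CREATE TABLE item_attributes (
    group_id INT NOT NULL,          -- links the rows of one logical record
    name     VARCHAR(64) NOT NULL,  -- which attribute this row holds
    value    VARCHAR(255),
    PRIMARY KEY (group_id, name),
    INDEX idx_name_value (name, value)
);

-- "Find the records whose category is 42":
SELECT group_id
FROM item_attributes
WHERE name = 'category' AND value = '42';
```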

腻橙味 2024-10-14 05:26:01


You don't need partitions for this - an index on your isActive column alone would be enough. Note that MySQL can use the index merge operation to combine two indexes.

Partitions would be useful when they allow the searches to be executed in parallel: e.g. if you partition by date, you can search 5 partitions simultaneously to find results spanning 5 years.
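A sketch of the date-based partitioning mentioned above (table and column names are illustrative):

```sql
CREATE TABLE events (
    id         INT NOT NULL,
    created_at DATE NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, created_at)   -- PK must include the partitioning column
) PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A query spanning several years scans only the matching
-- partitions; partition pruning skips the rest.
SELECT * FROM events
WHERE created_at >= '2021-06-01' AND created_at < '2023-06-01';
```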

时光磨忆 2024-10-14 05:26:01


Your description of the "table" and the "database" shows classic symptoms of a lack of normalisation. A "table" with 20 searchable columns is not 3NF, and probably not even 1NF. The best advice is to go back to first principles and normalise the data. That will result in much narrower tables and fewer rows per table - but sure, more tables. The result, however, also has fewer indices, per table and overall.

And a much faster database. Fat-wide "tables" are a disaster for performance, at every level.

Partitions do not apply here, they will not ease your problem.

An id PK is an additional column and index: a surrogate, a substitute (but not a replacement) for the real primary key. If you used relational modelling techniques, it could be eliminated, at least getting you down to 19 searchable indices. Any serious work on the "table" will be centred around the real PK, not the surrogate - as you have seen, for example, from the restrictions re partitions.
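Purely for illustration (the real design would depend on the DDL requested below, and all names here are hypothetical), normalising a fat-wide table typically means splitting related attributes into narrow tables keyed on a real primary key:

```sql
-- Hypothetical before: one wide table with 20+ searchable columns.
-- CREATE TABLE emails (id, category, department, priority, owner, ...);

-- Hypothetical after: narrow tables, each with few columns and
-- few indices, joined on the key.
CREATE TABLE email (
    email_id  INT NOT NULL,
    subject   VARCHAR(255),
    PRIMARY KEY (email_id)
);

CREATE TABLE email_workflow (
    email_id    INT NOT NULL,
    workflow_id INT NOT NULL,
    PRIMARY KEY (email_id, workflow_id),
    FOREIGN KEY (email_id) REFERENCES email (email_id)
);
```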

If you wish to discuss it, please post your DDL for the "table", plus every connected "table".

Response to Comments

The table is best thought of as "emails" but with a lot of extra fields (category/department/priority/workflow/owner) which are all properly normalised. There are a range of other variables as well including quite a lot of timestamps.

That's the very definition of a flat file, at 0NF. Unless you are using some unwritten definition of "Normalisation", it is, by your own description, not normalised at all. It is the starting point one has before any normalisation is commenced.

  • No doubt the indices will be fat-wide as well, in order to be useful for queries.

  • And you may not have realised it yet: there is massive data duplication in that file, and there are update anomalies (when you update a column in one row, you have to update the duplicated value in the other rows), which make your application unnecessarily complex.

You need to understand that all the Relational DBMS vendors write Relational database engines that are optimised to handle Relational databases. That means they are optimised for Normalised, not Unnormalised or Denormalised, structures.

I will not be drawn into academic arguments; SO is a question-and-answer site, not a debating site. As requested, post your DDL for the file, and all connected files, and we can definitely (a) give it some speed and (b) avoid 20+ indices (which is another common symptom of the condition). That will deal with a specific real-world issue and solve it, and avoid debate.

Second, you seem to have the roles mixed up. It is you, with the problem, posting the question on SO, and it is me, having fixed hundreds of performance problems, answering. By definition the solution is outside your domain - otherwise you would have solved it and would not be posting a question - so it does not work when you tell me how to fix your problem. That would tie me up in the same limitations that you have, and thus ensure that I cannot fix the problem.

Also from our tests, having lots of tables to JOIN against that we need to include in the WHERE clause only makes the query slower.

Actually, I tune databases for a living, and I have hundreds of tests that demonstrate that joining many smaller tables is faster. It would be interesting to look into that test and the coding capability of the coder, but that would start a debate, so let's not do that; let's stick to the question. If you want an example of (a) serious testing that (b) proves what I have stated before being challenged, here is just one example, fully documented and under the scrutiny of, and with corresponding tests by, stalwarts in the Oracle world.

You may also be interested in this question/answer, which killed the same debate you are approaching.

Joins cost nothing in themselves. The files you join to; the number of records joined on either side; the usefulness of the indices - that is where the cost lies. If it is another unnormalised file (fat, wide, many optional columns), sure, it will be slow.

Anyway, if you are genuinely interested in fixing your posted problem, post all your DDL and we can make it faster for you. If all you want is a yes/no answer re partitions (and to not address the causative problem), that's fine too; you already have that.
