MySQL: indexes on a large MyISAM table (40 million records) are very slow and huge on disk

Posted on 2024-08-20 01:45:26

The table contains about 40,000,000 records and is defined as:

CREATE TABLE `event` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY  (`id`),
  KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
  KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8 

You should know that the some_other_id_not_fk column has very low cardinality: it contains only 7 distinct values. The real pain is the event_time DATETIME column, which holds a huge number of distinct values, and basically anything is allowed there: duplicates as well as unpredictably long time intervals with no records in them. You should also know that the (some_other_id_not_fk, event_time) pair must be allowed to have duplicates too :( I know this causes even more problems :(

I've had some experience optimizing MySQL tables, but a pain this big has never appeared on my horizon :/

The current state of 'the things' is:

  • Selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. :)
  • My inserts are slow, I mean really SLOW!!! More than 30 seconds. Even worse, LOAD DATA procedures that temporarily DISABLE and ENABLE KEYS are EXTREMELY slow (several hours), mainly in the ENABLE KEYS step.
  • The index is 7 times larger on disk than the data.
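
For anyone wanting to reproduce the data-to-index ratio from the last bullet, a minimal sketch is below. It assumes the table lives in the currently selected database; the column aliases are made up for illustration.

```sql
-- Compare data and index size for the table (MySQL reports both in bytes).
SELECT TABLE_NAME,
       ROUND(DATA_LENGTH  / POW(1024, 2))   AS data_mb,
       ROUND(INDEX_LENGTH / POW(1024, 2))   AS index_mb,
       ROUND(INDEX_LENGTH / DATA_LENGTH, 1) AS index_to_data_ratio
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'event';
```

Running it before and after each index change gives a cheap way to see what a change actually bought, without guessing from file sizes.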

I would have tried several different re-indexing combinations by now, but the size of the data really prevents me from experimenting with dropping and creating indexes and columns at will.

Has anyone managed to deal with this? Would using TIMESTAMP instead of DATETIME solve my problem? Or should I add extra columns for day, year, etc. and index those?

Comments (5)

你好,陌生人 2024-08-27 01:45:26
`id` bigint(20) unsigned NOT NULL auto_increment,

Do you really need a BIGINT? You can probably get away with an INT. At one row per second, around the clock, it would take about 136 years to exhaust all values of an unsigned 32-bit integer (at a sustained 1,000 rows per second it would take about 50 days, so check your real insert rate).

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.
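
As a rough check of the figures for the id column, the arithmetic can be run directly in MySQL. This is only a back-of-the-envelope sketch: the insert rates are illustrative, and the saving covers the data file only, without any index overhead.

```sql
SELECT POW(2, 32) / (60 * 60 * 24 * 365.25) AS years_at_1_row_per_sec,
       POW(2, 32) / 1000 / (60 * 60 * 24)   AS days_at_1000_rows_per_sec,
       -- BIGINT is 8 bytes, INT is 4 bytes, 40 million rows
       40000000 * (8 - 4) / POW(1024, 2)    AS mb_saved_in_data_file;
```

This comes out at roughly 136 years, roughly 50 days, and roughly 152.6 MB respectively.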

`some_other_id_not_fk` int(10) unsigned NOT NULL,

You state this has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.

This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.

`event_time` datetime NOT NULL,

Do you need a DATETIME? A DATETIME takes 8 bytes, while a TIMESTAMP takes 4 bytes (these are the storage sizes in MySQL versions before 5.6.4). If you can use a TIMESTAMP, this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though, such as the year 2038 problem (Y2K38) and how they behave with respect to time zones and replication.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of the event_time index by 158.8 MB for 40 million rows.

These three changes will significantly reduce the size of your data as well as the indices.
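
A sketch of how the three column changes could be applied in one statement, so the 40-million-row table is rebuilt only once. It assumes that every id fits in INT UNSIGNED, that the 7 distinct values fit in TINYINT UNSIGNED (0 to 255), and that all event_time values fall inside the TIMESTAMP range (1970 to early 2038). None of that is confirmed in the question, so test on a copy first.

```sql
ALTER TABLE `event`
  MODIFY `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  MODIFY `some_other_id_not_fk` TINYINT UNSIGNED NOT NULL,
  -- An explicit DEFAULT without ON UPDATE keeps MySQL from silently
  -- changing event_time whenever a row is updated.
  MODIFY `event_time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
```

The ALTER copies the whole table and rebuilds every index, so expect it to take a long time at this size.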

Total Space Savings

  • Table: 152.5 + 152.5 + 114.4 = 419.4 MB
  • Index: 158.8 + 158.8 + ~115 = 432.6 MB

Total: 852MB

As others have suggested, you may not even need all the indices that you have defined. With such low selectivity on some_other_id_not_fk, there's a good chance the query optimizer won't even use that index and will opt for a full table scan instead. Dropping this index completely would save a significant amount of index space.
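
Whether the optimizer really ignores that index can be checked with EXPLAIN before dropping anything. A small sketch against the schema from the question (the value 4 is just one of the ids mentioned in the thread):

```sql
-- Check whether the optimizer picks the single-column index at all.
EXPLAIN SELECT * FROM `event` WHERE `some_other_id_not_fk` = 4;
-- In the output, key = NULL together with type = ALL means a full table scan.

-- If the index is never chosen, it only costs disk space and insert time:
ALTER TABLE `event` DROP INDEX `event_some_other_id_not_fk`;
```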

If you could provide some sample queries, I can help you further.

Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.
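
To get a feel for whether table locking is part of the insert problem, the server's lock counters can be inspected. This is only a sketch of where to look; the counters are server-wide, not per table.

```sql
-- How often a table lock had to wait versus was granted immediately.
SHOW GLOBAL STATUS LIKE 'Table_locks_waited';
SHOW GLOBAL STATUS LIKE 'Table_locks_immediate';

-- MyISAM can append rows while SELECTs run (concurrent inserts), but by
-- default only when the data file has no holes left by deleted rows.
SHOW VARIABLES LIKE 'concurrent_insert';
```

A Table_locks_waited value that keeps climbing relative to Table_locks_immediate would suggest that readers and writers are queuing behind each other.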

Update

Most people are suggesting moving your some_other_id_not_fk field into the event_time index so the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.

This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk - a full table scan will occur.

Moreover, if your queries are always filtering on both event_time and some_other_id_not_fk then do not use the index order of (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead.

Having the least selective (most duplicates) field first will allow for much greater compression for your index and thus a significantly reduced footprint on disk.
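
A sketch of that index layout, starting from the schema in the question. The index name event_other_id_time is made up. PACK_KEYS = 1 is an optional extra that is not part of the advice above: by default MyISAM prefix-compresses only string keys, and this table option extends packing to numeric keys, usually at the cost of slower updates.

```sql
-- Replace the two single-column indexes with one composite index,
-- low-cardinality column first (for queries that always filter on both).
ALTER TABLE `event`
  DROP INDEX `event_some_other_id_not_fk`,
  DROP INDEX `event_event_time`,
  ADD INDEX `event_other_id_time` (`some_other_id_not_fk`, `event_time`),
  PACK_KEYS = 1;  -- optional: also pack numeric keys
```

Doing everything in one ALTER means a single table rebuild instead of several.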

动听の歌 2024-08-27 01:45:26

I think your intuition on what's heavy and what's not is backwards: an index with many repetitions of a few different options is much worse than an index with lots of distinct values and few repetitions of each.

My suggestion: drop the index on some_other_id_not_fk and keep (some_other_id_not_fk, event_time). This compound index should be 'almost unique', making insert overhead much lower. If possible, drop the event_time key also, unless you have queries that use that field without some_other_id_not_fk.

Edit: you say that you have to select by time interval, so keep (event_time, some_other_id_not_fk) and drop both single-column indexes on event_time and some_other_id_not_fk. If you have queries that use some_other_id_not_fk and not event_time, then keep both (event_time, some_other_id_not_fk) and (some_other_id_not_fk, event_time). The point is not to have any index with few distinct values. Having an index with unused fields on the right is OK.

因为看清所以看轻 2024-08-27 01:45:26

I think you don't need an index on some_other_id_not_fk: as you said, there are just 7 distinct values, so each value matches about 40,000,000/7 rows on average and the index is barely selective. All you need is one index on (event_time + [maybe] some_other_id_not_fk).
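
The selectivity can be measured rather than estimated. A minimal sketch (the aliases are made up, and the query scans the whole table or index, so it will not be quick on 40 million rows):

```sql
SELECT COUNT(DISTINCT `some_other_id_not_fk`)            AS distinct_values,
       COUNT(*)                                          AS total_rows,
       -- close to 0 = many rows per value, close to 1 = nearly unique
       COUNT(DISTINCT `some_other_id_not_fk`) / COUNT(*) AS selectivity,
       COUNT(*) / COUNT(DISTINCT `some_other_id_not_fk`) AS avg_rows_per_value
FROM `event`;
```

With 7 values over 40,000,000 rows, avg_rows_per_value would come out near 5.7 million.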

瑶笙 2024-08-27 01:45:26

I had a similar situation earlier. I created a table with the same structure; let's call it the archive table. Every day at 3:00 I copied the data from the active table into it and then deleted ALL the original rows.

The graphs and other statistics were selected from the archive table, while current events were logged to the active one.

Maybe it is not a best practice, but it worked well enough for me.
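
A rough sketch of that nightly job, assuming a hypothetical archive table named event_archive. The table locks are my addition, so that no row inserted between the copy and the delete gets lost; writers will block while the job runs.

```sql
-- One-time setup: an empty table with the same columns and indexes.
CREATE TABLE `event_archive` LIKE `event`;

-- Nightly job (for example from cron at 3:00).
LOCK TABLES `event` WRITE, `event_archive` WRITE;
INSERT INTO `event_archive` SELECT * FROM `event`;
-- DELETE rather than TRUNCATE: TRUNCATE resets AUTO_INCREMENT, which could
-- later produce ids that collide with rows already in the archive.
DELETE FROM `event`;
UNLOCK TABLES;
```

The active table stays small, so its indexes stay cheap to maintain on insert, while the big indexes live on the archive table that is only written once a day. Whether the id counter really survives the DELETE is worth verifying on your MySQL version.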

Another option is to partition the table by time; see "Partitioning with Dates in MySQL 5.1" (Robin Schumacher):

http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html
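
A sketch of what range partitioning by date could look like for this table, in the style of that article. The partition names and boundaries are made up, and it assumes event_time stays a DATETIME (TO_DAYS is not accepted as a partitioning function on a TIMESTAMP column).

```sql
-- Every unique key must include the partitioning column, so the primary
-- key is widened first; id stays the leading column.
ALTER TABLE `event`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`id`, `event_time`);

ALTER TABLE `event`
  PARTITION BY RANGE (TO_DAYS(`event_time`)) (
    PARTITION p2009h1 VALUES LESS THAN (TO_DAYS('2009-07-01')),
    PARTITION p2009h2 VALUES LESS THAN (TO_DAYS('2010-01-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );

-- MySQL 5.1 syntax: the partitions column shows which partitions are read.
EXPLAIN PARTITIONS
SELECT * FROM `event`
WHERE `event_time` >= '2009-04-01 00:00:00'
  AND `event_time` <  '2009-04-01 01:00:00';
```

Each partition has its own, smaller index files, and old data can be removed with ALTER TABLE ... DROP PARTITION instead of a long-running DELETE.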

╭ゆ眷念 2024-08-27 01:45:26

I've dropped all indexes and created one index on (event_time, some_other_id_not_fk). I get the following performance figures:

  • 1 GB data size on disk, 1.2 GB index size on disk.

  • delete from event where event.event_time>STR_TO_DATE('20091201000000','%Y%m%d%H%i%s') and event.some_other_id_not_fk=4 | Affected rows: 353543
    Time: 65.173 seconds

  • select * from event where event.event_time>STR_TO_DATE('20090401000000','%Y%m%d%H%i%s') and event.event_time<=STR_TO_DATE('20090401010000','%Y%m%d%H%i%s') and event.some_other_id_not_fk in (22,4,1,3) | 916 rows in set, Query time: 0.030 seconds

  • Insertion of 350,000 new records with the index enabled, using the format insert into event VALUES(...),(...),... | performed in about 30 seconds, Yeahaaaaaa :))

  • Index disable, insertion, index enable for 350,000 new records using the same format: insert into event VALUES(...),(...),... | performed in about 40 minutes. :) It looks like the default mysqldump format, which disables indexes before the inserts and re-enables them afterwards, is not always good for performance, especially when large indexes are present :)
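
A likely explanation for the 40 minutes, offered as a sketch rather than a diagnosis: ENABLE KEYS rebuilds the non-unique indexes for all rows in the table, not just for the 350,000 new ones. The rebuild is also much slower when MyISAM cannot sort and falls back to the key cache. The buffer size and the sample rows below are made-up values.

```sql
-- Assumed value; size this to the RAM you can spare.
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;

ALTER TABLE `event` DISABLE KEYS;   -- affects non-unique indexes only

INSERT INTO `event` (`some_other_id_not_fk`, `event_time`, `radius`, `how_heavy`)
VALUES (4,  '2009-12-01 00:00:01', 1.5, 10),
       (22, '2009-12-01 00:00:02', 0.7, 3);   -- hypothetical sample rows

ALTER TABLE `event` ENABLE KEYS;    -- rebuilds those indexes for ALL rows

-- From a second connection while ENABLE KEYS is running:
SHOW PROCESSLIST;   -- State "Repair by sorting" is the fast path,
                    -- "Repair with keycache" is the slow one
```

For a batch that is small relative to the table, inserting with the keys enabled (as in the 30-second test) avoids the full rebuild altogether. The global setting myisam_max_sort_file_size also influences which repair path is chosen.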

For now I am satisfied with this performance.

Last evening I managed to create an index on (event_time) only. The index was slightly smaller than in the first example, about 1.1 GB. The performance of the same queries as listed above:

  • the delete | slightly faster, about 30 seconds
  • the select | slightly slower, about 0.1 seconds.
  • I only tested the index disable-enable insertion of 350,000 records. It was very slow again | about 35 minutes.

    I have rejected this state of the database because I wasn't satisfied enough with the select speed, which is priority no. 1 for me.

hobodave, I am just curious: do you think that putting the index on (some_other_id_not_fk, event_time) instead of (event_time, some_other_id_not_fk) will really change things dramatically for the better? My queries will ALWAYS filter on both fields. I shall NEVER have a query that does not filter by some_other_id_not_fk. But I may have queries that filter with IN(x,y,...) on most of the distinct some_other_id_not_fk values. As I said, there are not many of them.

My priorities are:

  1. select speed
  2. insert speed
  3. index size on disk (as the table will grow several times more)
    ... everything else

I also wonder why such a huge index, 1.2 GB, is needed for 1 GB of data. The index is still larger than the data. My logic tells me that this kind of date indexing could be done with a much smaller index. Am I correct? Is it related to the index type, which is probably BTREE?
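
A back-of-the-envelope estimate suggests these sizes are about what a B-tree over such narrow rows would give. The assumptions are mine, not from the thread: fixed-length rows with a 1-byte row header, a 6-byte row pointer in every index entry (the myisam_data_pointer_size default), no page or fill-factor overhead, and the primary key still in place next to the composite index.

```sql
SELECT 40000000 * (1 + 8 + 4 + 8 + 4 + 2) / POW(1024, 3) AS data_gb,
       -- each index entry = key bytes + row pointer
       40000000 * (8 + 6)     / POW(1024, 3) AS pk_index_gb,
       40000000 * (8 + 4 + 6) / POW(1024, 3) AS composite_index_gb;
```

That gives roughly 1.0 GB of data and roughly 0.5 GB plus 0.7 GB of index. A row is only about 27 bytes, while every index entry repeats the full key plus a pointer, so two indexes can easily outweigh the data. Narrower column types or key packing are the levers that shrink it.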

Thank you. You're all great. I am closing the thread.
