Edit - MySQL: indexing on a large MyISAM table (40 million records) is very slow and huge on disk
The table contains about 40,000,000 records having:
CREATE TABLE `event` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`some_other_id_not_fk` int(10) unsigned NOT NULL,
`event_time` datetime NOT NULL,
`radius` float default NULL,
`how_heavy` smallint(6) default NULL,
PRIMARY KEY (`id`),
KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8
You should know that the some_other_id_not_fk column is not big; it contains only 7 distinct values. The real pain is the event_time datetime column, as it contains an extremely large number of different datetimes, and basically everything is allowed: duplicates, as well as unpredictably large time intervals with no records to 'cover' them. You should also know that (some_other_id_not_fk, event_time) pairs must be allowed to have duplicates too :( I know this causes even more problems :(
I've had some experience optimizing MySQL tables, but such a huge pain has never appeared on my horizon before :/
The current state of 'things' is:
- The selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. :)
- My inserts are slow, I mean really SLOW!!! More than 30 seconds. And even worse: LOAD DATA procedures that temporarily DISABLE and ENABLE KEYS are EXTREMELY slow (several hours), mainly on the ENABLE KEYS operation.
- The size of the index on disk is 7 times bigger than the size of the data.
I would have tried several different combinations of re-indexing by now, but the size of that data really prevents me from experimenting with index and column drop/create at will.
Please help - has anyone managed a problem like this? Would using TIMESTAMP instead of DATETIME solve my problem? Or maybe I should add additional columns for day, year, ... etc. and index on them?
5 Answers
Do you really need a BIGINT? You can probably get away with an INT. If you were to insert 1,000 rows per second 24 hours a day, it would take 136 years for you to exhaust all values in an unsigned 32-bit integer.
This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.
You state this has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.
This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.
Do you need a DATETIME? A DATETIME takes 8 bytes, a TIMESTAMP takes 4 bytes. If you can use a TIMESTAMP then this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though, such as Y2K38 and how they behave with respect to timezones and replication.
This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.
These three changes will significantly reduce the size of your data as well as the indices.
Total Space Savings
Total: 852MB
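The per-row arithmetic behind these figures can be sanity-checked: 4 bytes saved per row for the BIGINT-to-INT and DATETIME-to-TIMESTAMP changes, 3 bytes for INT-to-TINYINT. (The index-side figures of 158.8 MB and 114.4 MB additionally include per-entry overhead, so they are taken from the answer as-is.)

```python
MB = 1024 ** 2
ROWS = 40_000_000

# Bytes saved per row for each proposed column change.
data_savings = {
    "id: BIGINT (8B) to INT (4B)": 4,
    "some_other_id_not_fk: INT (4B) to TINYINT (1B)": 3,
    "event_time: DATETIME (8B) to TIMESTAMP (4B)": 4,
}

for change, saved in data_savings.items():
    print(f"{change}: {ROWS * saved / MB:.1f} MB of table data")

# Adding the quoted index savings (158.8 + 114.4 + 158.8 MB) to the
# data savings gives the total of roughly 852 MB mentioned above.
```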
As others have suggested, you may not even need all the indices that you have defined. With such low selectivity on some_other_id_not_fk, there's a good chance the query optimizer won't even use that index and will instead opt for a full table scan. Dropping this index completely would result in significant space savings for your indices.
If you could provide some sample queries, I can help you further.
Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.
Update
Most people are suggesting moving your some_other_id_not_fk field into the event_time index, so that the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.
This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk - a full table scan will occur.
Moreover, if your queries always filter on both event_time and some_other_id_not_fk, then do not use the index order (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead. Having the least selective (most duplicates) field first allows much greater compression of your index and thus a significantly reduced footprint on disk.
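As a sketch, the two alternatives discussed could be created like this (the index names are illustrative, not from the original thread):

```sql
-- Option A: range scans on event_time alone can still use the index.
ALTER TABLE `event`
  DROP KEY `event_some_other_id_not_fk`,
  DROP KEY `event_event_time`,
  ADD KEY `time_then_id` (`event_time`, `some_other_id_not_fk`);

-- Option B: if every query filters on both columns, put the
-- low-selectivity column first for better MyISAM key compression.
ALTER TABLE `event`
  DROP KEY `time_then_id`,
  ADD KEY `id_then_time` (`some_other_id_not_fk`, `event_time`);
```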
I think your intuition on what's heavy and what's not is backwards: an index with many repetitions of a few different options is much worse than an index with lots of distinct values and few repetitions of each.
My suggestion: drop the index on some_other_id_not_fk and keep (some_other_id_not_fk, event_time). This compound index should be 'almost unique', making insert overhead much lower. If possible, drop the event_time key also, unless you have queries that use that field without some_other_id_not_fk.
edit: you say that you have to select by time interval; then keep (event_time, some_other_id_not_fk) and drop both event_time and some_other_id_not_fk. If you have queries that use some_other_id_not_fk and not event_time, then keep both (event_time, some_other_id_not_fk) and (some_other_id_not_fk, event_time). The point is not to have any index with few options. Having an index with unused fields on the right is OK.
I think you don't need an index on some_other_id_not_fk (as you said, there are just 7 different values, so the selectivity of that index is 40,000,000/7). All you need is one index on (event_time + [maybe] some_other_id_not_fk).
I had a similar situation earlier. I created a table with the same structure; let's call it the archive table. I copied the data from the active table to it every day at 3:00 and then deleted ALL the originals.
The graphs and other statistics were selected from the archive table; the current events were logged to the active one. Maybe it is not a best practice, but it worked sufficiently for me.
Partition table by time: Partitioning with Dates in MySQL 5.1 (Robin Schumacher)
http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html
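A minimal sketch of what the linked approach could look like for this table (boundaries and names are illustrative; note that in MySQL 5.1 every unique key, including the primary key, must contain the partitioning column, so the primary key here had to be widened):

```sql
CREATE TABLE event_by_year (
  id bigint unsigned NOT NULL auto_increment,
  some_other_id_not_fk int unsigned NOT NULL,
  event_time datetime NOT NULL,
  radius float default NULL,
  how_heavy smallint default NULL,
  -- The PK must include event_time because of the partitioning rule.
  PRIMARY KEY (id, event_time),
  KEY soif_time (some_other_id_not_fk, event_time)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(event_time)) (
  PARTITION p2008 VALUES LESS THAN (TO_DAYS('2009-01-01')),
  PARTITION p2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

Range queries on event_time then touch only the matching partitions, and old partitions can be dropped cheaply instead of running large DELETEs.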
I've dropped all indexes and made an index on (event_time, some_other_id_not_fk). I get the following performance indicators: 1 GB data size on disk, 1.2 GB index size on disk.
delete from event where event.event_time > STR_TO_DATE('20091201000000','%Y%m%d%H%i%s') and event.some_other_id_not_fk = 4 | Affected rows: 353543, Time: 65.173 seconds
select * from event where event.event_time > STR_TO_DATE('20090401000000','%Y%m%d%H%i%s') and event.event_time <= STR_TO_DATE('20090401010000','%Y%m%d%H%i%s') and event.some_other_id_not_fk in (22,4,1,3) | 916 rows in set, Query time: 0.030 seconds
Index-enabled insertion of 350,000 new records using the following format: insert into event VALUES(...),(...),... | performed in about 30 seconds, Yeahaaaaaa :))
Index disable - insertion - index enable, for 350,000 new records using the same format: insert into event VALUES(...),(...),... | performed in about 40 minutes. :)
Looks like the MySQL default dump format, which disables indexes before inserts and re-enables them after, is not always good for performance, especially when large indexes are present :) For now I am satisfied with this performance.
Last evening I managed to create an index on (event_time) only. The size of the index was slightly smaller than in the first example, about 1.1 GB. The performance of the same queries as listed above: I've only tested the index disable-enable insertion of 350,000 records. It was very slow again | about 35 minutes.
I have rejected this state of the database, because I wasn't satisfied enough with the select speed, which is priority N1 for me.
hobodave, I am just curious: do you think that making the index on (some_other_id_not_fk, event_time) instead of (event_time, some_other_id_not_fk) will really change something dramatically for the better? My queries will ALWAYS filter on both fields. I shall NEVER have a query without filtering by some_other_id_not_fk. But I may have queries that filter by IN(x,y,...) over most of the distinct some_other_id_not_fk's. As I said, there are not many of them.
My priorities are:
... everything else
And I also wonder why such a huge index size, 1.2 GB, is required for 1 GB of data? The index is still bigger than the data. My logic suggests that this kind of indexing of dates could be done with a much smaller index. Am I correct? Is there something related to the index type, which is probably BTREE?
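A rough back-of-envelope (the per-entry byte counts below are approximations for an uncompressed MyISAM BTREE, not exact figures) suggests an index bigger than the data is plausible here:

```python
MB = 1024 ** 2
ROWS = 40_000_000

# Approximate size of one entry in a MyISAM BTREE index on
# (event_time DATETIME, some_other_id_not_fk INT):
datetime_key = 8   # DATETIME key part
int_key = 4        # INT key part
row_pointer = 6    # default MyISAM data-file pointer size
overhead = 1       # rough per-entry bookkeeping

entry = datetime_key + int_key + row_pointer + overhead  # ~19 bytes
print(f"~{ROWS * entry / MB:.0f} MB of leaf entries alone")
# Internal (non-leaf) blocks and the partially filled pages left by
# random-order inserts push the real file size higher still.
```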
Thank you. You're all great. I am closing the thread.