MongoDB collection used for log data: to index or not?
I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating a file for each hour containing ~400,000 rows (eg. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is receiving inserts.
My question: what is the performance impact of having an index on the timestamp column on the 400,000 hourly inserts, versus the additional time it will take to query an hour's worth of rows?
The application in question is written in Ruby, runs on Heroku, and uses the MongoHQ plugin.
Comments (4)
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an ObjectId in Ruby:
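Something along these lines, using the bson gem that ships with the Ruby driver (BSON::ObjectId#generation_time returns the embedded timestamp as a UTC Time):

```ruby
require 'bson'

id = BSON::ObjectId.new   # e.g. the _id of a freshly inserted document
id.generation_time        # => the Time the id was generated, in UTC
```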
To generate an ObjectId for a given time:
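Presumably with BSON::ObjectId.from_time, e.g.:

```ruby
require 'bson'

# Build an id whose leading four bytes encode the given time; the remaining
# bytes are zeroed, so use it only as a query boundary, never as a real _id.
past_id = BSON::ObjectId.from_time(Time.now.utc - 60 * 60)   # one hour ago
```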
For example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. Through the Ruby driver:
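With the current mongo Ruby driver (2.x) this might look like the following sketch; the connection URL and collection name are placeholders:

```ruby
require 'mongo'

client     = Mongo::Client.new('mongodb://localhost:27017/logs')  # placeholder URL
collection = client[:entries]                                     # placeholder collection

past_id = BSON::ObjectId.from_time(Time.now.utc - 7 * 24 * 60 * 60)  # a week ago
id      = BSON::ObjectId.from_time(Time.now.utc)

# A range scan over _id uses the built-in index, so no extra index is needed.
docs = collection.find('_id' => { '$gt' => past_id, '$lt' => id }).to_a
```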
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
More information on ObjectIds is available in the MongoDB documentation.
I have an application like yours, and currently it has 150 million log records. At 400k an hour, this DB will get large fast. Indexing the timestamp on 400k hourly inserts is far more worthwhile than running unindexed queries. I have no problem inserting tens of millions of records in an hour with an indexed timestamp, yet an unindexed query on the timestamp takes a couple of minutes on a 4-server shard (CPU-bound). An indexed query comes back instantly. So definitely index it; the write overhead of the index is not that high, and 400k records an hour is not much for Mongo.
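For what it's worth, creating that index with the current mongo Ruby driver (2.x) is a one-liner; the connection URL, collection name, and field name timestamp are assumptions here:

```ruby
require 'mongo'

collection = Mongo::Client.new('mongodb://localhost:27017/logs')[:entries]

# Ascending index on the timestamp field; building in the background avoids
# blocking other operations if the collection already holds data.
collection.indexes.create_one({ timestamp: 1 }, background: true)
```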
One thing you do have to look out for is memory size, though. At 400k records an hour you are writing roughly 10 million a day, which consumes about 350MB of memory a day just to keep that index in RAM. So if this goes on for a while, your index can outgrow memory fast.
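As a rough sanity check on that figure (the per-entry size is an assumption, in the ballpark of a B-tree entry for a small key plus overhead):

```ruby
entries_per_day = 400_000 * 24            # ~9.6 million new index entries per day
bytes_per_entry = 38                      # assumed: 8-byte key + record pointer + b-tree overhead
mb_per_day      = entries_per_day * bytes_per_entry / (1024.0 * 1024)
puts mb_per_day.round                     # => 348, i.e. roughly 350MB of index per day
```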
Also, if you are truncating old records with remove after some time period, I have found that the removes generate a large amount of disk IO and the workload becomes disk-bound.
Certainly on every write you will need to update the index data. If you're going to be doing large queries on the data you will definitely want an index.
Consider storing the timestamp in the _id field instead of a MongoDB ObjectId. As long as the timestamps you store are unique, you'll be OK. _id doesn't have to be an ObjectId, and there is an automatic index on _id, so this may be your best bet: you won't add an additional index burden.
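A minimal sketch of that idea with the mongo Ruby driver (2.x); the connection URL and collection name are placeholders, and it assumes your timestamps are unique enough to serve as _id (a collision raises a duplicate-key error):

```ruby
require 'mongo'
require 'date'

collection = Mongo::Client.new('mongodb://localhost:27017/logs')[:entries]

# Store the UNIX timestamp (as a float for sub-second resolution) as _id,
# so the automatic _id index doubles as the time index.
collection.insert_one('_id' => Time.now.utc.to_f, 'payload' => '{"level":"info"}')

# Pull one hour's worth of rows using only the built-in _id index.
today = Date.today
from  = Time.utc(today.year, today.month, today.day, 10).to_f   # today 10:00 UTC
to    = from + 3600                                             # today 11:00 UTC
hour  = collection.find('_id' => { '$gte' => from, '$lt' => to })
```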
I'd just use a capped collection, unindexed, with space for, say, 600k rows to allow for slush. Once per hour, dump the collection to a text file, then use grep to filter out rows that aren't from your target date. This doesn't let you leverage the nice bits of the DB, but it means you never have to worry about collection indexes, flushes, or any of that nonsense. The performance-critical bit is keeping the collection free for inserts, so if you can do the "hard" bit (filtering by date) outside of the DB, you shouldn't see any appreciable performance impact. 400-600k lines of text is trivial for grep and likely won't take more than a second or two.
If you don't mind a bit of slush in each log, you can just dump and gzip the collection. You'll get some older data in each dump, but unless you insert over 600k rows between dumps, you should have a continuous series of log snapshots of 600k rows apiece.
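If you go this route, the capped collection just needs to be created with an explicit size up front; here is a sketch with the mongo Ruby driver (2.x), where the connection URL and collection name are placeholders and the 256MB cap is an arbitrary assumption sized to hold comfortably more than 600k rows:

```ruby
require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/logs')

# Capped collections preserve insertion order and silently overwrite the
# oldest documents once the size limit is reached, so no truncation job
# or extra index maintenance is needed.
client[:entries, capped: true, size: 256 * 1024 * 1024, max: 600_000].create
```

The hourly export itself can then live entirely outside MongoDB, e.g. mongoexport piped through grep as described above, before shipping the file to S3.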