How to handle database purging in MongoDB

Asked 2024-12-28 02:06:11

I use MongoDB to store 30 days of data, which comes to me as a stream. I am looking for a purging mechanism by which I can throw away the oldest data to make room for new data. I used to use MySQL, where I handled this situation with partitions: I kept 30 date-based partitions, dropped the oldest one, and created a new partition to hold the new data.

When I map the same approach onto MongoDB, I feel like using date-based shards. But the problem is that this makes my data distribution bad: if all the new data lands in the same shard, that shard will be very hot, since most users access recent data, while the shards holding older data will see much less load.

I could do collection-based purging: keep 30 collections and drop the oldest one to accommodate new data. But there are a couple of problems: 1) if I make the collections smaller, I cannot benefit much from sharding, since sharding is done per collection; 2) my queries have to change to query all 30 collections and take a union.

Please suggest a good purging mechanism (if any) to handle this situation.

4 Answers

莫多说 2025-01-04 02:06:11

There are really only three ways to do purging in MongoDB. It looks like you've already identified several of the trade-offs.

  1. Single collection, delete old entries
  2. Collection per day, drop old collections
  3. Database per day, drop old databases

Option #1: single collection

pros

  • Easy to implement
  • Easy to run Map/Reduces

cons

  • Deletes are as expensive as inserts, causing lots of IO and the need to "defragment" or "compact" the DB.
  • At some point you end up handling double the "writes", as you have to both insert a day's worth of data and delete a day's worth of data (see the sketch below).
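
For illustration, here is a minimal sketch of such a daily purge job (Python/pymongo; the connection string, database/collection names, and the `created_at` field are assumptions, not from the answer):

    # Option #1 sketch: a single collection, deleting entries older than
    # 30 days. Assumes each document carries a "created_at" datetime.
    from datetime import datetime, timedelta, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["mydb"]["events"]

    # Run once a day (e.g. from cron); each run pays a day's worth of
    # delete IO on top of the day's inserts.
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    result = events.delete_many({"created_at": {"$lt": cutoff}})
    print(f"purged {result.deleted_count} documents")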

Option #2: collection per day

pros

  • Removing data via collection.drop() is very fast.
  • Still Map/Reduce friendly as the output from each day can be merged or re-reduced against the summary data.

cons

  • You may still have some fragmenting problems.
  • You will need to rewrite queries. However, in my experience, if you have enough data that you're purging, you rarely access that data directly; instead you tend to run Map/Reduces over it, so this may not change that many queries (see the sketch below).
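
A minimal sketch of the collection-per-day pattern (Python/pymongo; the `events_YYYYMMDD` naming scheme and all other names are illustrative assumptions):

    # Option #2 sketch: route writes to one collection per day, then drop
    # whole collections once they fall outside the 30-day window.
    from datetime import datetime, timedelta, timezone

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["mydb"]

    def collection_for(day: datetime):
        # YYYYMMDD keeps collection names lexically sortable by date
        return db[f"events_{day:%Y%m%d}"]

    def drop_expired(days_to_keep: int = 30) -> None:
        cutoff = datetime.now(timezone.utc) - timedelta(days=days_to_keep)
        for name in db.list_collection_names():
            if name.startswith("events_") and name < f"events_{cutoff:%Y%m%d}":
                db.drop_collection(name)  # frees a whole day almost instantly

Because YYYYMMDD names sort lexically, a plain string comparison is enough to find the expired collections.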

Option #3: database per day

pros

  • Deletion is as fast as possible, files are simply truncated.
  • Zero fragmentation problems and easy to backup / restore / archive old data.

cons

  • Will make querying more challenging (expect to write some wrapper code; see the sketch below).
  • Not as easy to write Map/Reduces, though take a look at the Aggregation Framework, as that may better satisfy your needs anyway.
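
The same pattern works one level up, as in this sketch (Python/pymongo; the `events_YYYYMMDD` database naming is an illustrative assumption):

    # Option #3 sketch: one database per day; dropping a whole database
    # releases its data files outright, so deletion cost is minimal.
    from datetime import datetime, timedelta, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    def drop_expired_dbs(days_to_keep: int = 30) -> None:
        cutoff = datetime.now(timezone.utc) - timedelta(days=days_to_keep)
        for name in client.list_database_names():
            if name.startswith("events_") and name < f"events_{cutoff:%Y%m%d}":
                client.drop_database(name)

The wrapper code mentioned above would then fan each query out over up to 30 `events_YYYYMMDD` databases and merge the results.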

Now there is an option #4, but it is not a general solution. I know of some people who did "purging" by simply using Capped Collections. There are definitely cases where this works, but it has a bunch of caveats, so you really need to know what you're doing.
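
If a size bound rather than a strict 30-day window is acceptable, option #4 is a one-time setup, as in this sketch (the collection name and the 10 GB cap are illustrative assumptions):

    # Option #4 sketch: a capped collection silently evicts the oldest
    # documents once the byte limit is reached. Caveats: eviction is
    # driven by size, not age, and individual documents cannot simply be
    # deleted from a capped collection.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["mydb"]

    if "events_capped" not in db.list_collection_names():
        db.create_collection("events_capped", capped=True, size=10 * 1024**3)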

寂寞陪衬 2025-01-04 02:06:11

You can set a TTL index on a collection in MongoDB 2.2 or later. This will expire old data from the collection automatically.

Follow this link: http://docs.mongodb.org/manual/tutorial/expire-data/
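
A minimal sketch of the TTL approach (Python/pymongo; the `created_at` field and all names are assumptions):

    # TTL sketch: the server expires documents roughly 30 days after the
    # value in their "created_at" field. Requires MongoDB 2.2+.
    from pymongo import MongoClient

    events = MongoClient("mongodb://localhost:27017")["mydb"]["events"]

    events.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

A background task then removes expired documents periodically, so the delete IO is spread out rather than arriving in daily batches.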

风筝有风,海豚有海 2025-01-04 02:06:11

I had a similar situation and this page helped me out, especially the "Helpful Scripts" section at the bottom. http://www.mongodb.org/display/DOCS/Excessive+Disk+Space

昵称有卵用 2025-01-04 02:06:11

Better to keep one server as an archive.
Do purging at a 15-day interval.
Delete old data from the archive.
Build the archive with more data partitions.
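
A loose sketch of this archive idea, assuming a second MongoDB server acts as the archive (the hosts, names, `created_at` field, and batch-copy approach are all illustrative assumptions):

    # Archive sketch: copy documents older than 15 days to an archive
    # server, then purge them from the live server.
    from datetime import datetime, timedelta, timezone

    from pymongo import MongoClient

    live = MongoClient("mongodb://live-host:27017")["mydb"]["events"]
    archive = MongoClient("mongodb://archive-host:27017")["mydb"]["events"]

    cutoff = datetime.now(timezone.utc) - timedelta(days=15)
    old = {"created_at": {"$lt": cutoff}}
    batch = list(live.find(old))
    if batch:
        archive.insert_many(batch)  # copy a 15-day batch to the archive
        live.delete_many(old)       # then purge it from the live server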
