
Auto compact the deleted space in MongoDB?

Posted 2024-10-09 18:18:32 · 350 characters · 10 views · 0 comments

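The MongoDB documentation says:

To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).

at http://www.mongodb.org/display/DOCS/Excessive+Disk+Space

I wonder how to make MongoDB free deleted disk space automatically?

P.S. We stored many download tasks in MongoDB, up to 20GB, and finished them within half an hour.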


倾城泪 2024-10-16 18:18:32

In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.

So, you should try to provide as much disk-space as possible for the database.

However if you must shrink the database you should keep two things in mind.

1. MongoDB grows its data files by doubling, so the data files may be 64MB, then 128MB, and so on up to 2GB (at which point it stops doubling and keeps each new file at 2GB).
2. As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
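The doubling pattern from the first point can be sketched in a few lines of plain JavaScript. This is only an illustration of the arithmetic, not MongoDB's allocation code; the 64MB starting size and the 2GB ceiling are taken from the answer, and the function name `dataFileSizes` is made up for the example.

```javascript
// Sketch of the growth pattern described above: each new data file is
// twice the size of the previous one until the 2GB ceiling is reached.
// dataFileSizes is a hypothetical helper, not a MongoDB API.
const MB = 1024 * 1024;

function dataFileSizes(fileCount, firstSize = 64 * MB, maxSize = 2048 * MB) {
  const sizes = [];
  let size = firstSize;
  for (let i = 0; i < fileCount; i++) {
    sizes.push(size);
    // Double for the next file, but never beyond the ceiling.
    size = Math.min(size * 2, maxSize);
  }
  return sizes;
}

const sizes = dataFileSizes(8);
const totalMB = sizes.reduce((sum, s) => sum + s, 0) / MB;
console.log(sizes.map((s) => s / MB + "MB").join(", "));
console.log("Total allocated: " + totalMB + "MB");
```

This shows why a database that briefly held a lot of data keeps a large footprint: the later files are large, and deleting documents does not remove them.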

Server-side JavaScript

You can use server-side JavaScript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service).

Assuming a collection called foo, you would save the JavaScript below into a file called bar.js and run:

$ mongo foo bar.js

The JavaScript file would look something like:

// Get the current collection size.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();

print('Storage Size: ' + tojson(storage));

print('TotalSize: ' + tojson(total));

print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');

// Run repair
db.repairDatabase()

// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();

print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));

This will run and return something like ...

MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153

Run this on a schedule (during off-peak hours) and you are good to go.
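To judge whether a scheduled repair is worth its cost, the before and after figures printed by the script can be turned into a reclaimed percentage. A minimal sketch in plain JavaScript, using the numbers from the sample output above; the helper name `reclaimed` is made up for illustration.

```javascript
// Hypothetical helper: how much space did a repair give back?
function reclaimed(beforeBytes, afterBytes) {
  const bytes = beforeBytes - afterBytes;
  return { bytes: bytes, percent: (bytes / beforeBytes) * 100 };
}

// Figures copied from the sample run shown above.
const storage = reclaimed(51351, 40960);
const total = reclaimed(79152, 65153);

console.log("Storage size reclaimed: " + storage.bytes + " bytes (" + storage.percent.toFixed(1) + "%)");
console.log("Total size reclaimed: " + total.bytes + " bytes (" + total.percent.toFixed(1) + "%)");
```

If the percentage stays small run after run, the blocking repair is probably not paying for itself.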

Capped Collections

However there is one other option, capped collections.

Capped collections are fixed sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that. In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.

Basically, you can limit the size of (or number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.

This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.
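The age-out behaviour can be illustrated with a small plain-JavaScript simulation. It does not talk to MongoDB; the class name `CappedBuffer` and the document-count limit are assumptions made for the example. In the mongo shell itself, a capped collection is created with `db.createCollection(name, { capped: true, size: <bytes>, max: <documents> })`.

```javascript
// Plain-JavaScript simulation of FIFO age-out; not a MongoDB API.
// CappedBuffer and its maxDocs limit are illustrative assumptions.
class CappedBuffer {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = [];
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the cap is exceeded, drop the oldest entries first.
    while (this.docs.length > this.maxDocs) {
      this.docs.shift();
    }
  }

  find() {
    // Return a copy; insertion order is preserved.
    return this.docs.slice();
  }
}

const tasks = new CappedBuffer(3);
for (let id = 1; id <= 5; id++) {
  tasks.insert({ taskId: id });
}
console.log(tasks.find().map((d) => d.taskId)); // [ 3, 4, 5 ]
```

For the download-task workload in the question, this means finished tasks would simply fall off the end as new ones arrive, with no shrink job needed, provided losing the oldest entries is acceptable.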


指尖微凉心微凉 2024-10-16 18:18:32

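I have another solution that might work better than db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.

You must be using a replica set.

My thought is that once you've removed all of the excess data that's taking up your disk, you stop a secondary replica, wipe its data directory, start it up, and let it resynchronize with the primary.

The process is time-consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().

Also, this cannot be automated. Well, it could, but I don't think I'm willing to try.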


萤火眠眠 2024-10-16 18:18:32

Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.

As an alternative, if you only have a few collections you actually need to retain, or only want a subset of the data, you can move the data you need to keep into a new database and drop the old one. If you need the same database name, you can then move the data back into a fresh db by the same name. Just make sure you recreate any indexes.

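// Start from an empty scratch database
use cleanup_database
db.dropDatabase();

use oversize_database

// Copy the documents to keep into the scratch database
var target = db.getSiblingDB("cleanup_database");
db.collection.find({},{}).forEach(function(doc){
    target.collection_subset.insert(doc);
});

// Drop the oversized database to release its data files
use oversize_database
db.dropDatabase();

use cleanup_database

// Copy the documents back into a fresh database with the original name
var target = db.getSiblingDB("oversize_database");
db.collection_subset.find({},{}).forEach(function(doc){
    target.collection.insert(doc);
});

use oversize_database

// Add indexes
db.collection.ensureIndex({field:1});

use cleanup_database
db.dropDatabase();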

An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.

Also, as a policy, you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.


咿呀咿呀哟 2024-10-16 18:18:32

If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.

To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.

By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
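The oplog-window caveat is easy to turn into a rough pre-flight check. The sketch below is plain JavaScript with made-up names (`resyncFitsOplogWindow`, `safetyFactor`); the sync rate is something you would have to measure yourself, and the 20GB figure is borrowed from the question. In the mongo shell, the window itself is reported as `timeDiffHours` by `db.getReplicationInfo()`.

```javascript
// Hypothetical helper: is the oplog window comfortably longer than the
// time an initial sync is expected to take? All inputs are estimates.
function resyncFitsOplogWindow(oplogWindowHours, dataSizeGB, syncRateGBPerHour, safetyFactor = 2) {
  const estimatedSyncHours = dataSizeGB / syncRateGBPerHour;
  return {
    estimatedSyncHours: estimatedSyncHours,
    // Require headroom, since sync speed drops under load.
    ok: oplogWindowHours >= estimatedSyncHours * safetyFactor,
  };
}

// Assumed figures: 24h oplog window, 20GB of data, 10GB/hour sync rate.
const check = resyncFitsOplogWindow(24, 20, 10);
console.log(check.ok ? "Safe to resync" : "Oplog window too small", check);
```

If the check fails, the resyncing member could fall off the end of the oplog before it catches up, so the oplog should be enlarged before wiping any data.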

To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:

NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES

#!/bin/bash

# First arg is the host MongoDB is running on, second arg is the MongoDB port

MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath

# Make sure the node we are connecting to is not the primary
while [ "$("$MONGO" --quiet --host "$MONGOHOST" --port "$MONGOPORT" --eval 'db.isMaster().ismaster')" = "true" ]
do
    # rs.stepDown() may drop the connection, so a non-zero exit is tolerated here
    "$MONGO" --quiet --host "$MONGOHOST" --port "$MONGOPORT" --eval 'rs.stepDown()' || true
    sleep 2
done
echo "Node is no longer primary!"

# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t "user@$MONGOHOST" sudo service mongodb stop

# Wipe the data files for that server
# (DBPATH is expanded locally and must match the dbpath on the remote host)
ssh -t "user@$MONGOHOST" sudo rm -rf "$DBPATH"
ssh -t "user@$MONGOHOST" sudo mkdir "$DBPATH"
ssh -t "user@$MONGOHOST" sudo chown mongodb:mongodb "$DBPATH"

# Start up the server again, similar to the shutdown above
ssh -t "user@$MONGOHOST" sudo service mongodb start
