Does MongoDB incur more than 100% storage overhead? That is, I insert 22GB of data and it occupies 50GB on disk

I have done a simple experiment to test MongoDB's performance and disk usage. I inserted 22GB of data, but it occupies 50GB on disk. I will describe the experiment in detail below.

Setup:

  • Version - MongoDB 2.0.2.
  • Environment: 1) Single node without any replication or sharding. 2) VM via VirtualBox. 3) Ubuntu Linux, 64-bit. 4) 100GB fixed virtual disk and 1GB of memory.
  • Language: C# with the MongoDB C# driver.
  • Target and Procedure: Very simple. I just constantly create a new {KEY, VALUE} pair and insert it into MongoDB.
  • Number of insertions = 1024 * 1024 * 1024 / 3 (about 358 million).
  • Size of the KEY = 20 bytes (byte array): a counter incremented by 1 for each insertion, i.e., KEY = {1, 2, 3, ..., 1024*1024*1024/3}.
  • Size of the VALUE = 100 bytes (byte array), randomly generated through the Random class.

Results:

So in this experiment I intended to insert about 40GB of data (120 bytes of payload per insertion) into MongoDB, and I believed it would be simple enough. However, I stopped when the actual inserted data reached 22GB, because I had found the storage overhead issue: the actual data I inserted is about 22GB, but the indexdb.* files together take up 50GB. So there is more than 100% storage overhead.

My own thoughts:

I have read quite a bit of MongoDB's docs. According to what I have read, there might be two kinds of storage overhead.

  1. The oplog. But it is meant to be capped at about 5% of disk space; in my case, that is about 5GB.
  2. Preallocated data files. I didn't change any mongod settings, so I assume 2GB is preallocated in advance. If I further assume that the latest 2GB file in use is nearly empty, that is at most 4GB of overhead in total.
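(For what it's worth, mongod preallocates data files that double in size, indexdb.0 = 64MB, indexdb.1 = 128MB, and so on, capped at 2GB per file, which is why I budget roughly 2GB + 2GB here.)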

So by my calculation, whatever amount of data I insert, there should be at most 9GB of overhead. But the overhead is now 50GB - 22GB = 28GB, and I don't have a clue what is inside those 28GB. If this overhead is always more than 100%, it is quite a lot.
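For reference, this is how I arrive at the 50GB figure: a minimal sketch that sums the sizes of the indexdb.* files in the data directory (the default dbpath /data/db is an assumption matching my VM).

static long dataFileBytes()
{
    // Sum the sizes of indexdb.ns and indexdb.0, indexdb.1, ... on disk.
    var files = new System.IO.DirectoryInfo("/data/db").GetFiles("indexdb.*");
    long total = 0;
    foreach (var f in files)
        total += f.Length;
    return total; // ~50GB at the point where I stopped the experiment
}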

Can anyone please explain it to me?


Here are some MongoDB stats I obtained from the mongo shell.

db.serverStatus() {
"host" : "mongodb-VirtualBox",
"version" : "2.0.2",
"process" : "mongod",
"uptime" : 531693,
"uptimeEstimate" : 460787,
"localTime" : ISODate("2012-01-26T16:32:12.888Z"),
"globalLock" : {
     "totalTime" : 531692893756,
     "lockTime" : 454374529354,
     "ratio" : 0.8545807827977436,
     "currentQueue" : {
          "total" : 0,
          "readers" : 0,
          "writers" : 0
     },
     "activeClients" : {
          "total" : 0,
          "readers" : 0,
          "writers" : 0
     }
},
"mem" : {
     "bits" : 64,
     "resident" : 292,
     "virtual" : 98427,
     "supported" : true,
     "mapped" : 49081,
     "mappedWithJournal" : 98162
},
"connections" : {
     "current" : 3,
     "available" : 816
},
"extra_info" : {
     "note" : "fields vary by platform",
     "heap_usage_bytes" : 545216,
     "page_faults" : 14477174
},
"indexCounters" : {
     "btree" : {
          "accesses" : 3808733,
          "hits" : 3808733,
          "misses" : 0,
          "resets" : 0,
          "missRatio" : 0
     }
},
"backgroundFlushing" : {
     "flushes" : 8861,
     "total_ms" : 26121675,
     "average_ms" : 2947.93759169394,
     "last_ms" : 119,
     "last_finished" : ISODate("2012-01-26T16:32:03.825Z")
},
"cursors" : {
     "totalOpen" : 0,
     "clientCursors_size" : 0,
     "timedOut" : 0
},
"network" : {
     "bytesIn" : 44318669115,
     "bytesOut" : 50995599,
     "numRequests" : 201846471
},
"opcounters" : {
     "insert" : 0,
     "query" : 3,
     "update" : 201294849,
     "delete" : 0,
     "getmore" : 0,
     "command" : 551619
},
"asserts" : {
     "regular" : 0,
     "warning" : 0,
     "msg" : 0,
     "user" : 1,
     "rollovers" : 0
},
"writeBacksQueued" : false,
"dur" : {
     "commits" : 28,
     "journaledMB" : 0,
     "writeToDataFilesMB" : 0,
     "compression" : 0,
     "commitsInWriteLock" : 0,
     "earlyCommits" : 0,
     "timeMs" : {
          "dt" : 3062,
          "prepLogBuffer" : 0,
          "writeToJournal" : 0,
          "writeToDataFiles" : 0,
          "remapPrivateView" : 0
     }
},
"ok" : 1}

db.index.dataSize(): 29791637704

db.index.storageSize(): 33859297120

db.index.totalSize(): 45272200048

db.index.totalIndexSize(): 11412902928
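(Note that totalSize() is exactly storageSize() + totalIndexSize(): 33859297120 + 11412902928 = 45272200048 bytes, so the 45GB reported for the collection is split between allocated data extents and the _id index.)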

db.runCommand("getCmdLineOpts"): { "argv" : [ "./mongod" ], "parsed" : { }, "ok" : 1 }


My code fragment. I have removed the MongoDB connection code and kept only the core here.

static void fillupDb()
{
    for (double i = 0; i < 1024 * 1024 * 1024 / 3; i++)
    {
        // Convert the counter i to a 20-byte array as the KEY
        // (BitConverter.GetBytes(double) yields 8 bytes; the rest stay zero)
        byte[] prekey = BitConverter.GetBytes(i);
        byte[] key = new byte[20];
        prekey.CopyTo(key, 0);

        // Generate a random 100-byte VALUE
        byte[] value = getRandomBytes(100);
        put(key, value);
    }
}

public void put(byte[] key, byte[] value)
{
    BsonDocument pair = new BsonDocument {
        { "_id", key } /* I am using _id as the index */,
        { "value", value }};
    collection.Save(pair);
}


墨洒年华 2025-01-05 04:38:39

Well, first of all: how do you measure the size of your input data? A key-value pair can be two strings or a JSON object.
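For instance, here is a quick way to check what one of your pairs actually costs as BSON (a minimal sketch using the same document shape as your put() method; ToBson() comes with the MongoDB.Bson library):

byte[] key = new byte[20];    // 20-byte KEY
byte[] value = new byte[100]; // 100-byte VALUE

BsonDocument pair = new BsonDocument {
    { "_id", key },
    { "value", value }};

// ToBson() returns the exact serialized document. The 120 payload bytes
// come out to roughly 147 bytes of BSON once the document length prefix,
// the "_id" and "value" field names, and the type/subtype/length bytes of
// the two binary fields are counted -- and that is before any on-disk
// record header or padding.
Console.WriteLine(pair.ToBson().Length);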

Additionally, every document has some padding added to it so that it can grow through subsequent updates. The average padding factor can be retrieved through db.col.stats().paddingFactor.

Finally, there's more than just the oplog that may add to your overhead. There's always an index on _id, which in your case (since your documents are so small) will introduce significant overhead in terms of disk space usage. And unless you disabled it (--nojournal), the journal will add quite a few bytes to the total as well.
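If you prefer to check those numbers from C#, something along these lines should work (a sketch assuming the 1.x driver you are using, a local mongod, and that the database/collection are named indexdb/index, as your file names and shell commands suggest):

var server = MongoServer.Create("mongodb://localhost");
var collection = server.GetDatabase("indexdb").GetCollection("index");

// GetStats() wraps the same collStats command that db.index.stats() runs.
CollectionStatsResult stats = collection.GetStats();
Console.WriteLine("objects:        {0}", stats.ObjectCount);
Console.WriteLine("avg obj size:   {0}", stats.AverageObjectSize); // bytes per document, padding included
Console.WriteLine("padding factor: {0}", stats.PaddingFactor);
Console.WriteLine("total index:    {0}", stats.TotalIndexSize);    // the _id index in your case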

Hope that helps.
