How should I configure my MongoDB cluster?

Posted on 2024-11-19 06:42:55


I am running a sharded MongoDB environment - 3 mongod shards, 1 mongod config server, 1 mongos (no replication).

I want to use mongoimport to import CSV data into the database. I have 105 million records stored in increments of 500,000 across 210 CSV files. I understand that mongoimport is single-threaded, and I have read that I should run multiple mongoimport processes to get better performance. However, I tried that and didn't get a speedup:

When running 3 mongoimport processes in parallel, I was getting ~6k inserts/sec per process (so ~18k inserts/sec total), versus ~20k inserts/sec when running a single mongoimport.

Since these processes were routed through the single mongod config server and mongos, I am wondering whether this is due to my cluster configuration. My question is: if I set up my cluster differently, will I achieve better mongoimport speeds? Do I want more mongos processes? How many mongoimport processes should I fire off at a time?


Comments (2)

未央 2024-11-26 06:42:55


So, the first thing you need to do is "pre-split" your chunks.

Let's assume that you have already sharded the collection to which you're importing. When you start "from scratch", all of the data will start going to a single node. As that node fills up, MongoDB will start "splitting" that node into chunks. Once it gets to around 8 chunks (that's about 8x64MB of index space), it will start migrating chunks.

So basically, you're effectively writing to a single node and then that node is being slowed down because it has to read and write its data to the other nodes.

This is why you're not seeing any speedup with 3 mongoimports. All of the data is still going to a single node, and you're maxing out that node's throughput.

The trick here is to "pre-split" the data. In your case, you would probably set it up so that you get about 70 files worth of data on each machine. Then you can import those files on different threads and get better throughput.
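For concreteness, here is a minimal pymongo sketch of what pre-splitting might look like, run against the mongos. The database/collection name (mydb.records), the numeric shard key (user_id), the split points, and the shard names are all placeholders you would replace with values matching your own data distribution:

```python
# Sketch of pre-splitting a sharded collection with pymongo, run against the
# mongos. Database/collection ("mydb.records"), shard key ("user_id"), split
# points, and shard names are hypothetical placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
admin = client.admin

# Shard the collection on the chosen key (for an empty collection the index
# on the key is created automatically).
admin.command("enableSharding", "mydb")
admin.command("shardCollection", "mydb.records", key={"user_id": 1})

# Create chunk boundaries *before* importing anything, so writes fan out
# instead of all landing on one shard. Pick real split points from your key range.
split_points = [15_000_000, 30_000_000, 45_000_000, 60_000_000, 75_000_000, 90_000_000]
for point in split_points:
    admin.command("split", "mydb.records", middle={"user_id": point})

# Optionally place chunks on specific shards up front rather than waiting for
# the balancer (shard names here are typical defaults, not confirmed).
admin.command("moveChunk", "mydb.records", find={"user_id": 1}, to="shard0000")
admin.command("moveChunk", "mydb.records", find={"user_id": 40_000_000}, to="shard0001")
admin.command("moveChunk", "mydb.records", find={"user_id": 80_000_000}, to="shard0002")
```

With the chunks split (and ideally already placed) ahead of time, each mongoimport writes into a key range that lives on a different shard, so the inserts no longer funnel through a single node.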

Jeremy Zawodny of Craigslist has a reasonable write-up on this here. The MongoDB site has some docs here.

甜宝宝 2024-11-26 06:42:55


I've found a few things help with bulk loads.

Defer building indexes (except for the one you have to have on the shard key) until after you've loaded everything.
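A rough sketch of that ordering (the collection and field names below are hypothetical, not from the post):

```python
# Sketch: create secondary indexes only after the bulk load has finished.
# The collection and field names below are hypothetical.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://mongos-host:27017")
records = client["mydb"]["records"]

# Before the load, only the shard-key index exists (created when the
# collection was sharded).  ... run the mongoimport jobs here ...

# After the load, build whatever secondary indexes your queries need.
records.create_index([("created_at", ASCENDING)])
records.create_index([("email", ASCENDING)])
```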

Run one mongos and mongoimport per shard and load in parallel.
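One way to script that, as a sketch: assume one mongos per shard reachable on localhost:27017-27019 and the 210 CSVs in a local data/ directory; every host, port, name, and path here is a placeholder for your actual setup.

```python
# Sketch: one mongoimport at a time per mongos, with the routers loading in
# parallel. Hosts, ports, database/collection names, and CSV paths are
# placeholders for whatever your setup actually uses.
import subprocess
import threading
from pathlib import Path

MONGOS = ["localhost:27017", "localhost:27018", "localhost:27019"]  # one per shard
csv_files = sorted(Path("data").glob("*.csv"))                      # the 210 CSVs

# Round-robin the files across the routers: ~70 files each.
batches = {host: csv_files[i::len(MONGOS)] for i, host in enumerate(MONGOS)}

def import_batch(host, files):
    # Import this router's files sequentially, one mongoimport process at a time.
    for f in files:
        subprocess.run([
            "mongoimport",
            "--host", host,
            "--db", "mydb",
            "--collection", "records",
            "--type", "csv",
            "--headerline",
            "--file", str(f),
        ], check=True)

threads = [threading.Thread(target=import_batch, args=(h, b)) for h, b in batches.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```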

And the biggest improvement: pre-split your chunks. This is a bit tricky, since you need to figure out how many chunks you will need and roughly how the data is distributed. After you split them, you have to wait for the balancer to move them all around.
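One rough way to tell when the chunks have actually been spread out before kicking off the load is to count chunks per shard in the config database. The query below assumes a pre-5.0 config.chunks schema where each chunk document still carries an "ns" field naming the collection:

```python
# Sketch: count chunks per shard so you can tell when the balancer has
# finished spreading the pre-split chunks around. Assumes a pre-5.0
# config.chunks schema where each chunk document carries an "ns" field.
from collections import Counter
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
chunks = client["config"]["chunks"]

per_shard = Counter(doc["shard"] for doc in chunks.find({"ns": "mydb.records"}))
for shard, count in sorted(per_shard.items()):
    print(f"{shard}: {count} chunks")
```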
