Switching from MySQL to MongoDB: 200 million rows
We are trying to move from MySQL to MongoDB.
The MySQL structure is:
id_src int
id_dest int
unique key: id_src, id_dest
There are about 200 million rows in MySQL.
Example data:
{id_src, id_dest}: {1,2} {1,3} {1,10} {2,3} {2,10} {4,3}
We need to retrieve:
{id_dest, count}: {3,3} {10,2} {2,1}
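Concretely, the query we run on the MySQL side today is essentially this GROUP BY (a sketch, assuming an already-open connection and the same ids table as in insert.php below):

// Roughly the MySQL query we run today to get {id_dest, count}.
$res = mysql_query("SELECT id_dest, COUNT(*) AS cnt FROM ids GROUP BY id_dest ORDER BY cnt DESC");
while ($row = mysql_fetch_assoc($res)) {
    echo $row['id_dest'] . " => " . $row['cnt'] . "\n";
}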
I started by reproducing the MySQL structure in MongoDB.
Insert performance was very good: about 1 hour to insert 200 million rows.
But I needed to use map/reduce to get the group by, and the map/reduce took about 1 hour.
So I tried to create another MongoDB structure:
{id_dest, [id_src1, id_src2, ...]}
Each document can hold hundreds of thousands of id_src values.
Here is my insert.php code:
$res = mysql_unbuffered_query("select * from ids limit 10000100");
while ($tab = mysql_fetch_array($res)) {
    // upsert one document per id_dest and push each id_src onto its "src" array
    $collection->update(array('_id' => (int)$tab['id_dest']),
                        array('$push' => array('src' => (int)$tab['id_src'])),
                        array("upsert" => true));
}
But in that case performance is very bad: only a few updates per second.
Am I doing something wrong?
Comments (1)
First things first, Map/Reduce is not designed for real-time analysis. Additionally, MongoDB is currently limited to one core for M/R, which slows things down even more.
So if you're going to use M/R to get data, it will not be "real-time"; it will be updated every X minutes (or hours).
There are two efficient approaches here:
Option #1: Incremental M/R
For this option you run the M/R once over all of the data. Then, going forward, you only run M/R on the modified data. If you have 200M documents now, you may have 210M documents next (which means that a full M/R gets even slower). But if you only have to run M/R on the new/changed documents, it should take much less than 1 hour.
Take a look at the reduce output option in the docs. Again, the premise is that you only M/R the relevant data and the system "re-reduces" against the existing data. In this way you get an "incremental" M/R.
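Here is a rough sketch of what that could look like from PHP with the legacy driver. The collection names (ids for the raw pairs, dest_counts for the output) and the ts field used to pick out only the new documents are assumptions you would adapt to your own schema:

<?php
// Rough sketch of an incremental map/reduce (legacy PHP driver).
// Assumption: each raw {id_src, id_dest} document also carries a "ts" date
// so we can select only the rows added since the previous run.
$mongo = new Mongo();
$db    = $mongo->selectDB("mydb");

$map = new MongoCode("function() { emit(this.id_dest, 1); }");
$reduce = new MongoCode(
    "function(key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++) { total += values[i]; }
        return total;
    }"
);

$lastRun = new MongoDate(0); // in practice, load the time of the previous run

$db->command(array(
    "mapreduce" => "ids",
    "map"       => $map,
    "reduce"    => $reduce,
    "query"     => array("ts" => array('$gt' => $lastRun)),  // only new documents
    "out"       => array("reduce" => "dest_counts"),         // re-reduce into the existing results
));
?>

Each run then folds just the new pairs into dest_counts instead of recounting all 200M rows.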
Option #2: Real-Time counters
In this method, you have two collections: one for your data, a second for the result of the "summary". When you insert into data, you also do an increment on the summary.
Assume you already have some pairs in the data collection and a matching summary document for each destination. You then receive a new piece of data {src: 5, dest: 2}. You would do two updates: the new pair is inserted into the data collection, and the summary counter for that destination is incremented, so the summary now reads {dest: 2, count: 2}.
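Here is a minimal sketch of those two writes with the legacy PHP driver; the database and collection names (data and summary) are placeholders, and the summary is keyed on the destination id, so {_id: 2, count: 2} plays the role of {dest: 2, count: 2} above:

<?php
// Sketch of the two writes done for each incoming pair (legacy PHP driver).
// Collection names "data" and "summary" are placeholders.
$mongo   = new Mongo();
$db      = $mongo->selectDB("mydb");
$data    = $db->selectCollection("data");
$summary = $db->selectCollection("summary");

// A new piece of data arrives: {src: 5, dest: 2}
$src  = 5;
$dest = 2;

// Update #1: store the raw pair.
$data->insert(array("src" => $src, "dest" => $dest));

// Update #2: bump the real-time counter for this destination.
// If the counter for dest 2 was 1 before, it is 2 afterwards.
$summary->update(
    array("_id" => $dest),
    array('$inc' => array("count" => 1)),
    array("upsert" => true)
);
?>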
Obviously, there are trade-offs here: you need more updates/inserts (2x), but you get real-time counters. Now, there are no transactions in MongoDB, so you'll have to decide on a strategy for ensuring that both updates happen. There are lots of ways to do this which I cannot go into here (see message queues for one method).