mongodb: insert if not exists

Posted 2024-08-31 18:04:48 · 820 words · 5 views · 0 comments

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don't want to have duplicate documents.
  • I don't want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for each document in update:
      existing_document = collection.find_one(document)
      if not existing_document:
           document['insertion_date'] = now
      else:
           document = existing_document
      document['last_update_date'] = now
      collection.save(document)

My problem is that it is very slow (40 minutes for fewer than 100,000 records, and I have millions of them in the update).
I am pretty sure there is something built in for doing this, but the documentation for update() is mmmhhh... a bit terse (http://www.mongodb.org/display/DOCS/Updating).

Can someone advise how to do it faster?

Comments (10)

笑红尘 2024-09-07 18:04:48

Sounds like you want to do an upsert. MongoDB has built-in support for this: pass upsert=True to your update call (in the shell, the option is {upsert: true}). For example:

key = {'key': 'value'}
data = {'$set': {'key2': 'value2', 'key3': 'value3'}}
# In Python, upsert must be passed as a keyword argument.
# update_one() stands in for the legacy update(), which was removed in PyMongo 4.
coll.update_one(key, data, upsert=True)

This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.

Before:

{"key":"value", "key2":"Ohai."}

After:

{"key":"value", "key2":"value2", "key3":"value3"}

You can also limit which fields are written:

data = {"$set":{"key2":"value2"}}

Now the matched document gets a new value for key2 only, and everything else is left untouched.

阿楠 2024-09-07 18:04:48

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.

To turn your pseudocode into a working example:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
for document in update:
    collection.update_one(
        filter={
            '_id': document['_id'],
        },
        update={
            '$setOnInsert': {
                'insertion_date': now,
            },
            '$set': {
                'last_update_date': now,
            },
        },
        upsert=True,
    )
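
With millions of records, one round trip per document may still dominate the run time. The following is a minimal sketch of the same upsert sent in batches, assuming PyMongo 3 or later (where `Collection.bulk_write` and `UpdateOne` exist). The names `build_upsert`, `upsert_in_bulk`, `key_fields` and `batch_size` are hypothetical, and it assumes every incoming document carries fields that identify it uniquely.

```python
from datetime import datetime, timezone


def build_upsert(document, key_fields, now):
    """Return (filter, update) for one incoming document.

    key_fields is an assumption: the field names that identify a record.
    """
    query = {field: document[field] for field in key_fields}
    # Non-key fields are written only when the document is first inserted;
    # on insert, the key fields themselves are taken from the filter.
    on_insert = {k: v for k, v in document.items() if k not in key_fields}
    on_insert["insertion_date"] = now
    update = {
        "$setOnInsert": on_insert,
        "$set": {"last_update_date": now},
    }
    return query, update


def upsert_in_bulk(collection, documents, key_fields, batch_size=1000):
    """Send the upserts in unordered batches instead of one call each."""
    from pymongo import UpdateOne  # imported here so build_upsert stays driver-free

    now = datetime.now(timezone.utc)
    batch = []
    for document in documents:
        query, update = build_upsert(document, key_fields, now)
        batch.append(UpdateOne(query, update, upsert=True))
        if len(batch) >= batch_size:
            collection.bulk_write(batch, ordered=False)
            batch = []
    if batch:  # flush the final partial batch
        collection.bulk_write(batch, ordered=False)
```

A call might look like `upsert_in_bulk(collection, update, ['_id'])`. An index on the key fields is what keeps each lookup cheap; without one, every upsert scans the collection.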
画中仙 2024-09-07 18:04:48

You could always create a unique index, which causes MongoDB to reject a conflicting insert. Consider the following session in the mongodb shell:

> db.getCollection("test").insert({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").createIndex({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13})      // This works
> db.getCollection("test").insert({a:1, b:12, c:13})      // This fails
E11000 duplicate key error index: foo.test.$a_1  dup key: { : 1.0 }
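
From Python, the same idea might look like the sketch below. It assumes PyMongo 3 or later (`create_index`, `insert_one` and `pymongo.errors.DuplicateKeyError`); the function names `ensure_unique_key` and `insert_if_absent` are made up for illustration.

```python
def ensure_unique_key(collection, field):
    """Create an ascending unique index on `field` (1 means ascending)."""
    return collection.create_index([(field, 1)], unique=True)


def insert_if_absent(collection, document):
    """Insert `document`; return False if the unique index rejected it."""
    from pymongo.errors import DuplicateKeyError  # raised for E11000 errors

    try:
        collection.insert_one(document)
    except DuplicateKeyError:
        return False
    return True
```

This keeps duplicates out, but it does not refresh a "last seen" timestamp on documents that already exist; that still needs a separate update or an upsert.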
一曲琵琶半遮面シ 2024-09-07 18:04:48

You can use an upsert with the $setOnInsert operator.

db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})
爱冒险 2024-09-07 18:04:48

Summary

  • You have an existing collection of records.
  • You have a set of records that contain updates to the existing records.
  • Some of the updates don't really update anything; they duplicate what you have already.
  • All updates contain the same fields that are there already, just possibly with different values.
  • You want to track when a record was last changed, counting only changes where a value actually changed.

Note: I'm presuming PyMongo; change to suit your language of choice.

Instructions:

  1. Create the collection with a unique index (unique=true) so you don't get duplicate records.

  2. Iterate over your input records in batches of 15,000 records or so. For each record in the batch, create a dict with the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the ContinueOnError flag set to true, so everything else is still inserted even if there's a duplicate key in the batch (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock; I've gotten 15k/second performance levels. For further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/

    Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now it's time to update the relevant records. Do this with a batch retrieval, which is much faster than one at a time.

  3. Iterate over all your input records again, in batches of 15K or so. Extract the keys (best if there's one key, but it can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2, 3, ... ] } }) query. For each of these records, determine if there's an update, and if so, issue the update, including updating the 'updated' timestamp.

    Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.

Key Optimization Points:

  • Doing the inserts in bulk will vastly speed up your operations.
  • Retrieving records en masse will speed things up, too.
  • Individual updates are the only possible route for now, but 10gen is working on it. Presumably this will be in 2.6, though I'm not sure it will be finished by then; there's a lot of stuff to do (I've been following their Jira system).
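
The batching in steps 2 and 3 can be sketched as follows. This is an illustration under assumptions, not the answer's original code: it assumes PyMongo 3 or later, where `insert_many(..., ordered=False)` plays the role of the older ContinueOnError flag, and the names `chunked`, `insert_new_records`, `fetch_existing` and `key_field` are hypothetical.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_new_records(collection, records, now, batch_size=15000):
    """Step 2: try to insert everything; the unique index rejects duplicates."""
    from pymongo.errors import BulkWriteError  # lazy import keeps chunked() driver-free

    for batch in chunked(records, batch_size):
        docs = [dict(record, created=now, updated=now) for record in batch]
        try:
            # ordered=False keeps inserting after a duplicate-key failure
            collection.insert_many(docs, ordered=False)
        except BulkWriteError as exc:
            # 11000 is the duplicate-key code; anything else is a real failure
            if any(err["code"] != 11000 for err in exc.details["writeErrors"]):
                raise


def fetch_existing(collection, key_field, records, batch_size=15000):
    """Step 3: load stored records with one $in query per batch."""
    for batch in chunked(records, batch_size):
        keys = [record[key_field] for record in batch]
        yield from collection.find({key_field: {"$in": keys}})
```

Comparing each fetched record with its incoming counterpart, and updating only the ones that differ, is left to the caller.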
少年亿悲伤 2024-09-07 18:04:48

I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:

update( { "name": "abc" }, 
        { $set: { "created": "2010-07-14 11:11:11", 
                  "updated": "2010-07-14 11:11:11" }},
        true, true ) 

Scenario #1 - document with 'name' of 'abc' does not exist:
New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.

Scenario #2 - document with 'name' of 'abc' already exists with the following:
'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10.
After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields should be set when inserting and which should be left alone when updating.

My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update on just the 'updated' field.

神回复 2024-09-07 18:04:48

1. Use Update.

Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.

NOTE: This method overrides the entire document when found (From the docs)

var conditions = { name: 'borne' },
    update = { $inc: { visits: 1 } },
    options = { multi: true };

Model.update(conditions, update, options, callback);

function callback(err, numAffected) {
  // numAffected is the number of updated documents
}

1.a. Use $set

If you want to update part of the document, but not the whole thing, you can use the $set operator with update (again, from the docs).
So, if you want to set...

var query = { name: 'borne' };
Model.update(query, { name: 'jason borne' }, options, callback)

Send it as...

Model.update(query, { $set: { name: 'jason borne' } }, options, callback)

This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.

丿*梦醉红颜 2024-09-07 18:04:48

In general, using update is better in MongoDB because, with the upsert option, it will just create the document if it doesn't exist yet, though I'm not sure how to do that with your Python adapter.

Second, if you only need to know whether or not the document exists, count(), which returns only a number, is a better option than find_one, which supposedly transfers the whole document from MongoDB and causes unnecessary traffic.
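
As a rough sketch of that existence check in current PyMongo: this assumes version 3.7 or later, where `count_documents` is available (the older `count()` was removed in PyMongo 4). The helper name `document_exists` is made up for illustration.

```python
def document_exists(collection, query):
    """Return True if at least one document matches `query`."""
    # limit=1 lets the server stop at the first match instead of counting them all
    return collection.count_documents(query, limit=1) > 0
```

Only a number comes back over the wire, so the cost does not depend on document size. It is still one round trip per record, though.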

二手情话 2024-09-07 18:04:48

Method For Pymongo

The Official MongoDB Driver for Python

Some of the time (about 5%) you want to update and overwrite an existing document, and at other times you want to insert a new one. Both are done with updateOne and upsert.

  • 95% (estimated) of the records are unmodified from day to day.

The following solution is taken from this core mongoDB function:

db.collection.updateOne(filter, update, options)

Updates a single document within the collection based on the filter.

In PyMongo this is done with update_one(filter, new_values, upsert=True).

Code Example:

# Import PyMongo's MongoClient
from pymongo import MongoClient

conn = MongoClient('localhost', 27017)
db = conn.databaseName

# Match the document by user_id and question_id
query = {'user_id': '4142480', 'question_id': '2801008'}

# Set the votes field to 1400
new_values = {'$set': {'votes': 1400}}

# update_one() with upsert=True updates the match, or inserts a document if there is none
db.collectionName.update_one(query, new_values, upsert=True)

What does upsert=True do?

  • Creates a new document if no documents match the filter.
  • Updates a single document that matches the filter.
少女净妖师 2024-09-07 18:04:48

I suggest using await now.
