How to delete duplicate documents across multiple fields in MongoDB or PyMongo

Published 2025-01-13 18:47:53

I have billions of documents in a collection, each including a geometry field, like this:
Doc1:

{
    "_id": {
        "$oid": "61ea9daff9a37e64d24099c2"
    },
    "mobile_ad_id": "6122d81b-750b-4cf4-9dc0-d779294f514a",
    "Date": "2021-11-19",
    "Time": "19:50:55",
    "geometry": {
        "type": "Point",
        "coordinates": [72.910606, 19.09972]
    },
    "ipv_4": "103.251.50.0",
    "publisher": "1077c92082522992f0adcd46b31a51eb"
}

Doc2:

{
        "_id": {
            "$oid": "61ea9daff9a37e64d24099c3"
        },
        "mobile_ad_id": "6122d81b-750b-4cf4-9dc0-d779294f514a",
        "Date": "2021-11-19",
        "Time": "19:50:55",
        "geometry": {
            "type": "Point",
            "coordinates": [72.910606, 19.09972]
        },
        "ipv_4": "103.251.51.0",
        "publisher": "1077c92082522992f0adcd46b31a53eb"
    }

I need to find and delete the duplicate documents based on "mobile_ad_id", "Date", "Time", and "geometry".

So instead of the two documents above, I'll keep only one.

I need to run this for billions of entries in the collection, so an optimized solution would be ideal.

Comments (1)

若能看破又如何 2025-01-20 18:47:53

  1. Find the duplicate documents with $group.
  2. $slice the id_List so it keeps only the _ids you actually want to remove.
  3. Use $limit to process only part of the duplicate groups per aggregation.
  4. Delete that batch of documents (see the deleteMany call after the pipeline), and verify they are removed correctly.
  5. Once you know how long these operations take, and the time spent is acceptable to you, you can increase the $limit.
db.collection.aggregate([
  {
    // Group documents that share all four fields and collect their _ids
    $group: {
      _id: {
        mobile_ad_id: "$mobile_ad_id",
        Date: "$Date",
        Time: "$Time",
        geometry: "$geometry"
      },
      id_List: { $push: "$_id" },
      count: { $sum: 1 }
    }
  },
  {
    // Keep only groups that actually contain duplicates
    $match: { count: { $gt: 1 } }
  },
  {
    // Drop the last _id from each list so one document per group survives
    $set: {
      id_List: { $slice: [ "$id_List", { $subtract: [ { $size: "$id_List" }, 1 ] } ] }
    }
  },
  {
    // Process a small batch of groups at a time
    $limit: 1000
  }
])

mongoplayground


// For each group the aggregation returns, delete its id_List (deleteMany replaces the deprecated remove)
db.collection.deleteMany( { _id: { $in: id_List } } )
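
Since the question also asks about PyMongo, here is a minimal sketch of the same batched approach in Python. It is an illustration under assumptions: the connection string and the database/collection names are placeholders, and allowDiskUse=True is added because a $group over billions of documents will exceed the aggregation memory limit and needs to spill to disk.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["mydb"]["mycollection"]              # placeholder database/collection names

pipeline = [
    # Group documents that share all four fields and collect their _ids
    {"$group": {
        "_id": {
            "mobile_ad_id": "$mobile_ad_id",
            "Date": "$Date",
            "Time": "$Time",
            "geometry": "$geometry",
        },
        "id_List": {"$push": "$_id"},
        "count": {"$sum": 1},
    }},
    # Keep only groups that actually contain duplicates
    {"$match": {"count": {"$gt": 1}}},
    # Drop the last _id from each list so one document per group survives
    {"$set": {"id_List": {"$slice": ["$id_List", {"$subtract": [{"$size": "$id_List"}, 1]}]}}},
    # Batch size: raise it once the per-batch time is acceptable
    {"$limit": 1000},
]

# allowDiskUse lets the $group stage spill to disk on very large collections
for group in coll.aggregate(pipeline, allowDiskUse=True):
    coll.delete_many({"_id": {"$in": group["id_List"]}})

Re-run the loop until the aggregation returns no more groups; each pass deletes the duplicates from up to 1000 groups while leaving one document per group in place.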

I think you are working with IoT devices. Maybe you don't need to remove the duplicates at all; if some query is bothering you and performing badly because of the duplicate documents, you can share it with me.
