How to delete duplicate documents on multiple fields in MongoDB or PyMongo
I have billions of documents in a collection, each with a geometry field, like this:
Doc1:
{
"_id": {
"$oid": "61ea9daff9a37e64d24099c2"
},
"mobile_ad_id": "6122d81b-750b-4cf4-9dc0-d779294f514a",
"Date": "2021-11-19",
"Time": "19:50:55",
"geometry": {
"type": "Point",
"coordinates": [72.910606, 19.09972]
},
"ipv_4": "103.251.50.0",
"publisher": "1077c92082522992f0adcd46b31a51eb"
}
Doc2:
{
"_id": {
"$oid": "61ea9daff9a37e64d24099c3"
},
"mobile_ad_id": "6122d81b-750b-4cf4-9dc0-d779294f514a",
"Date": "2021-11-19",
"Time": "19:50:55",
"geometry": {
"type": "Point",
"coordinates": [72.910606, 19.09972]
},
"ipv_4": "103.251.51.0",
"publisher": "1077c92082522992f0adcd46b31a53eb"
}
I need to find and delete the duplicate documents based on "mobile_ad_id", "Date", "Time", and "geometry".
So instead of two documents I'll keep only one.
I need to run this for billions of entries in the collection, so an optimized solution would be ideal.
1 Answer
Process the collection in batches:

1. $group to find the duplicate documents.
2. $slice the id list and keep only the _ids that you actually want to remove.
3. $limit to get part of the data per aggregation pass.
4. Remove that part of the data from the previous step. Also make sure those documents are being removed correctly.
5. Make the $limit number bigger once you know how long each pass takes and the time spent is acceptable to you.

You can experiment with the pipeline on mongoplayground.

I think you are working on IoT devices. Maybe you don't need to remove duplicates at all: if there is some query bothering you whose performance is bad due to the duplicate documents, you can share it with me.
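The batched steps above can be sketched in PyMongo roughly as follows. This is a non-authoritative sketch: the database/collection names (`mydb`, `ads`) and the batch size are assumptions, and `allowDiskUse=True` is used because a $group over billions of documents will exceed the in-memory stage limit.

```python
DUP_KEYS = ["mobile_ad_id", "Date", "Time", "geometry"]

def build_dedup_pipeline(batch_size):
    """Aggregation pipeline that returns, per pass, up to `batch_size`
    duplicate groups, each listing the _ids to delete (one survivor kept)."""
    return [
        # Group on the duplicate key and collect every _id in the group.
        {"$group": {
            "_id": {k: f"${k}" for k in DUP_KEYS},
            "ids": {"$push": "$_id"},
            "count": {"$sum": 1},
        }},
        # Keep only groups that actually contain duplicates.
        {"$match": {"count": {"$gt": 1}}},
        # $slice off the first _id so one document per group survives.
        {"$project": {"dup_ids": {"$slice": ["$ids", 1, {"$size": "$ids"}]}}},
        # Bound the work done per pass; raise once the timing is acceptable.
        {"$limit": batch_size},
    ]

def remove_duplicates(coll, batch_size=1000):
    """Delete duplicates batch by batch; returns the total number removed."""
    removed = 0
    while True:
        groups = coll.aggregate(build_dedup_pipeline(batch_size),
                                allowDiskUse=True)
        dup_ids = [i for g in groups for i in g["dup_ids"]]
        if not dup_ids:
            return removed
        removed += coll.delete_many({"_id": {"$in": dup_ids}}).deleted_count

if __name__ == "__main__":
    # Connection string and names are placeholders for your deployment.
    from pymongo import MongoClient
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["ads"]
    print("removed", remove_duplicates(coll, batch_size=1000))
```

Deleting by `_id` in batches keeps each `delete_many` bounded and restartable; if the process is interrupted, rerunning it simply picks up the remaining duplicate groups.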