Identifying duplicates in CouchDB

Posted on 2024-12-29 07:00:44

I'm new to CouchDB and document-oriented databases in general.

I've been playing around with CouchDB, and was able to get familiar with creating documents (with Perl) and using the Map/Reduce functions in Futon to query the data and create views.

One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.

For example, if I have the following documents:

{
  "_id": "123",
  "name": "carl",
  "timestamp": "2012-01-27T17:06:03Z"
}

{
  "_id": "124",
  "name": "carl",
  "timestamp": "2012-01-27T17:07:03Z"
}

If I wanted to get a list of document IDs that have duplicate "name" values, is this something I could do with Futon's Map/Reduce?

The result I was hoping to achieve is as follows:

{
  "name": "carl",
  "dupes": [ "123", "124" ]
}

..or..

{
  "carl": [ "123", "124" ]
}

.. which would be the value, along with the IDs of the documents that contain that duplicate value.

I've tried a few different things with Map/Reduce, but as far as I understand, the Map function works with data on a per-document basis, and the Reduce function only allows you to work with the keys/values from a given document.

I know I could just pull the data I need with Perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand its benefits / limitations.

Another way I'm thinking about doing this is to use a single document like an RDBMS table:

{
  "_id": "names",
  "rec1": {
    "_id": "123",
    "name": "carl",
    "timestamp": "2012-01-27T17:06:03Z"
  },
  "rec2": {
    "_id": "124",
    "name": "carl",
    "timestamp": "2012-01-27T17:07:03Z"
  }
}

.. which should allow me to use the Map/Reduce functions in the way I originally thought. However, I'm not sure if this is ideal.

I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.

Thanks!

Edit: Fixed JSON syntax in some of the examples.

春风十里 2025-01-05 07:00:44

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.

In both cases, a map function like this should suffice:

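function (doc) {
    // Skip documents that have no name, so they do not end up under an empty key.
    if (doc.name) {
        // The value is unused: _count only counts the rows emitted per key.
        emit(doc.name, null);
    }
}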

For your reduce function, just enter _count.

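When queried with group=true (so that the reduce is applied per key rather than across all rows), your view output will look like this, based on your 2 documents: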

{
    "rows": [
        { "key": "carl", "value": 2 }
    ]
}

From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.

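function (head, req) {
    var row, duplicates = [];
    // getRow() returns the next view row, or null once the rows are exhausted.
    while ((row = getRow())) {
        // With the _count reduce, row.value is the number of documents sharing row.key.
        if (row.value > 1) {
            duplicates.push(row);
        }
    }
    send(JSON.stringify(duplicates));
}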
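
If you would rather take the "filter it yourself" route, the same check is short on the client side. This is a minimal sketch, assuming the view response has already been fetched and parsed into an object shaped like the output above; the helper name `findDuplicateNames` and the "dave" row are made up for illustration.

```javascript
// Hypothetical helper: returns the keys whose count is greater than 1.
// `viewResult` is assumed to be the parsed JSON of the grouped view output.
function findDuplicateNames(viewResult) {
    return viewResult.rows
        .filter(function (row) { return row.value > 1; })
        .map(function (row) { return row.key; });
}

// Sample response; the "dave" row is invented for illustration.
var sample = {
    rows: [
        { key: "carl", value: 2 },
        { key: "dave", value: 1 }
    ]
};

console.log(findDuplicateNames(sample)); // [ 'carl' ]
```

Like the _list version, this only tells you which names are duplicated, not which documents carry them.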

Read up on _list functions; they're pretty handy and versatile.
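
The count-based output above does not include the document IDs the question asks for. One possible way to get them is to run a list function over the un-reduced rows (querying the view with reduce=false), since each map row carries the `id` of the document that emitted it. The following is a sketch under that assumption, not something tested against a live CouchDB: the name `dupesById`, the `getRow` and `send` stand-ins, and the third document are all invented so the function can be run standalone.

```javascript
// List function sketch: groups document IDs by emitted key and keeps
// only the keys that were emitted by more than one document.
// Assumes the view is queried with reduce=false so each row has an `id`.
var dupesById = function (head, req) {
    var row, name;
    // Null-prototype objects avoid clashes with inherited property names.
    var idsByName = Object.create(null), dupes = Object.create(null);
    while ((row = getRow())) {
        (idsByName[row.key] = idsByName[row.key] || []).push(row.id);
    }
    for (name in idsByName) {
        if (idsByName[name].length > 1) {
            dupes[name] = idsByName[name];
        }
    }
    send(JSON.stringify(dupes));
};

// --- Stand-ins for CouchDB's getRow() and send(), for local experimentation only ---
var rows = [
    { id: "123", key: "carl", value: null },
    { id: "124", key: "carl", value: null },
    { id: "125", key: "dave", value: null } // invented third document
];
var output = "";
function getRow() { return rows.shift(); }
function send(chunk) { output += chunk; }

dupesById(null, {});
console.log(output); // {"carl":["123","124"]}
```

In a design document only the function itself would be stored; the stand-ins exist purely for the local run. Note that this version buffers every name in memory before sending anything, so it is best suited to modest data sets.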
