View collation in CouchDB to see "merged" output

Posted on 2024-11-08 18:00:20


When doing a "join" in CouchDB, you can use view collation to group the records together. For example, having two document types, customers and orders, you can return a customer, then all the orders for that customer, then the next customer, and its orders.

The question is, how do you merge rows so that if you have 10 customers and 40 orders, your output is still 10 rows instead of 50? You essentially add more information into each customer row.

I believe using a _list or a reduce will solve this. The question is how exactly to do this?
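To pin down the target output, the merging a _list or reduce would do on the server can first be sketched client-side. This is a minimal sketch; the [name, 1] / [name, 2] key scheme is an assumption here, matching the usual collation setup for this kind of join:

```javascript
// Collapse collated view rows into one row per customer.
// Assumes keys of the form [name, 1] for customer rows and
// [name, 2] for order rows -- the key scheme is an assumption,
// matching the usual collation setup for this kind of join.
function mergeRows(rows) {
  var merged = [], current = null;
  rows.forEach(function(row) {
    if (row.key[1] === 1) {
      // Customer row: start a new merged entry.
      current = { customer: row.key[0], orders: [] };
      merged.push(current);
    } else if (row.key[1] === 2 && current) {
      // Order row: attach it to the current customer.
      current.orders.push(row.value);
    }
  });
  return merged;
}

// 2 customers + 3 orders in -> 2 merged rows out.
var out = mergeRows([
  { key: ["Jason", 1], value: "" },
  { key: ["Jason", 2], value: "Hat" },
  { key: ["Jason", 2], value: "Shoes" },
  { key: ["Smith", 1], value: "" },
  { key: ["Smith", 2], value: "Pan" }
]);
```

The answers below do this same collapsing either in a _list function (streaming, no client work) or in a reduce (with caveats).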


Comments (2)

野却迷人 2024-11-15 18:00:20


I second jhs's answer, but I think his "Option 2" is too dangerous. I learned that the hard way. You can use a reduce function for many nice things, like getting the last post of each user of a blog, but you cannot use it for anything that does not reduce the amount of data returned.

To support it with facts, I made this little script to generate 200 customers with 20 orders each.

#!/bin/bash
echo '{"docs":['
for i in $(seq 1 200); do
  id=customer/$i
  echo '{"_id":"'$id'","type":"customer","name":"Customer '$i'"},'
  for o in $(seq 1 20); do
    echo '{"type":"order","_id":"order/'$i'/'$o'", "for":"'$id'", "desc":"Desc '$i$o'"},'
  done
done
echo ']}'

It is a very likely scenario, and it is enough to throw an Error: reduce_overflow_error.

IMHO, the two options you have are:

Option 1: Optimized list function

With a little bit of work, you can build the JSON response by hand, so that you do not need to accumulate orders in an array.

I have edited jhs's list function to avoid any use of arrays, so you can have customers with any number of orders.

function(head, req) {
  start({'headers':{'Content-Type':'application/json'}});

  var first_customer = true
    , first_order = true
    , row
    ;

  send('{"rows":[');

  while(row = getRow()) {
    if(row.key[1] === 2) {
      // Order for customer
      if (first_order) {
        first_order = false;
      } else {
        send(',');
      }
      send(JSON.stringify(row.value));
    }
    else if (row.key[1] === 1) {
      // New customer
      if (first_customer) {
        first_customer = false;
      } else {
        send(']},');
      }
      send('{"customer":');
      send(JSON.stringify(row.key[0]));
      send(',"orders":[');
      first_order = true;
    }
  }
  if (!first_customer)
    send(']}');

  send('\n]}');
}

Option 2: Optimize the documents for your use case

If you really need to have the orders in the same document, then ask yourself if you can store it this way and avoid any processing while querying.

In other words: try to fully exploit the possibilities offered by a document database. Design the documents to best fit your use case, and reduce the post-processing needed to use them.
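As an illustration of this option (the field names here are hypothetical, not taken from the original post), a customer document with its orders embedded lets a single GET replace the view join entirely:

```javascript
// A customer document with its orders embedded, so one GET
// replaces the view join. Field names here are illustrative,
// not taken from the original post.
var customerDoc = {
  _id: "customer/1",
  type: "customer",
  name: "Customer 1",
  orders: [
    { desc: "Hat" },
    { desc: "Shoes" }
  ]
};
// Fetching this single document yields the customer and all orders at once.
```

The trade-off is that updating one order means writing the whole customer document, so this fits best when orders are small and mostly read together.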

紫竹語嫣☆ 2024-11-15 18:00:20


One of CouchDB's main "opinions" is that it only does things that are also possible in a distributed, clustered setting. In practice, this means some inconvenience at the beginning, with a pay-off later of big scalability without changing the code.

In other words, there is no perfect answer to the "join" question. But I think there are two pretty good options.

I'm working with this data set:

$ curl localhost:5984/so/_bulk_docs -XPOST -Hcontent-type:application/json -d @-
{"docs":[
{"type":"customer","name":"Jason"},
{"type":"customer","name":"Hunter"},
{"type":"customer","name":"Smith"},
{"type":"order", "for":"Jason", "desc":"Hat"},
{"type":"order", "for":"Jason", "desc":"Shoes"},
{"type":"order", "for":"Smith", "desc":"Pan"}
]}
^D

[{"id":"4cb766ebafda06d8a3a7382f74000b46","rev":"1-8769ac2fffb869e795c347e7b8c653bf"},
{"id":"4cb766ebafda06d8a3a7382f74000b7d","rev":"1-094eff3e3a5967d974fcd7b3cfd7e454"},
{"id":"4cb766ebafda06d8a3a7382f740019cb","rev":"1-5cda0b61da4c045ff503b57f614454d5"},
{"id":"4cb766ebafda06d8a3a7382f7400239d","rev":"1-50642a9809f15283a9d938c8fe28ef27"},
{"id":"4cb766ebafda06d8a3a7382f74002778","rev":"1-d03d883fb14a424e3db022350b38c510"},
{"id":"4cb766ebafda06d8a3a7382f74002c5c","rev":"1-e9612f5d267a8442d3fc2ae09e8c800d"}]

And my map function is

function(doc) {
  if(doc.type == 'customer')
    emit([doc.name, 1], "");
  if(doc.type == 'order')
    emit([doc.for, 2], doc.desc);
}

Querying the full view shows:

{"total_rows":6,"offset":0,"rows":[
{"id":"4cb766ebafda06d8a3a7382f74000b7d","key":["Hunter",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f74000b46","key":["Jason",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f7400239d","key":["Jason",2],"value":"Hat"},
{"id":"4cb766ebafda06d8a3a7382f74002778","key":["Jason",2],"value":"Shoes"},
{"id":"4cb766ebafda06d8a3a7382f740019cb","key":["Smith",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f74002c5c","key":["Smith",2],"value":"Pan"}
]}
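The ordering above comes from CouchDB's view collation: array keys compare element by element, so ["Jason", 1] sorts before ["Jason", 2], which keeps each customer row immediately ahead of its order rows. A simplified comparator sketch (covering only the [string, number] keys used here; real CouchDB collation uses ICU and a full cross-type ordering):

```javascript
// Simplified collation for keys of the form [string, number].
// Real CouchDB collation uses ICU and a full cross-type ordering;
// this only shows why ["Jason",1] < ["Jason",2] < ["Smith",1].
function compareKeys(a, b) {
  if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1; // customer name first
  return a[1] - b[1];                             // then 1 (customer) before 2 (order)
}

var keys = [["Smith", 2], ["Jason", 2], ["Hunter", 1], ["Jason", 1], ["Smith", 1]];
keys.sort(compareKeys);
// Sorted: Hunter/1, Jason/1, Jason/2, Smith/1, Smith/2
```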

Option 1: List function to gather results

The benefit is, if you ask for 10 rows, you will definitely get 10 (unless there isn't enough data of course).

But the cost is you have to do server-side processing for every query. The data you want was just sitting there on disk, ready to stream out to you, but now you've put it through this bottleneck.

However, I personally feel that unless you have a demonstrated performance issue, the _list is nice.

function(head, req) {
  start({'headers':{'Content-Type':'application/json'}});

  send('{"rows":');

  var customer = null, orders = [], count = 0;

  var prefix = '\n[ ';
  function show_orders() {
    if(customer && orders.length > 0) {
      count += 1;

      send(prefix);
      prefix = '\n, ';

      send(JSON.stringify({'customer':customer, 'orders':orders}));
    }
  }

  function done() {
    send('\n]}');
  }

  var row;
  while(row = getRow()) {
    if(row.key[1] == 2) {
      // Order for customer
      orders.push(row.value);
    }

    if(row.key[1] == 1) {
      // New customer
      show_orders();

      if(req.query.lim && count >= parseInt(req.query.lim)) {
        // Reached the limit
        done();
        return;
      } else {
        // Prepare for this customer.
        customer = row.key[0];
        orders = [];
      }
    }
  }

  // Show the last order set seen and finish.
  show_orders();
  done();
}

This function simply loops through the map rows and only outputs the full customer+orders row once all the information is collected. Obviously you can change the JSON format that you output. Also, there is a ?lim=X parameter, because using the parameter limit would interfere with the map query.

The danger is that this function builds an unlimited response in memory. What if a customer made 10,000 orders? Or 100,000? Eventually building the orders array will fail. That is why CouchDB keeps them in a "tall" list. If you won't ever get 10,000 orders per customer, then this isn't a problem.

$ curl 'http://localhost:5984/so/_design/ex/_list/ex/so?reduce=false&lim=2'
{"rows":
[ {"customer":"Jason","orders":["Hat","Shoes"]}
, {"customer":"Smith","orders":["Pan"]}
]}

Option 2: A sneaky reduce

You can do something similar with a reduce function. Right here, I'll warn you that this is technically not scalable because you accumulate a response on disk; however, I personally prefer it to _list because the code is simpler and I know I am reading data directly off disk, without post-processing.

function(keys, vals, rereduce) {
  var a;

  // On a rereduce pass, keys is null: the values are the
  // partial arrays built below, so just concatenate them.
  if (rereduce) {
    var merged = [];
    for(a = 0; a < vals.length; a++)
      if(vals[a]) merged = merged.concat(vals[a]);
    return merged;
  }

  // If all keys are the same, then these are all
  // orders for the same customer, so accumulate
  // them. Otherwise, return something meaningless.
  var first_customer = keys[0][0][0];
  for(a = 0; a < keys.length; a++)
    if(keys[a][0][0] !== first_customer)
      return null;

  var result = [];
  for(a = 0; a < vals.length; a++)
    if(vals[a]) {
      // Accumulate an order.
      result.push(vals[a]);
    }
  return result;
}

Always query this view with ?group_level=1 which will segment the results by customer (because the customer name was the first item in the map key).
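For intuition, ?group_level=1 truncates each key array to its first element and runs the reduce once per distinct prefix. A client-side sketch of the equivalent grouping (note: the order of values inside a group is not guaranteed by a real reduce):

```javascript
// Emulate ?group_level=1 on the map output shown above:
// group rows by key[0] and collect non-empty order values.
function groupLevel1(rows) {
  var groups = {}, order = [];
  rows.forEach(function(row) {
    var prefix = row.key[0];
    if (!(prefix in groups)) {
      groups[prefix] = [];
      order.push(prefix);
    }
    if (row.value) groups[prefix].push(row.value); // "" customer rows add nothing
  });
  return order.map(function(name) {
    return { key: [name], value: groups[name] };
  });
}

var grouped = groupLevel1([
  { key: ["Hunter", 1], value: "" },
  { key: ["Jason", 1], value: "" },
  { key: ["Jason", 2], value: "Hat" },
  { key: ["Jason", 2], value: "Shoes" },
  { key: ["Smith", 1], value: "" },
  { key: ["Smith", 2], value: "Pan" }
]);
// One row per customer, e.g. { key: ["Jason"], value: ["Hat", "Shoes"] }
```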

This is against the law because you are not supposed to accumulate data during a reduce phase. That's why they call it reduce.

However, CouchDB is relaxed, and as long as you don't build giant lists, it should work and it's much more elegant.

$ curl 'localhost:5984/so/_design/ex/_view/so?group_level=1&limit=3'
{"rows":[
{"key":["Hunter"],"value":[]},
{"key":["Jason"],"value":["Shoes","Hat"]},
{"key":["Smith"],"value":["Pan"]}
]}

Good luck!
