CouchDB - 每组的前 N 个文档
我目前正在通过我们在 Web 项目中遇到的几个常见用例来评估 CouchDB。
这些用例之一如下:
考虑一个包含(人为示例)的系统:
- 文章
- 问题
- 主题
文章和问题可以分配给多个主题。
主题有自己的页面(想想 http://www.quora.com 主题)。
是否可以通过 couchdb 的 1 个查询同时获取:
- 有关主题 X 的最新 N 篇文章
- 和有关主题 X 的最新 N(或 M?)问题
更通用的术语:我正在寻找一种按类型进行分组的方法(在本例中, type = 'article' 或 'question' )并且对于每个组,返回前 n 个文档,给定特定的排序(在本例中排序是按时间倒序排列),并限制于特定的过滤器(在本例中为主题) 'X')
从我读到的内容来看,从性能的角度来看,并行执行多个 couchdb 查询通常没什么大不了的,但我只是好奇这个(对于我们经常使用的)用例是否可以在一个单个请求。
感谢您的任何见解
I'm evaluating CouchDB at the moment, by walking along a couple of common use-cases we will encounter in our webproject.
One of these use-cases is the following:
Consider a system containing (contrived example):
- articles
- questions
- topics
articles and questions can be assigned to multiple topics.
A topic has it's own page (think of http://www.quora.com topics).
Is it possible with 1 query from couchdb to get BOTH:
- the latest N articles on topic X
- AND the latest N (or M?) questions on topic X
In more generic terms: I'm looking for a way to do a group by type (where, in this case, type = 'article' or 'question' ) and for each group return the top n documents given a certain sort (in this case sort is reverse chronological) constrained to a specific filter (in this case the topic 'X')
From what I've read, it's often not that big a deal to do multiple couchdb-queries in parallel, from a performance standpoint, but I'm just curious if this (for us often used ) use-case can be done in a single request.
Thanks for any insight
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
不会。CouchDB
视图是一维的。对于给定主题,最新的文章和最新的问题是二维查询,因此不可能在一个 HTTP 请求中实现。
关于解决方法的思考
CouchDB 是为并发查询而设计的,并鼓励并发查询。在生产中,我会同时根据其他答案进行两个查询。 (在 Javascript 中,这非常简单,但任何异步或线程编程语言都可以做到。)
接收两个结果的响应时间将只是较长结果的响应时间(即,首先完成的是“免费”)。您甚至可以遍历两个响应的行,以 O(1) 空间和 O(n) 时间合并它们的时间线 - 还不错!
CouchDB 唯一不保证的是两个查询代表完全相同的数据库状态的快照。您提到了 Quora,这是现代数据库需求的完美示例。 理论上,您不知道这两个查询之间数据库状态发生了多少变化。一般来说,你不知道一种观点与另一种观点相比是否有意义。 在实践中,答案是显而易见的:谁在乎?实际上,仅以毫秒为间隔的查询在一起就非常有意义。这就是为什么 CouchDB 非常适合 Web 应用程序,尽管其功能集受到严格限制。
替代解决方案:GeoCouch
GeoCouch 扩展实际上是一个通用的二维边界框查询引擎。此外,显然,地理空间数据还可以用于查询存储为
timestamp
xseverity
2-space 的日志。然而,它目前仍然是一个独立于 CouchDB 的项目,所以我不愿意将其称为“CouchDB 查询”。No.
CouchDB views are 1-dimensional. For a given topic, the most recent articles AND the most recent questions is a two-dimensional query and thus not possible in one HTTP request.
Thoughts on a workaround
CouchDB is architected for, and encourages concurrent queries. In production, I would make two queries from my other answer concurrently. (In Javascript, this is very easy, but any asynchronous or threaded programming language can do it.)
The response time to receive both results will only be the response time of the longer result (i.e. the one that finishes first was "free"). You can even walk the rows of both responses to merge their timelines in O(1) space and O(n) time—not too bad!
The only thing that CouchDB does not guarantee is that both queries represent snapshots of the exact same database state. You mention Quora and that is a perfect example of modern database requirements. In theory, you have no idea how much database state has changed between these two queries. You have no idea, generally, if one view makes any sense compared to the other. In practice, the answer is obvious: Who cares? Queries separated by mere milliseconds will, in reality, make perfect sense together. That is why CouchDB is well-suited for web-applications despite having a severely restricted feature set.
Alternative solution: GeoCouch
The GeoCouch extension is actually a general-purpose 2-dimensional bounding box query engine. Besides, obviously, geospatial data, it can be used, for example, to query logs stored as a
timestamp
xseverity
2-space. However it is currently still a separate project from CouchDB so I would be reluctant to call it a "CouchDB query."可以通过 CouchDB 的 1 个查询来获取两者。两个查询都使用映射/归约查询,尽管您不需要归约函数。
您需要视图行具有
[$type, $topic, $timestamp]
对作为键:此函数可能会这样做:
要查找最新的 N 行(无论是文章还是问题),您基本上需要按降序匹配
[$type, $topic, *]
的行。例如,对于主题X的最新N篇文章,可以这样分解。 (请注意,null
是 CouchDB 中的最小值,对象{}
是最大的值。)descending=true
获取逆时间顺序。 (注意“降序”在概念上意味着沙发从行的“底部”到“顶部”扫描。因此 startkey 和 endkey 是相反的。)startkey=["articles"," X",{}]
,所以这是从时间结束时开始关于X的文章endkey=["articles","X" ,null]
,这是关于X的相同文章以时间limit=N
开始结尾,以减少结果查询因此看起来像这样(记住如有必要,对 URL 进行编码)。
It is possible with 1 query from CouchDB to get both. Both queries use a map/reduce query, although you do not need the reduce function.
You need the view rows to have
[$type, $topic, $timestamp]
pairs for keys:This function might do that:
To find the latest N rows (whether articles or questions), you basically need rows matching
[$type, $topic, *]
in descending order. For example, for the latest N articles on topic X, that breaks down like this. (Note thatnull
is the smallest value in CouchDB and the object{}
is the largest.)descending=true
to get reverse chronological order. (Note "descending" conceptually means couch is scanning from the "bottom" to the "top" of the rows. So startkey and endkey are reversed.)startkey=["articles","X",{}]
, so this is articles about X starting from the end of timeendkey=["articles","X",null]
, this is the same articles about X ending with the beginning of timelimit=N
, to trim the results downThe query would thus look like this (remember to encode the URL if necessary).