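Principles for Modeling CouchDB Documents
I have a question that I have been trying to answer for some time but can't figure out:
How do you design, or divide up, CouchDB documents?
Take a blog post, for example.
The semi-"relational" way to do it would be to create a few objects: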
- Post
- User
- Comment
- Tag
- Snippet
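This makes a great deal of sense. But I am trying to use CouchDB (for all the reasons that it's great) to model the same thing, and it has been extremely difficult.
Most of the blog posts out there give you an easy example of how to do this. They basically divide it up the same way, but say you can add "arbitrary" properties to each document, which is definitely nice. So you'd have something like this in CouchDB:
- Post (with tags and snippets as "pseudo" models in the doc)
- Comment
- User
Some people would even say you could throw the Comment and User in there too, so you'd have this: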
post {
  id: 123412804910820
  title: "My Post"
  body: "Lots of Content"
  html: "<p>Lots of Content</p>"
  author: {
    name: "Lance"
    age: "23"
  }
  tags: ["sample", "post"]
  comments {
    comment {
      id: 93930414809
      body: "Interesting Post"
    }
    comment {
      id: 19018301989
      body: "I agree"
    }
  }
}
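This looks very nice and is easy to understand. I also understand how you could write views that extract just the comments from all your post documents to get them into Comment models, and the same with users and tags.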
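A view of that kind could be sketched roughly as below. This is only an illustration under assumptions: the `type` field, the sample documents, the helper name `extractComments`, and the local `emit` stand-in are all hypothetical, and the embedded comments are written as a real JSON array rather than the pseudo-notation above.

```javascript
// Collect the comments embedded in post documents, the way a CouchDB map
// function could. `emit` is a local stand-in for CouchDB's built-in emit().
function extractComments(docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });

  // Map function as it might appear in a design document (assumed shape).
  const map = function (doc) {
    if (doc.type === "post" && Array.isArray(doc.comments)) {
      doc.comments.forEach(function (comment) {
        // Key: [post id, comment id]; value: the embedded comment itself.
        emit([doc._id, comment.id], comment);
      });
    }
  };

  docs.forEach(map);
  return rows;
}

// Hypothetical sample data modeled on the example post.
const embeddedPosts = [
  {
    _id: "123412804910820",
    type: "post",
    title: "My Post",
    comments: [
      { id: "93930414809", body: "Interesting Post" },
      { id: "19018301989", body: "I agree" },
    ],
  },
  { _id: "18091890192984", type: "post", title: "Second Post", comments: [] },
];

const extracted = extractComments(embeddedPosts);
console.log(extracted.map((row) => row.value.body));
```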
But then I think, "why not just put my whole site into a single document?":
site {
  domain: "www.blog.com"
  owner: "me"
  pages {
    page {
      title: "Blog"
      posts {
        post {
          id: 123412804910820
          title: "My Post"
          body: "Lots of Content"
          html: "<p>Lots of Content</p>"
          author: {
            name: "Lance"
            age: "23"
          }
          tags: ["sample", "post"]
          comments {
            comment {
              id: 93930414809
              body: "Interesting Post"
            }
            comment {
              id: 19018301989
              body: "I agree"
            }
          }
        }
        post {
          id: 18091890192984
          title: "Second Post"
          ...
        }
      }
    }
  }
}
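You could easily make views to find what you wanted with that.
So the question I have is: how do you determine when to divide a document into smaller documents, or when to make "RELATIONS" between documents?
I think it would be much more "object oriented", and easier to map to value objects, if it were divided like so: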
posts {
  post {
    id: 123412804910820
    title: "My Post"
    body: "Lots of Content"
    html: "<p>Lots of Content</p>"
    author_id: "Lance1231"
    tags: ["sample", "post"]
  }
}
authors {
  author {
    id: "Lance1231"
    name: "Lance"
    age: "23"
  }
}
comments {
  comment {
    id: "comment1"
    body: "Interesting Post"
    post_id: 123412804910820
  }
  comment {
    id: "comment2"
    body: "I agree"
    post_id: 123412804910820
  }
}
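...but then it starts looking more like a relational database. And oftentimes I inherit something that looks like a "whole site in a document", so it is more difficult to model it with relations.
I've read a lot about how and when to use relational databases versus document databases, so that's not the main issue here. I'm more wondering what good rules or principles to apply when modeling data in CouchDB.
Another example is XML files/data. Some XML data is nested more than 10 levels deep, and I would like to visualize it using the same client (Ajax on Rails, for instance, or Flex) that I would use to render JSON from ActiveRecord, CouchRest, or any other object-relational mapper. Sometimes I get huge XML files that contain the entire site structure, like the one below, and I need to map them to value objects to use in my Rails app so I don't have to write another way of serializing/deserializing data: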
<pages>
  <page>
    <subPages>
      <subPage>
        <images>
          <image>
            <url/>
          </image>
        </images>
      </subPage>
    </subPages>
  </page>
</pages>
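So the general CouchDB questions are:
- What rules/principles do you use to divide up your documents (relationships, etc.)?
- Is it OK to put an entire site into one document?
- If so, how do you handle serializing/deserializing documents with arbitrary depth levels (like the large JSON example above, or the XML example)?
- Or do you not turn them into VOs at all, and just decide "these are too nested to object-relational map, so I'll access them using raw XML/JSON methods"?
Thanks a lot for your help. The issue of how to divide up your data with CouchDB has been difficult for me to settle with "this is how I should do it from now on". I hope to get there soon.
I have studied the following sites/projects:
- Hierarchical Data in CouchDB
- CouchDB Wiki
- Sofa - CouchDB App
- CouchDB: The Definitive Guide
- PeepCode CouchDB Screencast
- CouchRest
- CouchDB README (http://svn.apache.org/repos/asf/couchdb/trunk/README)
...but they still haven't answered this question.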
There have been some great answers to this already, but I wanted to add some more recent CouchDB features to the mix of options for working with the original situation described by viatropos.
The key point at which to split up documents is where there might be conflicts (as mentioned earlier). You should never keep massively "tangled" documents together in a single document as you'll get a single revision path for completely unrelated updates (comment addition adding a revision to the entire site document for instance). Managing the relationships or connections between various, smaller documents can be confusing at first, but CouchDB provides several options for combining disparate pieces into single responses.
The first big one is view collation. When you emit key/value pairs into the results of a map/reduce query, the keys are sorted based on UTF-8 collation ("a" comes before "b"). You can also output complex keys from your map/reduce as JSON arrays: ["a", "b", "c"]. Doing that would allow you to include a "tree" of sorts built out of array keys. Using your example above, we can output the post_id, then the type of thing we're referencing, then its ID (if needed). If we then output the id of the referenced document into an object in the value that's returned, we can use the 'include_docs' query param to include those documents in the map/reduce output.
Requesting that same view with '?include_docs=true' will add a 'doc' key that will either use the '_id' referenced in the 'value' object or, if that isn't present in the 'value' object, the '_id' of the document from which the row was emitted (in this case the 'post' document). Please note, these results would include an 'id' field referencing the source document from which the emit was made. I left it out for space and readability.
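The array-key approach described above might be sketched like this. It is not the original answer's code: the document shapes, the `type` field, the helper `runPostsView`, and the local `emit` stand-in are assumptions made so the snippet can run outside CouchDB. The post emits a row for itself and a row linking to its author, while each comment emits its own row keyed under its post.

```javascript
// Run a CouchDB-style map function over some documents and collect the rows.
// `emit` is a local stand-in for the function CouchDB provides to map code.
function runPostsView(docs) {
  const rows = [];
  let sourceId = null;
  const emit = (key, value) => rows.push({ id: sourceId, key, value });

  // Map function as it might appear in a design document (assumed shape).
  const map = function (doc) {
    if (doc.type === "post") {
      // No _id in the value: include_docs=true would attach the post itself.
      emit([doc._id, "post"], null);
      // An _id in the value: include_docs=true would attach the author doc.
      emit([doc._id, "author", doc.author_id], { _id: doc.author_id });
    } else if (doc.type === "comment") {
      // Keyed under the parent post so it sorts next to the post's rows.
      emit([doc.post_id, "comment", doc._id], null);
    }
  };

  docs.forEach(function (doc) {
    sourceId = doc._id;
    map(doc);
  });
  return rows;
}

// Hypothetical documents, split up as in the question's last example.
const splitDocs = [
  { _id: "123412804910820", type: "post", title: "My Post", author_id: "Lance1231" },
  { _id: "Lance1231", type: "author", name: "Lance" },
  { _id: "comment1", type: "comment", post_id: "123412804910820", body: "Interesting Post" },
  { _id: "comment2", type: "comment", post_id: "123412804910820", body: "I agree" },
];

const postRows = runPostsView(splitDocs);
console.log(JSON.stringify(postRows, null, 2));
```

Emitting the comment rows from the comment documents, rather than from a list of comment ids stored on the post, is a design choice of this sketch: adding a comment then never touches the post document.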
We can then use the 'start_key' and 'end_key' parameters to filter the results down to a single post's data, or even to extract just the list for a certain type.
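As a rough sketch of those two range queries: the code below builds the query strings and, to show why the ranges select what they do, filters some sample keys with a simplified model of view collation. The database, design document, and view names in the URL are hypothetical, and the comparator only models the ordering of JSON types; real CouchDB compares strings with ICU rules, which this does not reproduce. `startkey`/`endkey` are used here as the canonical spellings, with the underscore forms accepted as aliases as far as I know.

```javascript
// Simplified collation: null < false < true < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // plain objects sort after everything else
}

function collate(a, b) {
  const ra = typeRank(a);
  const rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : a > b ? 1 : 0; // not ICU, just code units
  if (ra === 5) {
    const n = Math.min(a.length, b.length);
    for (let i = 0; i < n; i++) {
      const c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length; // a shorter array sorts before its extensions
  }
  return 0; // same-rank null/booleans; objects are not compared further here
}

// Keep the keys that fall inside [startkey, endkey], both ends inclusive.
function range(keys, startkey, endkey) {
  return keys.filter((k) => collate(k, startkey) >= 0 && collate(k, endkey) <= 0);
}

// Keys must be JSON-encoded, then URL-encoded, in the query string.
function viewUrl(base, startkey, endkey) {
  return base +
    "?startkey=" + encodeURIComponent(JSON.stringify(startkey)) +
    "&endkey=" + encodeURIComponent(JSON.stringify(endkey));
}

const keys = [
  ["123412804910820", "author", "Lance1231"],
  ["123412804910820", "comment", "comment1"],
  ["123412804910820", "comment", "comment2"],
  ["123412804910820", "post"],
  ["18091890192984", "post"],
];

// Everything belonging to one post: {} sorts after any string in slot two.
const onePost = range(keys, ["123412804910820"], ["123412804910820", {}]);
// Only that post's comments.
const onlyComments = range(
  keys,
  ["123412804910820", "comment"],
  ["123412804910820", "comment", {}]
);

// Hypothetical database/design-doc/view names.
const url = viewUrl("/blog/_design/blog/_view/posts", ["123412804910820"], ["123412804910820", {}]);
console.log(onePost.length, onlyComments.length, url);
```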
These query param combinations are possible because an empty object ("{}") is always at the bottom of the collation and null or "" are always at the top.
The second helpful addition from CouchDB in these situations is the _list function. This would allow you to run the above results through a templating system of some kind (if you want HTML, XML, CSV or whatever back), or to output a unified JSON structure if you want to be able to request an entire post's content (including author and comment data) with a single request and have it returned as a single JSON document that matches what your client-side/UI code needs. Doing that would allow you to request the post's unified output document with a single call to the list function.
Your _list function (in this case named "unified") would take the results of the view map/reduce (in this case named "posts") and run them through a JavaScript function that would send back the HTTP response in the content type you need (JSON, HTML, etc).
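A list function along those lines might look like the sketch below. The body of `unified` is an assumption, not the original author's code, and the `start`, `getRow`, and `send` definitions are local stubs for the helpers CouchDB makes available inside list functions, added only so the sketch can run on its own. It expects rows shaped like [post_id, type, id] keys, requested with include_docs=true. Newer CouchDB releases mark list functions as deprecated, so it is worth checking your version.

```javascript
// Local stubs for the helpers CouchDB provides inside a _list function.
var queue = [];             // rows the view would hand to the list function
var output = "";            // what would become the HTTP response body
var responseHeaders = null; // headers passed to start()
function start(init) { responseHeaders = init.headers; }
function getRow() { return queue.length ? queue.shift() : null; }
function send(chunk) { output += chunk; }

// The list function itself: folds the view rows into one JSON document.
function unified(head, req) {
  start({ headers: { "Content-Type": "application/json" } });
  var post = null;
  var author = null;
  var comments = [];
  var row;
  while ((row = getRow())) {
    var kind = row.key[1]; // second key element names the kind of row
    if (kind === "post") post = row.doc;
    else if (kind === "author") author = row.doc;
    else if (kind === "comment") comments.push(row.doc);
  }
  send(JSON.stringify({ post: post, author: author, comments: comments }));
}

// Hypothetical rows, as the view might return them for a single post.
queue = [
  { key: ["123412804910820", "author", "Lance1231"], doc: { _id: "Lance1231", name: "Lance" } },
  { key: ["123412804910820", "comment", "comment1"], doc: { _id: "comment1", body: "Interesting Post" } },
  { key: ["123412804910820", "comment", "comment2"], doc: { _id: "comment2", body: "I agree" } },
  { key: ["123412804910820", "post"], doc: { _id: "123412804910820", title: "My Post" } },
];

unified({}, {});
var result = JSON.parse(output);
console.log(result);
```

In CouchDB such a function is requested through a path of the form /{db}/_design/{ddoc}/_list/unified/posts, with the view's query parameters appended.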
Combining these things, you can split up your documents at whatever level you find useful and "safe" for updates, conflicts, and replication, and then put them back together as needed when they're requested.
Hope that helps.
I know this is an old question, but I came across it while trying to figure out the best approach to this exact same problem. Christopher Lenz wrote a nice blog post about methods of modeling "joins" in CouchDB. One of my take-aways was: "The only way to allow non-conflicting addition of related data is by putting that related data into separate documents." So, for simplicity's sake, you'd want to lean towards "denormalization". But you'll hit a natural barrier due to conflicting writes in certain circumstances.
In your example of Posts and Comments, if a single post and all of its comments lived in one document, then two people trying to post a comment at the same time (i.e. against the same revision of the document) would cause a conflict. This would get even worse in your "whole site in a single document" scenario.
So I think the rule of thumb would be "denormalize until it hurts", but the point where it will "hurt" is where you have a high likelihood of multiple edits being posted against the same revision of a document.
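To make the "same revision" problem concrete, here is a toy in-memory model of revision checking. It is a sketch under assumptions, not CouchDB itself: `TinyStore` is a hypothetical class, revisions are plain counters rather than CouchDB's "N-hash" strings, and the 409 status simply mirrors the conflict response CouchDB gives when an update carries a stale revision.

```javascript
// Toy document store that rejects writes made against a stale revision.
class TinyStore {
  constructor() {
    this.docs = new Map();
  }

  get(id) {
    // Return a deep copy so callers edit their own snapshot.
    return JSON.parse(JSON.stringify(this.docs.get(id)));
  }

  put(doc) {
    const current = this.docs.get(doc._id);
    const currentRev = current ? current._rev : undefined;
    // The writer must present the revision it read; otherwise: conflict.
    if (doc._rev !== currentRev) return { ok: false, status: 409 };
    const nextRev = String(current ? parseInt(current._rev, 10) + 1 : 1);
    this.docs.set(doc._id, Object.assign({}, doc, { _rev: nextRev }));
    return { ok: true, rev: nextRev };
  }
}

const store = new TinyStore();

// Case 1: comments embedded in the post document.
store.put({ _id: "post1", title: "My Post", comments: [] });
const a = store.get("post1"); // first commenter reads revision 1
const b = store.get("post1"); // second commenter reads revision 1 too
a.comments.push("Interesting Post");
b.comments.push("I agree");
const r1 = store.put(a); // accepted, post moves to revision 2
const r2 = store.put(b); // rejected, b still carries revision 1

// Case 2: each comment is its own document, so the writes never collide.
const c1 = store.put({ _id: "comment1", post_id: "post1", body: "Interesting Post" });
const c2 = store.put({ _id: "comment2", post_id: "post1", body: "I agree" });

console.log(r1, r2, c1, c2);
```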
The book says, if I recall correctly, to denormalize until "it hurts", while keeping in mind the frequency with which your documents might be updated.
As a rule of thumb, I include all data that is needed to display a page regarding the item in question. In other words, everything you would print on a real-world piece of paper that you would hand to somebody. E.g. a stock quote document would include the name of the company, the exchange, the currency, in addition to the numbers; a contract document would include the names and addresses of the counterparties, all information on dates and signatories. But stock quotes from distinct dates would form separate documents, separate contracts would form separate documents.
As for putting an entire site into one document: no, that would be silly.
I think Jake's response nails one of the most important aspects of working with CouchDB that may help you make the scoping decision: conflicts.
In the case where you have comments as an array property of the post itself, and you just have a 'post' DB with a bunch of huge 'post' documents in it, as Jake and others correctly pointed out you could imagine a scenario on a really popular blog post where two users submit edits to the post document simultaneously, resulting in a collision and a version conflict for that document.
ASIDE: As this article points out, also consider that each time you request or update that doc you have to get/set the document in its entirety, so passing around massive documents that represent either the entire site or a post with a lot of comments on it can become a problem you would want to avoid.
In the case where posts are modeled separately from comments and two people submit a comment on a story, those simply become two "comment" documents in that DB, with no issue of conflict; just two PUT operations to add two new comments to the "comment" db.
Then to write the views that give you back the comments for a post, you would pass in the postID and then emit all the comments that reference that parent post ID, sorted in some logical ordering. Maybe you even pass in something like [postID,byUsername] as the key to the 'comments' view to indicate the parent post and how you want the results sorted or something along those lines.
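A comments view of that kind could be sketched as follows. The `username` field, the sample documents, and the helper `queryComments` are hypothetical, and the sort and filter stand in for what CouchDB's view index and a startkey/endkey range on the post id would do; plain string comparison is used here instead of CouchDB's real collation.

```javascript
// Emulate a "comments" view keyed by [post_id, username] and query one post.
function queryComments(docs, postId) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value }); // stand-in for emit()

  // Map function as it might appear in a design document (assumed shape).
  const map = function (doc) {
    if (doc.type === "comment") {
      emit([doc.post_id, doc.username], doc.body);
    }
  };
  docs.forEach(map);

  // Order rows by key: first by post id, then by username.
  const cmp = (x, y) => (x < y ? -1 : x > y ? 1 : 0);
  rows.sort((r1, r2) => cmp(r1.key[0], r2.key[0]) || cmp(r1.key[1], r2.key[1]));

  // Equivalent in spirit to a key range covering [postId] .. [postId, {}].
  return rows.filter((row) => row.key[0] === postId);
}

// Hypothetical comment documents.
const commentDocs = [
  { _id: "comment2", type: "comment", post_id: "123412804910820", username: "zoe", body: "I agree" },
  { _id: "comment1", type: "comment", post_id: "123412804910820", username: "adam", body: "Interesting Post" },
  { _id: "comment3", type: "comment", post_id: "18091890192984", username: "mia", body: "Nice" },
];

const forPost = queryComments(commentDocs, "123412804910820");
console.log(forPost);
```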
MongoDB handles documents a bit differently, allowing indexes to be built on sub-elements of a document, so you might see the same question on the MongoDB mailing list and someone saying "just make the comments a property of the parent post".
Because of the write locking and single-master nature of Mongo, the conflicting-revision issue of two people adding comments wouldn't spring up there, and the query-ability of the content, as mentioned, isn't affected too badly because of sub-indexes.
That being said, if your sub-elements in either DB are going to be huge (say tens of thousands of comments), I believe it is the recommendation of both camps to make those separate elements; I have certainly seen that to be the case with Mongo, as there are some upper-bound limits on how big a document and its sub-elements can be.