基于文档的nosql(mongodb、couchdb、riak等)查询关系型数据的性能
为了跟进我关于使用 nosql 建模关系数据的问题,我阅读了几篇关于该主题的文章:
他们似乎表明 nosql 可以处理规范化的关系数据。
那么让我们继续之前的例子,一个 CMS 系统,有两种类型的数据:文章和作者,其中文章有作者的引用(通过 ID)。
以下是系统需要支持的操作:
- 按 id 获取文章以及作者
- 获取特定作者的所有文章
- 查找作者按创建日期排序的前 10 篇文章
我想了解性能与相同数据存储在 RDBMS 上的相同操作进行比较。 特别是,请指定操作是否使用 MapReduce、需要多次访问 nosql 存储(链接)或 预加入
我想限制讨论基于文档 nosql 解决方案,如 mongodb, couchdb 和 riak。
编辑 1:
Spring-data 项目 在 Riak 和 Mongodb 上可用
To follow up on my question on modeling relational data with nosql, I have read several articles on the subject:
Nosql doesn't mean non-relational
They seem to suggest that nosql can handle normalized, relational data.
So let's continue with the example I had before, a CMS system that have two types of data: article and authors, where article has an reference (by ID) to author.
Below are the operations the system needs to support:
- Fetch a article by id along with the author
- Fetch all articles by particular author
- Find the first 10 article(s) with the author(s) sorted by creation date
I would like to understand the performance of these operation when compare to the same operation if the same data were stored on RDBMS. In particular, please specify if the operation uses MapReduce, require multple trips to the nosql store (Links), or pre-join
I would like to limit to discussion to document-based nosql solution like mongodb, couchdb, and riak.
Edit 1:
Spring-data project is avalible on Riak and Mongodb
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
只是想为任何可能好奇的人提供一个 CouchDB 答案。 :)
正如上面第一个答案中提到的,将作者文档嵌入到文章文档中是不明智的,因此下面的示例假设两种文档类型:文章和作者。
CouchDB 使用通常用 JavaScript 编写的 MapReduce 查询(但也可以使用 Python、Ruby、Erlang 等)。 MapReduce 查询的结果在第一次请求时存储在索引中,并且该存储的索引用于将来的所有查找。根据进一步的请求,对数据库的更改将添加到索引中。
CouchDB 的 API 完全基于 HTTP,因此对数据库的所有请求都是各种 URL 上的 HTTP 动词(GET、POST、PUT、DELETE)。我将列出 MapReduce 查询(用 JavaScript 编写)以及用于从索引请求相关结果的 URL。
1.通过 id 和作者获取文章
最简单的方法是两次直接文档查找:
...其中 {author_id} 是从文章的author_id 字段获取的值。
2.获取特定作者的所有文章
MapReduce
...其中 {author_id} 是作者的实际 ID。
3.查找作者按创建日期排序的前 10 篇文章
MapReduce
可以在视图请求中使用
?include_docs=true
在 CouchDB 中包含样式“joins” 。如果您在发出的值一侧包含“_id”键(第二个参数),则将include_docs=true
添加到您的查询参数中将包含由指定“_id”引用的文档在上面的情况下,我们将文档自己的“_id”(我们不再需要)替换为引用作者的“_id”(文章文档中“author_id”的值)。请求前 10 篇文章及其相关作者信息如下所示:请求该 URL 将返回最近 10 篇文章的列表,其格式类似于:
使用相同的索引,您可以获得任何年份的所有文档的列表,月、日、小时等粒度,有或没有作者数据。
还有一些方法可以使用视图排序规则将多个文档聚合在一起形成一个文档(例如 CMS 中引用不同内容的页面)。在我 7 月份为 CouchConf 制作的这些幻灯片中,有一些关于如何做到这一点的信息: http://www.slideshare .net/Couchbase/couchconfsfdesigningcouchbasedocuments
如果您有任何其他问题,请告诉我。
Just wanted to toss in a CouchDB answer for anyone who might be curious. :)
As mentioned in the first answer above, embedding the author document into the article document is unwise, so the examples below assume two document types: articles and authors.
CouchDB uses MapReduce queries typically written in JavaScript (but Python, Ruby, Erlang and others are availble). The results of a MapReduce query are stored in an index upon their first request and that stored index is used to for all future look-ups. Changes to the database are added to the index upon further requests.
CouchDB's API is completely HTTP-based, so all requests to the database are HTTP verbs (GET, POST, PUT, DELETE) at various URLs. I'll be listing both the MapReduce queries (written in JavaScript) along with the URL used to request related results from the index.
1. Fetch a article by id along with the author
The simplest method for doing this is two direct document lookups:
...where {author_id} is the value obtained from the article's author_id field.
2. Fetch all articles by particular author
MapReduce
...where {author_id} is the actual ID of the author.
3. Find the first 10 article(s) with the author(s) sorted by creation date
MapReduce
It's possible to do include style "joins" in CouchDB using
?include_docs=true
in a view request. If you include a "_id" key in the value side of the emit (the second argument), then addinginclude_docs=true
to your query parameters will include the doc referenced by the specified "_id" In the case above, we're replacing the document's own "_id" (which we don't need anymore) with the referenced author's "_id" (the value of "author_id" in the article document). Requesting the top 10 articles with their related author info looks like this:Requesting that URL will return a list of the most recent 10 articles in a format similar to:
Using this same index you can get a list of all the documents any any year, month, day, hour, etc granularity with or without author data.
There are also methods for using view collation to aggregate several documents together out of a single document (like a page in a CMS referencing disparate content). There's some info on how to do that in these slides I did for CouchConf in July: http://www.slideshare.net/Couchbase/couchconfsfdesigningcouchbasedocuments
If you have any other questions, please let me know.
SQL:
MongoDB:
SQL:
MongoDB:
SQL:
MongoDB:
articles.find().sort().limit(10)
,authors.find({$in:[article_authors]})
摘要
需要额外的查询,但在某些情况下,MongoDB 通过网络返回的数据较少(没有重复的条目)。所有要加入的数据都位于同一个盒子上。如果作者和文章位于不同的位置,那么无论如何,
MongoDB 往往会获得更好的“原始”性能,因为它不会在每次写入时刷新到磁盘(所以它实际上是一个“持久性”权衡)。它还有一个更小的查询解析器,因此每个查询的活动更少。
从基本性能的角度来看,这些事情非常相似。他们只是对您的数据和您想要做出的权衡做出不同的假设。
SQL:
MongoDB:
SQL:
MongoDB:
SQL:
MongoDB:
articles.find().sort().limit(10)
,authors.find({$in:[article_authors]})
Summary
In two cases MongoDB requires an extra query, but does most of the same total work underneath. In some cases MongoDB returns less data over the network (no repeated entries). The join queries tend to be limited by the requirement that all the data to join live on the same box. If Authors and Articles live in different places, then you end up doing two queries anyways.
MongoDB tends to get better "raw" performance because it doesn't flush to disk with every write (so it's actually a "durability" tradeoff). It also has a much smaller query parser, so there's less activity per query.
From a basic performance standpoint these things are very similar. They just make different assumptions about your data and the trade-offs you want to make.
对于 MongoDB,您不会使用嵌入式文档作为作者记录。所以预连接已经结束,需要多次访问数据库。但是,您可以缓存作者,并且只需为每个记录进行第二次访问。您指出的查询在 MongoDB 中非常简单。
如果您使用 ORM/ODM 来管理应用程序中的实体,这将是透明的。不过,这将是两次数据库之旅。不过,它们的反应应该很快,两次点击根本不应该被注意到。
查找给定作者的文章只是相反的...
因此,两次查询但单个作者的获取应该很快并且可以轻松缓存。
最后,再次提出两个查询,但只是稍微复杂一点。您可以在 mongo shell 中运行这些命令以查看结果。
我不确定这是否值得编写一个地图缩减来完成。几次快速往返可能会有更多的延迟,但 mongo 协议相当快。我不会对此过于担心。
最后,这样做的实际性能影响...由于理想情况下您只会查询文档中的索引字段,因此它应该非常快。唯一的额外步骤是第二次往返以获取其他文档,具体取决于您的应用程序和数据库的结构,这可能根本不是什么大问题。您可以告诉 mongo 仅分析超过给定阈值(打开时默认为 100 或 200 毫秒)的查询,这样您就可以随着数据的增长密切关注程序所花费的时间。
RDMS 不提供的功能是更容易分解数据。当您将应用程序扩展到 CMS 之外以支持其他事物但使用相同的身份验证存储时,会发生什么情况?它现在恰好是一个完全独立的数据库,在许多应用程序之间共享。跨数据库执行这些查询要简单得多 - 对于 RDMS 存储,这是一个复杂的过程。
我希望这对您的 NoSQL 发现有所帮助!
For MongoDB, you wouldn't use embedded documents for the author record. So the pre-join is out, it's multiple trips to the DB. However, you can cache the author and only need to make that second trip once for each record. The queries you indicated are pretty trivial in MongoDB.
If you are using an ORM/ODM to manage your entities within your application, this would transparent. It would be two trips to the db though. They should be fast responses though, two hits shouldn't be noticeable at all.
Finding articles by a given author is just reverse...
So again, two queries but the single author fetch should be fast and can easily be cached.
Lastly, again, two queries but just a tiny bit more complex. You can run these in a mongo shell to see what the results might be like.
I'm not sure this is worth writing a map reduce to complete. A couple quick round trips might have a little more latency but the mongo protocol is pretty fast. I wouldn't be overly worried about it.
Lastly, real performance implications of doing it this way... Since ideally you'd only be querying on indexed fields in the document, it should be pretty quick. The only additional step is a second round trip to get the other documents, depending how your application and db is structures, this is likely not a big deal at all. You can tell mongo to only profile queries that take over a given threshold (100 or 200ms by default when turned on), so you can keep an eye on what's taking time for your program as data grows.
The one befit you have here that an RDMS does not offer is much easier breaking apart of data. What happens when you expand your application beyond CMS to support other things but uses the same authentication store? It just happens to be a completely separate DB now, that's shared across many applications. It's much simpler to perform these queries across dbs - with RDMS stores it's a complex process.
I hope this helps you in your NoSQL discovery!