Questions about non-relational (NoSQL) databases
Although I've not yet used any of the new NoSQL databases, I've tried to keep myself informed by reading Wikipedia articles and blogs and by peeking into the documentation of some NoSQL databases.
I've just (re)read the August 2009 edition of php|architect, specifically the article about non-relational databases, and a few questions popped up in my head. I understand that the article is pretty light on the subject, but it was enough to get me confused...
CouchDB
My main question regarding CouchDB is: why so much hype? From what I understood, CouchDB provides a web service that lets you create databases and documents inside a database. The documents can have several JSON-encoded attributes and also have special _id and _rev attributes, the latter for tracking revisions of the document.
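To make the role of _rev a bit more concrete, here is a minimal in-memory sketch of the idea. It is not CouchDB's implementation: `TinyDocStore`, `Conflict`, and the way the revision string is derived (a generation counter plus an MD5 of the JSON body) are assumptions made for illustration. What it does reflect is that an update has to carry the revision it was based on, and a stale one is rejected instead of silently overwriting newer data.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when an update does not carry the current revision."""


class TinyDocStore:
    """Hypothetical in-memory store; illustrates the _id/_rev idea only."""

    def __init__(self):
        self._docs = {}

    def put(self, doc):
        doc_id = doc["_id"]
        current = self._docs.get(doc_id)
        # An update must name the revision it was based on.
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise Conflict(doc_id)
        # Assumed revision format: "<generation>-<hash of the body>".
        generation = 1 if current is None else int(current["_rev"].split("-")[0]) + 1
        body = {k: v for k, v in doc.items() if k != "_rev"}
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode()).hexdigest()
        stored = dict(body, _rev=f"{generation}-{digest}")
        self._docs[doc_id] = stored
        return stored["_rev"]

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._docs[doc_id])
```

A writer that fetched the document, changed it, and sent it back with the fetched _rev succeeds; a second writer still holding the older _rev gets a `Conflict` and has to re-read first.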
I really don't get all the fuss about this. Some years ago, for a pet project, I coded a similar (?) system for storing documents, and the structure was something like this:
documents/
    document-name/
        (revision) timestamp/
            (contents) md5-hash.txt
                PHP Serialized Data
I'm sure I'm missing something very fundamental, otherwise (from the viewpoint of a PHP developer) this would have the same benefits as CouchDB and be faster - no need to encode and decode JSON.
Amazon SimpleDB
Now this one really gets my head spinning... The author (Russell Smith) gives the following example:
$sdb->putAttributes('phparch', 'may', array('title' => array('value' => 'May 2009'), 'have' => array('value' => false)));
$sdb->putAttributes('phparch', 'june', array('title' => array('value' => 'June 2009'), 'have' => array('value' => true)));
$sdb->putAttributes('phparch', 'july', array('title' => array('value' => 'July 2009'), 'have' => array('value' => true)));
He then says that Amazon now supports a SQL-like interface and then executes the following query:
$sdb->select('phparch', 'SELECT * FROM phparch WHERE have = "1"');
He doesn't give any analogous example of how to do that query in CouchDB (he leaves some hints on Views and Map/Reduce however) but I suppose it is also possible, so my question is: how does Amazon (and CouchDB) do it?
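For the CouchDB side, the "Views and Map/Reduce" hint can be sketched roughly as follows. Real CouchDB views are JavaScript map functions stored in a design document, with the results kept in an on-disk index; the Python below only emulates the idea, and `map_have`, `build_view`, and `query_view` are hypothetical names. The map function runs over the documents to produce rows sorted by key, and a query is then a lookup in those sorted rows.

```python
import bisect

# The three items from the article's example, as plain dictionaries.
docs = {
    "may": {"title": "May 2009", "have": False},
    "june": {"title": "June 2009", "have": True},
    "july": {"title": "July 2009", "have": True},
}


def map_have(doc_id, doc):
    """Plays the role of a view's map function: emit (key, value) pairs."""
    yield (doc["have"], doc["title"])


def build_view(documents, map_fn):
    """Run map_fn over every document once and keep the rows sorted by key."""
    rows = []
    for doc_id, doc in documents.items():
        for key, value in map_fn(doc_id, doc):
            rows.append((key, doc_id, value))
    rows.sort()
    return rows


def query_view(rows, key):
    """Binary-search the sorted rows for one key; documents are not reopened."""
    i = bisect.bisect_left(rows, (key,))
    matches = []
    while i < len(rows) and rows[i][0] == key:
        matches.append(rows[i])
        i += 1
    return matches


view = build_view(docs, map_have)
owned = query_view(view, True)  # rows for the issues where have is true
```

The expensive pass over all documents happens when the view is built (and, in a real system, incrementally as documents change), not on every query.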
My first guess would be that they open all documents (possibly in a distributed environment) and then apply a reduce operation to filter out the documents whose attributes don't match the search criteria, but wouldn't this be overly expensive (CPU and disk I/O), even with parallel computing?
I know I'm ignoring some important stuff like distribution, consistency and so on, but I'm just trying to grasp the very basic inner workings of NoSQL stores.
PS: Also, can anyone explain to me why both CouchDB and Amazon SimpleDB are built with Erlang?
Comments (1)
The fuss around NoSQL comes down to indexing, availability, and scalability. Indexing is what allows the document-oriented stores to NOT open all documents if you want to get the ones where have = 1. Availability and scalability allow these systems to easily scale out and be robust in the face of unreliable hardware.
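As a rough illustration of the indexing point, here is a toy store that maintains an index on every attribute value at write time. It is a sketch under assumptions, not how SimpleDB or CouchDB are actually built: `IndexedDomain` and its methods are made-up names, and the `reads` counter exists only to show how many items each query strategy has to open.

```python
from collections import defaultdict


class IndexedDomain:
    """Hypothetical store that indexes every attribute value on write."""

    def __init__(self):
        self.items = {}
        self.index = defaultdict(set)  # (attribute, value) -> item names
        self.reads = 0                 # how many items queries had to open

    def put_attributes(self, name, attributes):
        # Drop index entries for the old version before adding the new ones.
        for attr, value in self.items.get(name, {}).items():
            self.index[(attr, value)].discard(name)
        self.items[name] = dict(attributes)
        for attr, value in attributes.items():
            self.index[(attr, value)].add(name)

    def select_scan(self, attr, value):
        """Open every item and filter: the approach the question worries about."""
        hits = []
        for name, item in self.items.items():
            self.reads += 1
            if item.get(attr) == value:
                hits.append(name)
        return sorted(hits)

    def select_indexed(self, attr, value):
        """Look the value up in the index and open only the matching items."""
        hits = sorted(self.index.get((attr, value), ()))
        self.reads += len(hits)
        return hits
```

With the three php|architect items loaded, the scan opens all three to answer have = true, while the indexed lookup opens only the two that match. The cost moves to write time, where each put has to update the index.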
Erlang is designed for multi-processor systems and so is an ideal fit for distributed systems too.