Why are key/value pair NoSQL databases faster than traditional relational databases?
It has been recommended to me that I investigate Key/Value pair data systems to replace a relational database I have been using.
What I don't quite understand is how this improves query efficiency. From what I understand, you throw away a lot of information that would help make queries more efficient by simply turning your structured database into one big long list of keys and values?
Have I missed the point completely?
Answers (4)
The key advantage of a relational database is the ability to relate and index information. Most 'NoSQL' systems don't provide a relational algebra or a great query language.
What you need to ask yourself is, does switching make sense for my intended use case?
You have kind of missed the point. The point is that sometimes you don't have an index (at least not in the way you do with a general relational DB). Even when you do have an index, relating indexed data together is difficult, and that is exactly what relational databases excel at. NoSQL solutions offer a number of novel structures that make many use cases trivially easy. For example, Redis is a data-structure-oriented DB well suited to rapidly building anything with queues or its pub-sub architecture. MongoDB is a freeform document database that stores documents as JSON (BSON) and excels at rapid development. BigTable-style solutions are a little less structured than that, but they expand the idea of a row to include families of columns: key-value pairs contained in each row and arranged efficiently on disk. You can build an inverted index on top of this with a technology like ElasticSearch.
Not everything needs the consistency guarantees or disk layout of a traditional RDBMS. Another major use case for NoSQL is massive scalability: many solutions (e.g. the BigTable-style stores HBase and Cassandra) are designed to shard and scale horizontally with ease (not so easy with SQL!). Cassandra in particular is designed to have no single point of failure (SPOF). Further, column-oriented datastores are meant to optimize disk speed via sequential reads (and to reduce write amplification). That being said, unless you really need these properties, a traditional SQL server is generally good enough.
There are advantages and disadvantages. Personally, I use a mix of both. Use the right tool for the right job, which may end up being PostgreSQL or MySQL more often than not.
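To make the horizontal sharding idea concrete, here is a minimal sketch in Python. It is not how Cassandra or HBase actually place data (Cassandra, for instance, uses consistent hashing over a token ring); it only shows the simpler hash-modulo idea. The names `shard_for`, `put`, `get`, and the four in-memory "nodes" are made up for illustration.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical cluster of four nodes, each modeled as an in-memory dict.
shards = [dict() for _ in range(4)]

def put(key: str, value) -> None:
    # Only the node that owns the key is touched; no cross-node coordination.
    shards[shard_for(key, len(shards))][key] = value

def get(key: str):
    # A lookup by key goes straight to the single node that owns it.
    return shards[shard_for(key, len(shards))].get(key)

put("user:alice", {"email": "alice@example.com"})
print(get("user:alice"))
```

Because each key maps to exactly one node, lookups by key scale out by adding nodes. A join across keys has no such shortcut, which is part of why relational workloads are harder to shard.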
You can liken a basic key-value system to making an SQL table with two columns, a unique key and a value. This is quite fast. You have no need to do any relations or correlations or collation of data. Just find the value and return it. This is an oversimplification; NoSQL databases do have a lot of interesting functionality and applications beyond simple K,V stores.
I don't know if your scientific data is well suited to most NoSQL implementations; that depends on the data. If you look at HBase or Cassandra, it may well suit a scientist's needs (with proper row key design: the timestamp must not come first; check out OpenTSDB). I know of many companies that store sensor readings in Cassandra by using a random-order partitioner and the UUID of the sensor to roll up readings into daily fat rows. Every day new databases are created around specific use cases, so that answer may change. For specific use cases, you can reap huge rewards from using specific datastores, at the cost of flexibility and tooling.
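The two-column analogy can be sketched with Python's built-in `sqlite3` module. The table name `kv` and the helpers `kv_put` and `kv_get` are assumptions for illustration, not part of any particular product.

```python
import sqlite3

# An in-memory SQLite database holding a single two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def kv_put(key: str, value: str) -> None:
    # Insert the pair, overwriting any existing value for the same key.
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))

def kv_get(key: str):
    # One primary-key lookup: no joins, no sorting, no aggregation.
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

kv_put("user:42", '{"name": "Ada"}')
print(kv_get("user:42"))
```

Everything a query planner would normally reason about is absent here: the only access path is the primary key.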
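As a rough illustration of the daily fat row idea, the sketch below models a wide-column store as a dict of dicts. The row key puts the sensor ID first and a day bucket second, so writes are spread across sensors rather than piling onto the newest timestamp. The key format, the function names, and the sensor ID are all assumptions, not Cassandra's or OpenTSDB's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

# rows[row_key][column] = value, loosely mimicking a wide-column layout.
rows = defaultdict(dict)

def row_key(sensor_id: str, ts: datetime) -> str:
    # Sensor ID first, day bucket second: one "fat row" per sensor per day.
    return f"{sensor_id}:{ts.strftime('%Y%m%d')}"

def record(sensor_id: str, ts: datetime, value: float) -> None:
    # Each reading becomes a column in that day's row, keyed by its timestamp.
    rows[row_key(sensor_id, ts)][ts] = value

def readings_for_day(sensor_id: str, day: datetime):
    # A single row fetch returns the whole day, ordered by timestamp.
    return sorted(rows.get(row_key(sensor_id, day), {}).items())

sensor = "3f2b8c1e-0000-4000-8000-000000000001"  # hypothetical sensor UUID
record(sensor, datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), 21.5)
record(sensor, datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), 20.9)
record(sensor, datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc), 19.7)
print(readings_for_day(sensor, datetime(2024, 1, 1, tzinfo=timezone.utc)))
```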
The efficiency comes from three main areas:
To my eye, someone coming to you with a requirement that "our new data will be too much for our RDBMS" ought either to have numbers to back that assertion up or to admit they just want to try the new shiny. Is NoSQL meritless? Probably not. Is it going to turn the world upside-down as Java 1.0 was hyped to? Probably not.
There's no harm in investigating new things; just don't bet the farm on them at the expense of 50-year-old, well-established, well-understood technology.
Here I'm assuming that you want to optimize one particular query, which is simply looking up a record by key. One example of this might be looking up a userinfo record by username. For some systems a query like that has to be incredibly fast and all other queries are unimportant.
The biggest factor in database performance will be the number of I/O operations required to read/write data. Most database systems use similar data structures (i.e. B-trees), which can retrieve uncached data in O(log(n)) I/Os. In order to give durable updates, the data will have to be written to disk: most systems do that sequentially, which is the fastest way.
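To put rough numbers on the O(log(n)) claim, the sketch below estimates how many levels a B-tree needs for a given number of keys, treating each level as one page read for an uncached lookup. The fanout of 100 keys per page is an assumed, illustrative figure; real page sizes and key sizes vary.

```python
def btree_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels L such that fanout**L >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

# Assumed fanout of 100 keys per page, for illustration only.
for n in (1_000_000, 1_000_000_000):
    print(n, "keys ->", btree_levels(n, 100), "page reads per uncached lookup")
```

Integer multiplication is used instead of `math.log` to avoid floating-point rounding at exact powers of the fanout. The takeaway is that going from a million to a billion keys adds only a couple of page reads, so a store has to win elsewhere (caching, layout, less per-query overhead) to be noticeably faster on single-key lookups.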
So, where can a Key-Value store get efficiencies?
Most RDBMS systems are built on top of something which looks like a key-value store so you could view this as cutting out the middleman.
There are a lot of good observations above, and sometimes a little too much passion from proponents on both sides. Let's get back to your original question. Suppose you do a design on Cassandra and an identical design on an RDBMS: you have a set of KV pairs in Cassandra and an identical set of KV pairs on relational. (It is actually possible to do this, say, as a fully denormalized name-value pair table on relational.) Even so, relational will run slower simply because of the overhead of the relational DBMS: logging, catalog access, integrity checking, transaction atomicity, etc. In addition, in a column-family data store the data is lexicographically sorted; in relational it is not. I believe that several of the social networking sites did this: they built identical structures on both, and relational was slower.
It is important to remember that after a user queries the product database, looks at who also bought this or that, and builds their shopping cart and wishlist, all of which will be done on NoSQL, the transaction will be run on a relational database when the user hits the checkout button. Why can't we so-called experts realize that it is not one versus the other in this database debate, but rather that there is a place for relational, as there is for NoSQL, graph, inverted column databases, multidimensional, etc., and even files?
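One practical consequence of lexicographically sorted keys is that all keys sharing a prefix sit next to each other, so a prefix query is a binary search followed by a short sequential scan. The sketch below shows this with Python's `bisect` module over an in-memory sorted list; the key names and the `prefix_scan` helper are made up for illustration and do not reflect any store's real API.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix from a lexicographically sorted list."""
    # Binary search for the first key >= prefix, then read forward sequentially.
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical keys in a column-family style store.
keys = sorted([
    "sensor:a:20240101",
    "sensor:a:20240102",
    "sensor:b:20240101",
    "user:1",
])
print(prefix_scan(keys, "sensor:a:"))
```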