Advantages of databases like BigTable and SimpleDB
New-school datastore paradigms like Google BigTable and Amazon SimpleDB are designed specifically for scalability, among other things. Basically, disallowing joins and denormalizing data are the ways this is accomplished.
In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive and that denormalization is "overrated" to some extent.
Why, then, do the aforementioned systems disallow joins and force everything into a single table to achieve scalability? Is it the sheer volume of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
5 Answers
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well with their recent announcement; Microsoft bought their solution in the form of the company formerly known as DataAllegro.)
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from an RDBMS, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
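To make that concrete, here is a minimal sketch (plain Python, with made-up users/orders data, not any particular product's API) of the same question answered two ways: a normalized layout that needs an application-side join versus a denormalized layout where a single key lookup returns everything.

    # Normalized: two "tables"; answering "orders for user 42" needs a join.
    users = {42: {"name": "orion"}, 43: {"name": "alice"}}
    orders = [
        {"order_id": 1, "user_id": 42, "total": 9.99},
        {"order_id": 2, "user_id": 43, "total": 4.50},
        {"order_id": 3, "user_id": 42, "total": 15.00},
    ]

    def orders_for_user_normalized(user_id):
        # Application-side join: scan the orders table, attach the user's name.
        user = users[user_id]
        return [{"name": user["name"], **o} for o in orders if o["user_id"] == user_id]

    # Denormalized: each user record embeds its orders, so the answer is one
    # key lookup -- no join, and the whole record can live on a single shard.
    users_denormalized = {
        42: {"name": "orion", "orders": [{"order_id": 1, "total": 9.99},
                                         {"order_id": 3, "total": 15.00}]},
        43: {"name": "alice", "orders": [{"order_id": 2, "total": 4.50}]},
    }

    def orders_for_user_denormalized(user_id):
        return users_denormalized[user_id]

    print(orders_for_user_normalized(42))
    print(orders_for_user_denormalized(42))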
Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean this fact should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
But if you are in the 100-terabyte range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
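As a rough illustration of that key-value style, here is a tiny sketch using Python's standard-library dbm module as a stand-in for a BerkeleyDB-style store (the key names and values are made up): everything reduces to put/get by key, which is exactly what these stores are good at.

    import dbm
    import json

    # Minimal persistent key-value store; keys and values are just bytes,
    # so structured values are serialized with JSON.
    with dbm.open("example.db", "c") as db:          # "c" = create if missing
        db[b"user:42"] = json.dumps({"name": "orion"}).encode()

    with dbm.open("example.db", "r") as db:
        print(json.loads(db[b"user:42"].decode()))   # {'name': 'orion'}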
I'm not too familiar with them (I've only read the same blogs/news/examples as everyone else), but my take on it is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability - I'll try to explain.
Imagine you have 200 rows in your data-table.
In Google's datacenter, 50 of these rows are stored on server A, 50 on server B, and 100 on server C. Additionally, server D contains redundant copies of the data from servers A and B, and server E contains a redundant copy of the data on server C.
(In real life I have no idea how many servers would be used, but it's set up to deal with many millions of rows, so I imagine quite a few).
To "select * where name = 'orion'", the infrastructure can fire that query at all the servers and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI, this is pretty much what MapReduce is).
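A toy sketch of that scatter/gather idea, assuming plain Python with in-memory lists standing in for servers A, B and C: the coordinator sends the same predicate to every partition and just concatenates the partial results.

    # Each "server" holds a slice of the rows; a query is evaluated on every
    # server independently and the partial results are merged.
    server_a = [{"name": "orion", "id": 1}, {"name": "alice", "id": 2}]
    server_b = [{"name": "bob", "id": 3}]
    server_c = [{"name": "orion", "id": 4}, {"name": "carol", "id": 5}]
    servers = [server_a, server_b, server_c]

    def select_where(predicate):
        partials = [[row for row in rows if predicate(row)]   # "map" on each server
                    for rows in servers]
        return [row for part in partials for row in part]     # merge / "reduce"

    print(select_where(lambda row: row["name"] == "orion"))
    # -> [{'name': 'orion', 'id': 1}, {'name': 'orion', 'id': 4}]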
This, however, means you need to make some tradeoffs.
If you needed to do a relational join on some data that was spread across, say, 5 servers, each of those servers would need to pull data from the others for each row. Try doing that when you have 2 million rows spread across 10 servers.
This leads to tradeoff #1 - No joins.
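To see why, here is a hypothetical sketch of what an application-side join over partitioned data looks like: every order row triggers a lookup that may land on any of the user servers, so the number of cross-server fetches grows with the number of rows (the server layout and the counter here are made up purely for illustration).

    # "orders" rows live on two servers, "users" rows on two others.
    # Joining order -> user means one remote lookup per order row.
    order_servers = [
        [{"order_id": 1, "user_id": 42}, {"order_id": 2, "user_id": 43}],
        [{"order_id": 3, "user_id": 42}],
    ]
    user_servers = [
        {42: {"name": "orion"}},
        {43: {"name": "alice"}},
    ]

    cross_server_fetches = 0

    def fetch_user(user_id):
        global cross_server_fetches
        for shard in user_servers:          # in reality: a network round trip
            if user_id in shard:
                cross_server_fetches += 1
                return shard[user_id]

    joined = [{**order, **fetch_user(order["user_id"])}
              for shard in order_servers for order in shard]

    print(joined)
    print("cross-server fetches:", cross_server_fetches)  # one per order row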
Also, depending on network latency, server load, etc., some of your data may get saved instantly, but some may take a second or two. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' is no longer acceptable.
This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
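A minimal sketch of that effect, again with made-up names and plain Python dicts standing in for nodes: writes hit the primary copy and reach the replica asynchronously, so a read served by the replica right after a write can return stale data.

    # The replica applies writes with a delay, so a read that hits it
    # immediately after a write sees the old value.
    primary = {"name": "orion"}
    replica = {"name": "orion"}
    pending_replication = []            # writes not yet applied to the replica

    def write(key, value):
        primary[key] = value
        pending_replication.append((key, value))   # shipped to the replica "later"

    def read_from_replica(key):
        return replica.get(key)

    def apply_replication():
        while pending_replication:
            key, value = pending_replication.pop(0)
            replica[key] = value

    write("name", "neo")
    print(read_from_replica("name"))    # 'orion' -- stale, write not visible yet
    apply_replication()
    print(read_from_replica("name"))    # 'neo'   -- eventually consistent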
I'm not sure what other tradeoffs there are, but off the top of my head those are the main two.
So what I'm getting is that the whole "denormalize, no joins" philosophy exists, not because joins themselves don't scale in large systems, but because they're practically impossible to implement in distributed databases.
This seems pretty reasonable when you're storing largely invariant data of a single type (like Google does). Am I on the right track here?
If you are talking about data that is virtually read-only, the rules change. Denormalisation is hardest in situations where data changes because the work required is increased and there are more problems with locking. If the data barely changes then denormalisation is not so much of a problem.
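A hypothetical sketch of why changing denormalized data is the expensive case: once a user's name has been copied into every one of their order records, a single logical change fans out into one write per copy (each of which would need locking in a real system), whereas read-only data never pays that cost.

    # Denormalized data is cheap to read but expensive to change:
    # the user's name is duplicated into every order record.
    orders = [
        {"order_id": 1, "user_id": 42, "user_name": "orion"},
        {"order_id": 2, "user_id": 42, "user_name": "orion"},
        {"order_id": 3, "user_id": 42, "user_name": "orion"},
    ]

    def rename_user(user_id, new_name):
        # One logical change becomes one write per duplicated copy.
        touched = 0
        for order in orders:
            if order["user_id"] == user_id:
                order["user_name"] = new_name
                touched += 1
        return touched

    print(rename_user(42, "neo"))  # 3 rows rewritten for a single name change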
Nowadays you need to look for a more interoperable environment for your databases. More often than not you need not only relational DBs like MySQL or MS SQL, but also big-data farms such as Hadoop, or non-relational DBs like MongoDB. In some cases all of those DBs will be used in one solution, so their performance must be as equal as possible at the macro scale. That means you won't be able to use, say, Azure SQL as the relational DB alongside a single VM with 2 cores and 3 GB of RAM for MongoDB. You have to scale up your solution and use Database-as-a-Service where possible (and if that is not possible, build your own cluster in the cloud).