Bigtable实际例子
有人可以提供一个真实世界的例子来说明如何在 Bigtable 中构建数据吗?请从搜索引擎、社交网络或任何其他熟悉的角度进行讨论,清楚、务实地说明该行 -> 是如何进行的。列族->列组合优于传统的规范化关系方法。
Can someone provide a real-world example of how data would be structured within a Bigtable? Please talk from a search engine, social networking or any other familiar point of view which illustrates clearly and pragmatically how the row -> column family -> column combo is superior to traditional normalized relational approaches.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
阅读原始 Google 白皮书很有帮助:
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf
Google 数据架构信息源的综合列表:
http://highscalability.com/google-architecture
更新: 2014 年 11 月 4 日
Google 白皮书 PDF 的新版本可在此处找到:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
Reading the original Google white paper was helpful:
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf
As was this comprehensive list of information sources on Google data architecture:
http://highscalability.com/google-architecture
Update: 11/4/14
A new version of the Google white paper PDF can be found here:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
我认为差异更多在于查询数据的方式而不是存储数据的方式。
关系数据库和 NoSQL 之间的主要区别在于,后者没有 SQL。
这意味着您(而不是查询优化器)自己编写查询计划。
如果您知道如何执行此操作,这可能会提高查询性能。
考虑一个典型的搜索引擎查询:查找包含所有(或部分)单词的前
10
页,例如“湿 T 恤竞赛”,按相关性排序(为了简单起见,我们将单词邻近性放在一边清酒)。为此,您需要将所有单词拆分并保存在按
(单词、相关性、来源)
排序的可搜索和可迭代列表中。然后,您将此列表划分为(3 *ranks)
组(每个组从搜索查询中给定排名的单词顶部开始),其中ranks
是可能的数字或排名,例如1
到10
;并加入source
上的集合,.在关系数据库中,它看起来像这样:
最好在按等级划分的三个集合上使用 MERGE JOIN 来执行。
然而,据我所知,没有一个优化器会构建这个计划(不考虑
relevance_formula
可能不会分布在各个排名上的事实)。要解决此问题,您应该实施自己的查询计划:从每个单词/排名对的顶部开始,同时降低所有三个集合,跳过缺失值并使用
search
而不是如果您觉得其中一组中有太多内容需要跳过,请继续下一步
。因此,关系方法为您提供了一种更方便的数据查询方法,但代价可能是性能损失。
如果您正在开发校园 Web 服务器,那么编写这些
SELECT *
就可以了,即使它们的执行时间比可能的时间长一微秒。但如果您正在开发 Google,则值得花一些时间来优化查询(纯关系系统只允许使用SQL
访问其数据,这是不允许的)。NoSQL 和关系数据库有时会相互扩散。例如,
Berkeley DB
是一种著名的NoSQL
存储引擎,MySQL
使用它作为其存储后端,以允许SQL< /code> 查询。反之亦然,HandlerSocket 允许对关系型
InnoDB
存储进行纯键值查询,并在其上构建MySQL
数据库。I believe the difference is more about the way the data are queried rather the way they are stored.
The main difference between relational databases and
NoSQL
is that there is, um, noSQL
in the latter.This means you (not the query optimizer) write the query plans yourself.
This may increase the query performance if you know how to do that.
Consider a typical search engine query: find top
10
pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity sake).To do this, you need all words split and kept in a searchable and iterable list ordered by
(word, relevance, source)
. Then you partition this list into(3 * ranks)
sets (each starting at the top of the words in your search query at a given rank), whereranks
is the possible number or ranks, say,1
to10
; and join the sets onsource
, .In a relational database it would look like this:
This is best executed using a
MERGE JOIN
over the three sets partitioned by rank.However, no optimizer I'm aware of will build this plan (leaving aside the fact that
relevance_formula
may not distribute over the individual ranks).To work around this, you should implement your own query plan: start at the top of each word/rank pair and just descend all three sets simultaneously, skipping the missing values and using
search
rather thennext
if you feel that there will be too much to skip in one of the sets.Thus said, relational approach gives you a more convenient way to query data at cost of possible performance penalty.
If you are developing a campus web server, then writing those
SELECT *
is OK even they are executed one microsecond longer than they possibly could be. But if you're developing a Google, it worth spending some time on optimizing the queries (which pure relational systems only allowing access to their data usingSQL
just would not let to do).The such called
NoSQL
and relational databases sometimes diffuse into each other. For instance,Berkeley DB
is a well-knownNoSQL
storage engine which was used byMySQL
as its storage backend to allowSQL
queries. And vice versa,HandlerSocket
allows pure key-value queries to a relationalInnoDB
store with aMySQL
database built over it.